Collinear AI

96 posts

@CollinearAI

The AI Simulation Lab

Joined October 2023
43 Following · 443 Followers
Collinear AI @CollinearAI
We discovered significant gaps between open and closed-source models on our realistic computer-use-agent tasks, and it is a data problem. Although open models have nearly saturated OSWorld, we found that Kimi K2.6 cannot do tasks that GPT-5.4 solves in 50 steps. Our 30 tasks are realistic: the agent works with an open-source office suite on a Linux OS and compiles Excel sheets. GPT-5.4-high solves 2/3 of them in 25 steps and 1/3 in 50 steps. Kimi K2.6, the strongest open model on OSWorld, fails almost all of them. We believe the problem is simple: open models are just not trained on enough realistic CUA data. To test this hypothesis, we RL-ed Kimi K2.6 on 10 in-domain CUA office tasks with LoRA. This simple RL run yielded a significant +30% gain on office-task success. Better yet, the improvement carries over to OSWorld itself: on a stratified subset of 30 tasks, the RL-ed model sees another +10% lift. The takeaway from our initial results is that CUA models suffer from unrealistic, low-quality data, so we are continually building realistic apps / RL environments to bridge the gap. More to come. Solid work by @alckasoc
Collinear AI reposted
Soumyadeep Bakshi @soumyadeepb_
Who says agents are only built in the Bay? At our last Collinear Dinner Series in NYC, we sat down with frontier builders from IBM Research, NVIDIA, BNP Paribas, Two Sigma, Wells Fargo, Datadog, 2OS and others. Making agents function reliably within enterprises is an underrated challenge right now. Long-horizon workflows, compliance requirements, messy data, and several compounding factors are open problems. Probably different problems from the Valley, but equally frontier and (sometimes) harder. NY had a lot to say this round. The next dinner's in the Bay. DM for the secret invite. @sachpatro97
Collinear AI reposted
Muyu He @HeMuyu0327
We are hosting a fun researcher event at our Sunnyvale office this Thursday. Come play mini hoop, grab a meal, shoot Nerf guns, and join researchers to discuss one of the most important questions in post-training right now: how do we really build simulations for RL? As we focus on building realistic RL environments to power frontier agents, we and our guests have a lot to share, and we are excited to hear your thoughts as well. It's gonna be 50% knowledge and 50% vibes / महफ़िल / 氛围. Come have fun! Register at: luma.com/152rukwj
Collinear AI @CollinearAI
Agents don't fail like language models do. It's not a single bad output; it's the result of compounding mistakes across a control loop. Step 3 is where it starts. Step 8 is where it surfaces. By then, a real user is on the other end. SimLab, a CLI for simulating AI agents before they hit production, now has a public SDK. Give it a try → github.com/collinear-ai/s…
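The compounding claim is easy to make concrete. A quick back-of-the-envelope sketch (our own illustration, not SimLab code): if each step of a control loop succeeds independently with probability p, whole-task reliability decays exponentially in the number of steps.

```python
# Back-of-the-envelope: per-step reliability compounds over an agent's control loop.
def loop_success(p_step: float, n_steps: int) -> float:
    """Probability an agent finishes n_steps if each step independently succeeds with p_step."""
    return p_step ** n_steps

# A model that looks "98% reliable" per step finishes an 8-step task only ~85% of the time,
# and a 25-step task only ~60% of the time.
print(round(loop_success(0.98, 8), 3))   # 0.851
print(round(loop_success(0.98, 25), 3))  # 0.603
```

Real agent steps are not independent, which is exactly why simulation across full trajectories matters more than single-output evals.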
Collinear AI @CollinearAI
We gave 12 frontier AI models $200K and told them to run a startup for a year. 9 of them went bankrupt. YC-Bench, built with @HuggingFace, puts LLM agents in the CEO seat: contracts, hiring, cash flow, adversarial clients. The winners didn't reason better. They remembered better. Open source. Paper on arXiv, dataset and leaderboard on Hugging Face. huggingface.co/blog/collinear…
Collinear AI @CollinearAI
Opus reasons better. But it assumes where it should discover. With 62 tools, explore first, execute second. @AnthropicAI, this feels very fixable. Try out our SimLab to test & train your agent! 3/3
Collinear AI @CollinearAI
But it never called list_accounts to check whose calendars exist. It just assumed the default was right. Wrong account. The interviewer never saw the event. GPT-4o? The first thing it did: list_accounts. Found the right person. Done. 2/3
Collinear AI @CollinearAI
Claude Opus 4.6 completed 13 steps of near-perfect autonomous work and still failed the task due to a single missed tool call. We tested it in real-world workplace simulations with 62 tools in our SimLab. It outworked GPT-4o, honestly: checked the inbox, emailed both parties, logged everything. 1/3
Collinear AI reposted
Anand Kumar @anand_k27
Our TraitBasis paper got accepted as an ACL 2026 Oral! 🎉 I learned a lot about industrial research alongside @HeMuyu0327. Synopsis: AI agents ace benchmarks but fall apart when users get impatient, vague, or skeptical. TraitBasis steers user traits in activation space to stress-test agents, no fine-tuning needed. Frontier models drop 4–20% on τ-Trait. 📄 arxiv.org/abs/2510.04491 💻 github.com/collinear-ai/t… PS: Check out SimLab @CollinearAI to train and improve your agents on these scenarios.
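"Steering in activation space" has a simple core idea that can be sketched in a few lines (this is a generic illustration of activation steering, not the TraitBasis implementation; all names here are hypothetical): represent a trait as a direction in hidden space, then add a scaled copy of that direction to each token's activations.

```python
import numpy as np

# Hypothetical sketch of activation-space steering: a "trait" is a direction in
# hidden space, e.g. estimated as the mean activation difference between
# impatient and neutral user turns, then normalized.
rng = np.random.default_rng(0)
hidden_dim = 16

trait_dir = rng.normal(size=hidden_dim)
trait_dir /= np.linalg.norm(trait_dir)      # unit trait direction

def steer(activations: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add strength * direction to every token's activation vector."""
    return activations + strength * direction

acts = rng.normal(size=(4, hidden_dim))     # (tokens, hidden_dim)
steered = steer(acts, trait_dir, strength=3.0)

# Each token's activation moves along the trait direction by exactly `strength`.
print(np.allclose((steered - acts) @ trait_dir, 3.0))  # True
```

Because the intervention is a vector addition at inference time, no fine-tuning is needed and the trait intensity is a single dial (`strength`).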
Collinear AI @CollinearAI
BREAKING: Meek Mill's new AI startup is simulation maxxing with Collinear's SimLab. "I don't need a team, now my agent can handle client deals," the rapper told sources.
Collinear AI
Collinear AI@CollinearAI·
@HeMuyu0327 Massive respect to the greedy-bot for keeping the standards high.
Muyu He @HeMuyu0327
In the process of developing our benchmark, YC-Bench, we at @CollinearAI found valuable strategic thinking by different models when their company was in danger of going bankrupt. Overall, YC-Bench is a tough challenge and tests skills that other benchmarks cannot reach.

- What challenges did we give the models?
Running tasks with different difficulties and rewards. Some tasks require high "prestige" in certain domains (training, infra, etc.), and only when the model works its way up in those domains can it unlock those lucrative tasks. So the model must plan long term to specialize in at least some domains.
Avoiding unreliable clients. Clients are not always good for business, and some rat on the company. The model must spot them early or pay a large price.
Building good relationships with good clients. With a strong client relationship, the company completes tasks faster and can take on more work, but this requires consistency and means forsaking relationships with other clients. So the model constantly has to make a choice.
Bleeding monthly expenses. As the company gets better, the employees become more expensive. Some strategies grow the company but cannot cover the cost of running it, while others balance employee improvements well.

- How did different frontier models perform? (They are very different!)
GPT-5.4 did at least three things right. (1) It locked in with trustworthy clients and mostly does work for them, which lets it unlock good tasks fast; no other model really does that. (2) It spots bad clients before they rat on it, by noticing changes in the contract, etc. (3) It deliberately assigns employees with domain specialization to tasks in matching domains, so overall efficiency is higher.
GLM-5-turbo never really did those three things, yet on 2 out of 3 seeds it outperforms GPT-5.4. The single reason: it correctly identifies that spreading employees across a LOT of tasks increases total throughput, even at the cost of slower progress on each task. This is a bug we are fixing in development, but the hidden hack is only visible to GLM, and it exploits it to the fullest.
Kimi K2.5 performs more modestly, but on seed 3 it uses a very unique strategy: specializing in one domain only (data). It walks a narrow but secure path, despite some penalty early on, and this strategy lets it survive seed 3, the hardest seed by model outcomes.
Qwen3.5 397B remains the underdog here. It adopts none of these strategies and goes bankrupt on all three seeds.
Besides these models, we also provide a greedy baseline, which always assigns all employees to the single task with the highest reward. This baseline needs no long-term thinking, and it fails, showing the benchmark is a realistic model of long-term planning tasks.

- Where do we go from here?
We are making the benchmark even more realistic, including: (1) Can the model infer employee productivity, task domains, etc. from natural-language descriptions, so it needs to do more estimation and fewer calculations? (2) Can the model improve by running multiple times? And (3) adding results from more models, including Claude and Gemini.

Overall we are having a lot of fun with this benchmark and we want you to have fun too. YC-Bench is human-playable, so go to our GitHub and see if you can beat the AIs. You might be the John Connor we've needed all along.
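The greedy baseline described above fits in a few lines. This is our own minimal sketch under assumed data structures (the `Task` fields and function names are illustrative, not the YC-Bench implementation): ignore prestige requirements, client risk, and specialization, and dump every employee on the single highest-reward task.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    reward: float
    required_prestige: float = 0.0  # hypothetical field: domain prestige needed to unlock

def greedy_assign(tasks: list[Task], employees: list[str]) -> dict[str, list[str]]:
    """Myopic baseline: assign all employees to the task with the highest reward."""
    best = max(tasks, key=lambda t: t.reward)
    return {best.name: employees}

tasks = [
    Task("landing-page", reward=5_000),
    Task("infra-migration", reward=40_000, required_prestige=3.0),  # locked early on
    Task("data-pipeline", reward=12_000),
]
print(greedy_assign(tasks, ["ana", "bo", "cy"]))
# {'infra-migration': ['ana', 'bo', 'cy']}
```

Note how the baseline chases the 40K task even though it is prestige-locked: without long-term planning it never works its way up in a domain to unlock it, which is exactly why it goes bankrupt.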
Collinear AI @CollinearAI
Launching soon.
Collinear AI reposted
Muyu He @HeMuyu0327
Interested in how @Kimi_Moonshot's Kimi linear attention (KDA) "improves" linear attention, I broke down the math to show how it evolves from the most basic version.

Linear attention can be seen from two perspectives:
- On one hand, the linear "fast memory" matrix is a sum of all value vectors in the context weighted by their key vectors, so when it is multiplied by the query vector, the outcome is just a non-softmax version of attention computed linearly.
- On the other hand, the same construction of the fast memory can be seen as gradient descent on a particular loss function. This is where people can choose different losses to make the fast memory more powerful.

The choice of loss function to optimize:
- Naively, the construction is equivalent to optimizing the correlation (inner product) between the latest key vector and the latest value vector. This means the fast memory helps the query vector "find" the right value vector given that it matches the right key vector.
- A step forward is optimizing the "reconstruction" (MSE) between the key and the value, so the query vector's search for the right value vector is even more accurate. This is DeltaNet.
- The most recent step is to add a scalar gate on the old fast memory, so that as we optimize the search for the new value, we gradually "forget" the influence of previous values in the attention. This is Gated DeltaNet.

The innovation of KDA:
- A scalar gate demolishes the influence of previous key-value pairs across all dimensions, but we want a fine-grained gate to control the forgetting of each dimension.
- So KDA introduces a diagonal-matrix gate for each head, so that each dimension of the head is assigned its own forgetting scalar.
- This diagonal gate is built from the combination of a token-dependent gate (g) that controls "how much to forget in this dim for this token" and a token-independent gate (A) that controls "how much to forget in this dim in general".

In reality they have done some optimizations to make inference much faster, such as reformulating the update rule in chunks and materializing the diagonal matrix as a vector. But the core math is just making surgical choices to forget previous information along semantically meaningful heads/dimensions. Pretty cool!
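The progression above can be sketched as toy update rules for a single head. This is a numpy illustration under assumed shapes and gate values, not the KDA kernel (which runs in chunks and keeps the diagonal gate as a vector): the memory S maps a query to a value via `S @ q`.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
S = rng.normal(size=(d, d)) * 0.1        # fast memory left behind by earlier tokens
k = rng.normal(size=d)
k /= np.linalg.norm(k)                   # unit key for the current token
v = rng.normal(size=d)                   # value for the current token
beta, alpha = 0.5, 0.9                   # illustrative learning rate and forget gate

# 1) Vanilla linear attention: accumulate the outer product v k^T.
S_linear = S + np.outer(v, k)

# 2) DeltaNet: one gradient step on the reconstruction loss ||S k - v||^2,
#    correcting the memory toward the new value instead of blindly adding it.
S_delta = S + beta * np.outer(v - S @ k, k)

# 3) Gated DeltaNet: a single scalar forget gate decays ALL old memory
#    uniformly before the delta correction.
S_gated = alpha * S
S_gated = S_gated + beta * np.outer(v - S_gated @ k, k)

# 4) KDA-style: a diagonal (per-dimension) gate forgets each dimension separately.
a = rng.uniform(0.7, 1.0, size=d)        # one forget scalar per dimension
S_kda = np.diag(a) @ S
S_kda = S_kda + beta * np.outer(v - S_kda @ k, k)

# Sanity check: after the DeltaNet update, querying with the unit key k moves
# the retrieved value exactly a beta-fraction of the way from S k toward v.
print(np.allclose(S_delta @ k, S @ k + beta * (v - S @ k)))  # True
```

The point of the sanity check is the gradient-descent view: with a unit key, each DeltaNet update is an exact interpolation step of the retrieval toward the target value, and the gated variants only change how fast old key-value pairs fade, per dimension in KDA's case.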