Theta
@trytheta

16 posts

Specialized AI for Every Job
Joined May 2025
3 Following · 788 Followers

Pinned Tweet
Theta @trytheta
Introducing CUB: Humanity's Last Exam for Computer and Browser Use Agents
[image]
32 replies · 39 reposts · 251 likes · 113.3K views
Theta @trytheta
Why Now? (4/4) AI-first browsers are poised to disrupt the massive web browser market, with highly anticipated releases like Comet from @perplexity_ai on the way. It remains to be seen how Google will integrate Project Mariner and other AI tools into Chrome.
1 reply · 1 repost · 16 likes · 2.5K views
Theta @trytheta
Browser agents use computers the same way humans do, unlocking powerful use cases for personal assistants, browsers, and enterprise workflows. After talking to 20+ founders in the space, we're excited to put out the definitive market map for browser agents.
[image]
28 replies · 86 reposts · 586 likes · 102.8K views
Theta @trytheta
The Theta team started CUB as an internal eval set, but over the past month it grew into a full-fledged benchmark. We're excited to test even more models and frameworks. For more on the benchmark, including examples and a full paper, check out our blog: thetasoftware.ai/#/blog/introdu…
1 reply · 0 reposts · 19 likes · 2.2K views
Theta @trytheta
Computer/browser-use agents still have a long way to go on more complex, end-to-end workflows. Actual task completion is far below our reported numbers: we gave credit for partially correct solutions and for reaching key checkpoints. In total, there were fewer than 10 instances across our thousands of runs where an agent successfully completed a full task.
1 reply · 0 reposts · 18 likes · 2.9K views
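The checkpoint-style partial credit described in the tweet can be sketched roughly as below. This is an illustrative reconstruction, not CUB's actual rubric: the checkpoint names, weights, and the `partial_credit` helper are all assumptions.

```python
# Hypothetical sketch of checkpoint-based partial credit for an agent run.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str      # human-readable description of the milestone
    weight: float  # relative importance in the final score
    reached: bool  # did the agent hit this milestone in the run?

def partial_credit(checkpoints):
    """Score a run as the weighted fraction of checkpoints reached."""
    total = sum(c.weight for c in checkpoints)
    earned = sum(c.weight for c in checkpoints if c.reached)
    return earned / total if total else 0.0

# Example run: the agent gets partway through a multi-step browser task.
run = [
    Checkpoint("opened target page", 1.0, True),
    Checkpoint("filled form fields", 2.0, True),
    Checkpoint("submitted and verified result", 3.0, False),
]
print(partial_credit(run))  # 0.5
```

Under a scheme like this, a run can score well above zero while never finishing the end-to-end task, which is consistent with the gap the tweet describes between reported numbers and full task completion.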
Theta reposted

Gurvir Singh @_gurvir_
We've been misled to believe that manual prompt hacking is the solution to teaching LLMs how to approach complex problems. Why write a "magic prompt" to pattern-match for every type of problem you might care about, when LLMs have already shown an extraordinary ability to self-review and self-correct given the right feedback loops?

@karpathy alludes to it here, but what's missing is a memory layer so that LLMs can learn from their previous mistakes. They suffer from amnesia because they lack a mechanism to record and build upon problem-solving strategies. A memory layer enables this "system prompt learning" without relying on explicit human feedback.

There are a lot of engineering challenges in getting this to work effectively. How do you measure which insights are effective, and how do you refine them from feedback? Building a "scratchpad" of notes that can be maintained over thousands of runs and indexed efficiently to retrieve the right notes is a non-trivial problem, and it's exactly what we're tackling at @trytheta.
Andrej Karpathy @karpathy

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it; possibly it already has a name: system prompt learning?

Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters, but a lot of human learning feels more like a change in system prompt. You encounter a problem, figure something out, then "remember" something in fairly explicit terms for the next time. E.g. "It seems when I encounter this and that kind of a problem, I should try this and that kind of an approach/solution." It feels more like taking notes for yourself, i.e. something like the "Memory" feature, but used not to store per-user random facts, but rather general/global problem-solving knowledge and strategies. LLMs are quite literally like the guy in Memento, except we haven't given them their scratchpad yet.

Note that this paradigm is also significantly more powerful and data-efficient, because a knowledge-guided "review" stage is a significantly higher-dimensional feedback channel than a reward scalar.

I was prompted to jot down this shower of thoughts after reading through Claude's system prompt, which currently seems to be around 17,000 words, specifying not just basic behavior style/preferences (e.g. refuse various requests related to song lyrics) but also a large amount of general problem-solving strategies, e.g.: "If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step." This is to help Claude count the 'r's in "strawberry" etc. Imo this is not the kind of problem-solving knowledge that should be baked into weights via reinforcement learning, or at least not immediately/exclusively. And it certainly shouldn't come from human engineers writing system prompts by hand.

It should come from system prompt learning, which resembles RL in the setup, with the exception of the learning algorithm (edits vs. gradient descent). A large section of the LLM system prompt could be written via system prompt learning; it would look a bit like the LLM writing a book for itself on how to solve problems. If this works, it would be a new and powerful learning paradigm, with a lot of details left to figure out (how do the edits work? can/should you learn the edit system? how do you gradually move knowledge from explicit system text to habitual weights, as humans seem to do? etc.).
4 replies · 4 reposts · 29 likes · 3.1K views
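The note-taking loop these two tweets describe (record an insight, see whether it helps on later runs, keep or demote it) can be sketched minimally as below. This is a hypothetical illustration of the idea, not Theta's implementation or Karpathy's proposal in full: the `Scratchpad` class, the integer scoring, and the keep/demote threshold are all assumptions.

```python
# Minimal sketch of a "system prompt learning" memory layer: a scratchpad
# of problem-solving notes that gets rendered into the system prompt and
# refined by feedback from later runs.
class Scratchpad:
    def __init__(self):
        # Each note is an (insight, score) pair; score tracks how often
        # the note appeared to help.
        self.notes = []

    def add(self, insight: str) -> None:
        """Record a new strategy with a neutral starting score."""
        self.notes.append((insight, 0))

    def feedback(self, insight: str, delta: int) -> None:
        """Reinforce (+) or demote (-) a note after observing a run."""
        self.notes = [(n, s + delta if n == insight else s)
                      for n, s in self.notes]

    def render(self) -> str:
        """Text to append to the system prompt: only non-demoted notes."""
        return "\n".join(n for n, s in self.notes if s >= 0)

pad = Scratchpad()
pad.add("For counting tasks, enumerate items explicitly before answering.")
pad.feedback("For counting tasks, enumerate items explicitly before answering.", +1)
print(pad.render())
```

The hard parts Gurvir mentions live outside this toy: attributing run outcomes to individual notes, and indexing thousands of notes so only the relevant ones are retrieved into the prompt.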