Quarq

64 posts


@quarqlabs

Your Personal Agent with Continual Learning.

San Francisco, CA · Joined March 2026
0 Following · 468 Followers
Quarq retweeted
vaibhav
vaibhav@0xVK__·
Human brains already know a lot about the universe when they are born; they don't start from zero. Pre-training + RL is somewhat analogous to how evolution trains our brains. What's missing in AI agents is what happens after that: the ability to quickly learn, adapt, and improve.

Humans are highly sample efficient. Fine-tuning is also relatively sample efficient compared to pre-training, but still nowhere close to the efficiency of our brains. Context windows do match the sample efficiency of the brain: models can learn how to respond in a few shots. However, context windows don't scale. After a week of accumulating context, it becomes too large, and context rot starts degrading performance.

I think of continual learning as either a completely different learning method (apart from pre-training, RL, fine-tuning, or in-context learning), or a combination of these orchestrated in a sophisticated way. Our initial prototype at @quarqlabs is based on the second approach. One way or another, we will solve continual learning!
Quarq@quarqlabs


Quarq retweeted
samyak
samyak@smykx·
continual learning has always been an interesting topic. as it travelled from traditional ML to applied AI, some of its notions changed while others remain. but one thing is sure: we need agents that learn with us, whether through weight updates or through context management. this post discusses both of these paradigms and shares some alpha from research papers and product labs. happy reading!
Quarq@quarqlabs


Quarq
Quarq@quarqlabs·
Modern AI agents must learn continually from new tasks and data. Unlike static models trained once on fixed data, agents deployed in the real world encounter evolving environments. Without continual learning, an agent's knowledge becomes stale and counterproductive.

The biggest gap between AI agents and human intelligence is the ability to learn. Humans continually learn and improve over time, acquire new skills, and correct their past mistakes. In contrast, most AI agents have an incredible amount of world knowledge but do not meaningfully get better over time.

Research has converged on a critical insight: there are fundamentally two ways an AI agent can learn, in-weights and in-context. Traditionally, the concept of "continual learning" for neural networks has been synonymous with weight updates. But a modern LLM agent is defined not just by model weights θ, but by the pair (θ, C), where C is the context window. This opens up a second axis: rather than updating weights, we can update the tokens that condition the model's behavior, what Letta calls "learning in token space."

Modern AI agents suffer from a fundamental identity problem: when context windows overflow and conversation histories are summarized, agents often stray from their original tasks. A vivid illustration of the failure in practice: users don't describe gradual forgetting but rather a sharp discontinuity: "We just discussed this. You built that feature. Why are you asking me again?" The agent before and after context compaction presents as two different entities: one informed, one naive.

There are a few broad ways people are trying to make agents learn and adapt over time, and each comes with its own tradeoffs. The first is the traditional route: updating model weights through gradient-based learning. This includes fine-tuning and reinforcement learning. It works, but it's expensive and data-hungry. Frequent updates add significant compute cost, and there's a well-known issue of catastrophic forgetting, where improving one capability can quietly degrade others. This makes it hard to maintain stable performance as the system evolves.

The second approach operates entirely in token space, using in-context learning. Instead of relying on weight updates, the agent learns by managing what it sees in its context window. The idea is to actively curate the context and refine the memory.

The third approach is newer and avoids gradients altogether. These gradient-free methods shift learning to test time without updating model weights. A 2025 paper by Google Research discusses Nested Learning: it treats a model as a stack of learning problems nested inside each other, each operating at a different time-scale, rather than viewing a model as one learning process.

In short, "true" continual learning, where a personal AI agent genuinely improves from experience without forgetting and without privacy leakage, remains an open research problem. Many research directions are being explored: some claim the most promising near-term path is gradient-free, memory-centric architectures that operate in token space; some believe in neuroscience-inspired multi-tier memory systems; meanwhile, weight-based fine-tuning remains powerful but fragile and expensive for real-time personal use. This is an interesting space to watch, but one thing is very clear: continual learning remains non-negotiable.
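To make the token-space idea concrete, here is a minimal sketch in Python. Everything below is an illustration of the pattern only, not Quarq's implementation; call_llm, the prompt wording, and the 500-word cap are hypothetical stand-ins.

# Minimal sketch of "learning in token space": the weights θ stay frozen
# behind an API; the agent improves by rewriting C, the tokens that
# condition it. Illustrative only; all names and prompts are hypothetical.

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API (frozen weights θ)."""
    raise NotImplementedError

class TokenSpaceAgent:
    def __init__(self) -> None:
        self.notes = ""  # C: persistent notes that survive across sessions

    def respond(self, user_msg: str) -> str:
        # The agent is the pair (θ, C): frozen model + curated context.
        prompt = (f"Persistent notes about this user:\n{self.notes}\n\n"
                  f"User: {user_msg}\nAssistant:")
        return call_llm(prompt)

    def learn(self, user_msg: str, reply: str) -> None:
        # The "update step" is a bounded rewrite of C, not a gradient step
        # on θ: distill what is worth keeping, drop what went stale.
        self.notes = call_llm(
            "Rewrite these notes to fold in the new exchange. Keep facts, "
            "preferences, and corrections; stay under 500 words.\n\n"
            f"Notes:\n{self.notes}\n\nExchange:\n"
            f"User: {user_msg}\nAssistant: {reply}"
        )

Because each update is a bounded rewrite rather than unbounded accumulation, the curated context never grows into the compaction cliff described above.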
We're huge on continual learning and are working toward an agent that learns and grows with you to be a reliable assistant. At Quarq Labs (@quarqlabs), we're building a personal agent where the philosophy of continual learning sits at the centre. The goal is straightforward: an agent that works out of the box, without requiring users to assemble infrastructure around it. We'll be opening an early beta soon. If you want to see how this approach performs in practice, including benchmark results, you can join the waitlist: quarq.io/#waitlist
Quarq retweeted
vaibhav
vaibhav@0xVK__·
LongMemEval is an interesting one, and probably the most well-known benchmark for evaluating an agentic system's memory across sessions. Memory is a core part of "continual learning." It tests whether an agent can retain facts, user preferences, and even implied context across multiple chats or sessions. The other part is applying those memories to perform the task, which is not covered by LongMemEval. If memory is solved properly, it becomes much easier to retrieve the right information and use it effectively to perform tasks aligned with the user's query.

We're seeing strong results on LongMemEval with @quarqlabs agents. We're running the evals one last time to be 100% sure before sharing the results publicly. It's also important for us that anyone can verify our results and replicate the performance on their end. We'll be open-sourcing a verification framework to ensure full transparency. Looking forward!
Quarq@quarqlabs

x.com/i/article/2049…
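For context on what "running the evals" involves: a LongMemEval-style harness replays multi-session chat histories and then asks questions whose answers were buried in earlier sessions. A simplified sketch follows; the record fields and the agent/judge interfaces are illustrative, not the benchmark's actual schema.

# Simplified sketch of a LongMemEval-style evaluation loop.
# Record fields and the agent/judge interfaces are illustrative,
# not the benchmark's actual schema.

def evaluate(agent, dataset, judge) -> float:
    correct = 0
    for record in dataset:
        agent.reset()  # fresh memory per benchmark instance
        # Replay several chat sessions so facts must survive across them.
        for session in record["sessions"]:
            for turn in session:
                agent.observe(turn)  # the agent may write to memory here
        # Ask a question whose answer appeared in an earlier session.
        answer = agent.answer(record["question"])
        # Score with an LLM judge or exact match against the gold answer.
        correct += int(judge(answer, record["expected_answer"]))
    return correct / len(dataset)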

Quarq retweeted
samyak
samyak@smykx·
tonnes of memory startups use longmemeval as their memory benchmark. this short article is a discussion around how good longmemeval really is and whether we should be using it to test agents.
Quarq@quarqlabs

x.com/i/article/2049…

Quarq retweeted
sourav bera
sourav bera@Sourav_Bera_·
We’re getting very close. Quarq is almost ready. This is going to feel different. Soon. @quarqlabs
Quarq retweeted
vaibhav
vaibhav@0xVK__·
We've been digging into agent benchmarks to understand what they actually measure. We'll be sharing our insights. LongCoT: it evaluates long, multi-step reasoning on hard problems, and does it well. But fundamentally, it measures single-shot reasoning: how reliably a model can solve an individual problem in one go. No memory. No adaptation. No learning across sessions. That's useful, but incomplete. Real agents don't just reason once. They operate over time. They learn, adapt, and improve. LongCoT, coupled with other benchmarks (which we'll talk about), is the proper way of judging agents.
Quarq@quarqlabs

x.com/i/article/2049…

Quarq retweeted
vaibhav
vaibhav@0xVK__·
We have a multi-layered memory stack that combines vector and tag-based storage with reasoning-driven retrieval. We also classify memory into semantic, episodic, and procedural layers. Our core thesis is that continual learning and personalization will be solved by training per-user SLMs for smarter orchestration in the harness. This version is a starting point; the core differentiation is the reasoning-driven retrieval, and the results are crazy, which we'll be announcing probably tomorrow! We talked more about our thesis here: x.com/0xVK__/status/…
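A toy sketch of what such a layered store could look like; the class names, scoring rule, and blend weight are hypothetical illustrations, not Quarq's production architecture.

# Toy sketch of a multi-layer memory stack combining vector and tag-based
# lookup. All names and the scoring rule are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    text: str
    tags: set
    embedding: list  # vector from any embedding model

@dataclass
class MemoryLayer:
    items: list = field(default_factory=list)

    def search(self, query_vec, query_tags, k=5):
        def score(item):
            # Blend vector similarity (dot product) with tag overlap.
            sim = sum(a * b for a, b in zip(item.embedding, query_vec))
            return sim + 0.5 * len(item.tags & query_tags)
        return sorted(self.items, key=score, reverse=True)[:k]

class MemoryStack:
    def __init__(self):
        self.semantic = MemoryLayer()    # stable facts and preferences
        self.episodic = MemoryLayer()    # time-stamped events and sessions
        self.procedural = MemoryLayer()  # learned routines and how-tos

    def retrieve(self, query_vec, query_tags):
        # Reasoning-driven retrieval would let an LLM choose layers and
        # reformulate the query; this sketch simply queries all three.
        return {name: layer.search(query_vec, query_tags)
                for name, layer in vars(self).items()}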
Quarq retweeted
Rithvik
Rithvik@BngRithvik·
Has high potential to beat all the competitors 👀
vaibhav@0xVK__

@quarqlabs agents have come to life internally! living systems that remember, adapt, and improve over time. beta soon 👀

Quarq retweeted
vaibhav
vaibhav@0xVK__·
@quarqlabs agents have come to life internally! living systems that remember, adapt, and improve over time. beta soon 👀
Quarq retweeted
samyak
samyak@smykx·
@a1zhang and @raw_works's experiments have been super fun to follow. in this article i tried to cover their work around LongCoT benchmarking, and also threw some light on alex's "mismanaged geniuses hypothesis". i tried to keep it easy to follow. hope everyone enjoys the read :)
Quarq@quarqlabs


Quarq retweeted
vaibhav
vaibhav@0xVK__·
It's not intuitive to most people that the current ceiling isn't model intelligence, it's how well we utilize that intelligence. Everything we're seeing:
- harness engineering
- context engineering
- planner nodes
- RLMs
- etc.
is converging on one idea: better orchestration.

Models today are "Mismanaged Geniuses": "These frontier models already have the raw capability for hard task decomposition. The bottleneck isn't intelligence; it's task management."
Quarq@quarqlabs


Quarq
Quarq@quarqlabs·
I think it's obvious, but we still wanted to call attention to the fact that these results are not meant to be added to the LongCoT leaderboard. These benchmarks were run to highlight RLMs' capabilities: models are intelligent enough to complete these tasks and have been held back by scaffolding! you can read @a1zhang's complete piece here: x.com/a1zhang/status…
Quarq
Quarq@quarqlabs·
Two weeks ago, @raw_works published an announcement about hitting state-of-the-art on LongCoT: a relatively small model like Qwen3.5-9B beat GPT-5.2 on a long-horizon reasoning benchmark by over 60%, using the right scaffold. That raises a question: is true intelligence just locked behind the right scaffolding?

First, What Is LongCoT, and Why Does It Matter?

LongCoT is a benchmark for difficult reasoning problems. It is specifically designed to measure whether models can sustain coherent reasoning over extremely long horizons. The tasks span mathematics, chemistry, computer science, chess, and logic, where each individual reasoning step is usually within the capability of frontier models. The difficulty comes from maintaining correctness across a massive graph of interdependent steps that can stretch across tens to hundreds of thousands of reasoning tokens. These tasks break most models and act as a real test of complex task-solving abilities.

Let's talk about what @a1zhang (MIT CSAIL) published recently. Using a refined prompting setup within the RLM harness, they pushed performance on LongCoT-mini from 38.7% to 65.6%: a nearly 2x improvement on one of the hardest compositional reasoning benchmarks out there, just from better scaffold design. Earlier results with dspy.RLM on Claude Sonnet 4.5 showed a jump from roughly 13% to 45.4% overall. Specific categories like Dungeon, Packaging, Hanoi, Sudoku, and Wizards went from near-zero to perfect scores. Chess hit 85 out of 100.

Then there's @raw_works's result: Qwen3.5-9B paired with dspy.RLM achieved 15.69% on LongCoT-Full, compared to GPT-5.2's 9.83%. That is a 9-billion-parameter open model beating one of the most capable frontier models available, by a meaningful margin, on a hard benchmark. The 27B variant ranked highly on the mini split too, beating models many times its size.

It's Not Just LongCoT.

This same pattern is showing up across benchmark categories. On LongMemEval, dspy.RLM variants are consistently hitting 87–89.8% accuracy. A model like Gemini 3 Flash paired with dspy.RLM and observational memory reached 89.8% at roughly $0.035 per query. That's approaching dedicated memory systems like Mastra (~95%) and Vectorize Hindsight (~91%), without any specialized memory architecture. On multi-hop reasoning tasks and large-context aggregation problems, where you're slicing through 10 million+ tokens and need to pull out specific signals, RLMs are outperforming both vanilla long-context models and traditional RAG setups.

The Takeaway

@a1zhang's "Mismanaged Geniuses Hypothesis" is very apt here. These frontier models already have the raw capability for hard task decomposition. The bottleneck isn't intelligence; it's task management. Standard prompting essentially hands a genius a disorganized to-do list and wonders why they underperform. RLMs fix this by giving the model a recursive execution environment: a shared REPL state, typed inputs and outputs via DSPy signatures, and structured delegation. The models we already have are more capable than our current interfaces allow them to be. RLMs, and DSPy's implementation in particular, are surfacing that latent capability at scale. It will be interesting to watch this space and see how far RLMs take us.

These are the sources which will allow you to go deeper: @a1zhang, @raw_works, alexzhang13.github.io
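To make the mechanics concrete: the pattern underneath an RLM is recursive decomposition over a shared state. The sketch below is a schematic of that idea only; it is not DSPy's actual dspy.RLM API, and call_llm and the "SUB:" protocol are hypothetical.

# Schematic of the recursive-LM pattern: decompose a task, delegate
# subtasks to fresh model calls that read/write a shared state, then
# compose the results. Not DSPy's actual dspy.RLM implementation.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any frozen chat model

def rlm_solve(task: str, state: dict, depth: int = 0, max_depth: int = 3) -> str:
    if depth >= max_depth:
        return call_llm(f"Solve directly:\n{task}\nShared state: {state}")
    plan = call_llm(
        "Answer the task directly, or list subtasks one per line "
        f"prefixed with 'SUB:'.\nTask: {task}\nShared state: {state}"
    )
    if "SUB:" not in plan:
        return plan
    # Structured delegation: each subtask runs in its own call but shares
    # one state dict, playing the role of the shared REPL.
    for line in plan.splitlines():
        if line.startswith("SUB:"):
            sub = line[len("SUB:"):].strip()
            state[sub] = rlm_solve(sub, state, depth + 1, max_depth)
    return call_llm(
        f"Combine the subtask results into a final answer.\n"
        f"Task: {task}\nResults: {state}"
    )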
Quarq retweeted
vaibhav
vaibhav@0xVK__·
GEPA optimizes the static parts of the harness, leading to meaningful gains in overall agent performance. At @quarqlabs, we're running experiments with GEPA and will be sharing our learnings soon. Optimizations like this, across multiple layers of the stack, are what make continual learning feel within reach. When we started, the path wasn't clear. Now it's starting to come into focus, and our conviction is stronger than ever. With our upcoming beta, you'll see just how close we are to true continual learning.
Quarq@quarqlabs

x.com/i/article/2048…
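For readers wondering what "optimizing the static parts of the harness" looks like mechanically, here is a stripped-down sketch of a GEPA-style loop: score a candidate instruction on a train set, reflect on the failures with an LLM, and keep the best mutation. This is an illustration of the idea only, not DSPy's actual GEPA API; call_llm, run_one, and the selection rule are hypothetical.

# Stripped-down sketch of a GEPA-style optimizer: evolve a program's
# static instructions via LLM reflection on failure cases. Illustration
# of the idea only, not DSPy's actual GEPA API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # reflection model

def gepa_style_optimize(seed_instruction, trainset, run_one, budget=20):
    # run_one(instruction, example) -> score in [0, 1]
    def scores(instr):
        return [run_one(instr, ex) for ex in trainset]

    pool = [(seed_instruction, scores(seed_instruction))]
    for _ in range(budget):
        parent, parent_scores = max(pool, key=lambda p: sum(p[1]))
        failures = [ex for ex, s in zip(trainset, parent_scores) if s < 1.0]
        if not failures:
            break  # nothing left to learn from
        # Reflect: ask an LLM to rewrite the instruction given failures.
        child = call_llm(
            f"Current instruction:\n{parent}\n\n"
            f"Examples it failed on:\n{failures[:5]}\n\n"
            "Propose an improved instruction."
        )
        pool.append((child, scores(child)))
    # GEPA proper keeps a Pareto front over per-example scores; this
    # sketch simply returns the candidate with the best aggregate.
    return max(pool, key=lambda p: sum(p[1]))[0]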
