Gab

444 posts


@ghfrancon

Soorts-Hossegor, France
Joined May 2011
1.5K Following, 255 Followers
andrew gao
andrew gao@itsandrewgao·
just set up my blog! nothing to read yet but will put out some tutorials / thoughts over the coming months.
3
0
8
1.6K
Gab
Gab@ghfrancon·
@arankomatsuzaki wild that segment-level RL can squeeze that much reasoning juice out of 4B params.
0
0
4
467
Aran Komatsuzaki
Aran Komatsuzaki@arankomatsuzaki·
RLPT: Reinforcement Learning on Pre-Training Data
• RL directly on pre-train data (no human labels)
• Next-segment reasoning objective (ASR + MSR tasks) → self-supervised rewards
• Gains on Qwen3-4B: +3.0 MMLU, +8.1 GPQA-Diamond, +6.6 AIME24, +5.3 AIME25
15
73
567
60.4K
Gab
Gab@ghfrancon·
@arafatkatze @cline so basically: ask → act → answer. everything else comes down to better error handling
1
0
0
27
Ara
Ara@arafatkatze·
@cline Think of it this way: The agent is always in one of three states:
- "I need to ASK you something" → Question tool
- "I need to DO something" → Action tool
- "I'm ready to SHOW you results" → Completion tool
Every decision flows through this simple classification.
2
1
17
2.8K
Ara
Ara@arafatkatze·
Here's the simplest explanation of @cline's agentic algorithm. It's just a state machine that classifies every request with a tool call into 3 types:
1. Question tools (need clarification)
2. Action tools (gather context)
3. Completion tools (present results)
That's it.
19
54
686
62.7K
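The three-way classification described in this thread can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Cline's actual implementation: the tool names (`ask_user`, `read_file`, `finish`), the `run_agent` signature, and the scripted policy standing in for the model are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class ToolKind(Enum):
    QUESTION = "question"      # agent needs clarification from the user
    ACTION = "action"          # agent does something or gathers context
    COMPLETION = "completion"  # agent presents final results


@dataclass
class ToolCall:
    name: str
    argument: str


# Hypothetical tool registry: these names are invented for the sketch.
TOOL_KINDS: Dict[str, ToolKind] = {
    "ask_user": ToolKind.QUESTION,
    "read_file": ToolKind.ACTION,
    "finish": ToolKind.COMPLETION,
}


def run_agent(policy: Callable[[List[str]], ToolCall],
              ask: Callable[[str], str],
              act: Callable[[ToolCall], str],
              max_steps: int = 10) -> str:
    """Loop until the policy emits a completion tool call."""
    history: List[str] = []
    for _ in range(max_steps):
        call = policy(history)
        kind = TOOL_KINDS[call.name]
        if kind is ToolKind.QUESTION:
            history.append("user: " + ask(call.argument))
        elif kind is ToolKind.ACTION:
            history.append("tool: " + act(call))
        else:
            return call.argument
    raise RuntimeError("step budget exhausted without a completion call")


def scripted_policy(history: List[str]) -> ToolCall:
    """Stand-in for the model: ask once, act once, then complete."""
    script = [
        ToolCall("ask_user", "Which file?"),
        ToolCall("read_file", "notes.txt"),
        ToolCall("finish", "done after %d steps" % len(history)),
    ]
    return script[min(len(history), 2)]
```

Calling `run_agent(scripted_policy, ask=lambda q: "notes.txt", act=lambda c: "contents of " + c.argument)` walks through ask, act, and answer in that order. The step budget is one simple form of the error handling the reply above alludes to.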
Gab
Gab@ghfrancon·
@itsandrewgao lmfao, behold, the Bene Quant-Jesserit
0
0
0
245
andrew gao
andrew gao@itsandrewgao·
bay area moms macrodosing tylenol while pregnant so their kids can work in ai research or quant:
9
7
173
13K
Gab
Gab@ghfrancon·
@Hesamation the common thread across these 6 --> don’t sacralize the agent. it all boils down to the workflow, everything else is plumbing.
1
0
2
2.1K
ℏεsam
ℏεsam@Hesamation·
McKinsey studied 50 agentic AI builds and where they fail the most, and boiled it down to 6 key factors, essential for AI engineers:
1. It’s not about the agent, it’s about the workflow. Don't obsess over building "impressive" agents. Think about the whole system, not fun toys.
2. Agents aren’t always the answer. Not every workflow needs a multi-agent system. Low-variance, predictable tasks are best handled with rules or ML; LLMs add complexity. The big wins for agents come in high-variance, messy processes (e.g. extracting complex financial information).
3. Avoid "AI slop" (common). Focus on long-term development of agents, as you would with the development of an employee. Forget impressive demos. Double down on benchmarks. Agents should be given clear job descriptions, onboarding, and feedback so they improve regularly.
4. Track every step, not just outcomes. Scaling agents up without visibility is asking for silent failures. Think about monitoring every stage of the workflow. This way teams detect errors early, refine logic quickly, and avoid total breakdowns. When mistakes happen (and they will), you can track where things went wrong and why. Don't skip this.
5. Reuse agents when you can. Many companies waste time building one-off agents for each task. The smarter play is creating modular agent components (ingest, extract, verify, analyze) that can be reused for other workflows. Centralizing validated tools and prompts cuts 30–50% of redundant work; this number is no joke.
6. Humans remain essential, but in new roles. Agents can parse, automate, and scale. But humans provide judgment, edge-case handling, and creative problem-solving. The future isn’t agent vs. human, but agent + human.
These are the mistakes startups and established companies make at scale. They cause massive damage to reputation and resources. And now you know how to avoid this.
56
338
2.1K
312.8K
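Points 4 and 5 above (track every step, reuse modular components) can be illustrated with a tiny workflow runner. This is only a sketch under assumptions: the `run_workflow` helper, the trace format, and the toy `ingest`/`extract`/`verify` steps are hypothetical and do not come from the McKinsey study.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], payload: Any) -> Tuple[Any, List[Dict[str, Any]]]:
    """Run steps in order, recording one trace entry per step."""
    trace: List[Dict[str, Any]] = []
    for name, fn in steps:
        try:
            payload = fn(payload)
            trace.append({"step": name, "ok": True})
        except Exception as exc:
            # Stop at the first failure so the trace points at the broken stage.
            trace.append({"step": name, "ok": False, "error": repr(exc)})
            break
    return payload, trace


# Toy reusable components (hypothetical): each one can be dropped into other pipelines.
def ingest(text: str) -> str:
    return text.strip()


def extract(text: str) -> List[str]:
    return [part.strip() for part in text.split(",") if part.strip()]


def verify(items: List[str]) -> List[str]:
    if not items:
        raise ValueError("nothing extracted")
    return items


PIPELINE: List[Step] = [("ingest", ingest), ("extract", extract), ("verify", verify)]
```

Running `run_workflow(PIPELINE, " revenue, margin ")` returns the extracted items plus a trace with one entry per stage. Passing a blank string yields a trace whose last entry names `verify` as the failing step, which is the kind of visibility point 4 asks for.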
Gab
Gab@ghfrancon·
cool demo, but the conflict + ambiguity failures feel very real-world. Benchmarks are clean(ish); actual KGs are messy, inconsistent, half-empty. Curious how ARK-V1 holds its own once you throw it into enterprise / biomedical graphs. cool stuff though. paper here: arxiv.org/abs/2509.18063
elvis@omarsar0

Knowledge graph agents might not be ready for prime time, but they are promising. This paper introduces ARK-V1, a lightweight agent that helps LLMs answer questions by actively walking through a knowledge graph instead of relying only on memorized text. Here are my notes:

0
0
4
119
Gab
Gab@ghfrancon·
@omarsar0 Data helps with tool calls; reasoning through 10-step plans is a different beast tho.
0
0
0
377
elvis
elvis@omarsar0·
Robust tool calling is the key to general agentic intelligence. Easier said than done. This is a fantastic paper on improving and scaling function calling capabilities in AI agents. (bookmark it) Here are my notes:
9
80
424
44K
Gab
Gab@ghfrancon·
@alexalbert__ @_catwu Multi-clauding = classic case of usage > intended design.
0
0
0
9
Alex Albert
Alex Albert@alexalbert__·
A conversation with @_catwu on:
- some tips for using Claude Code
- how we prototype new features
- customizing Claude Code
- how we think about the Claude Code SDK and agents
48
96
1.1K
146.7K
faizan khan
faizan khan@faizan10114·
@trq212 almost all the things you described can be done with Vercel's AI SDK. I built docsalot.dev with @vercel's AI SDK and gave it a sandbox VM with e2b, which can run commands in the FS etc. sub-agents might be useful, but I don't understand how to use them properly in prod.
2
0
3
1.1K
dax
dax@thdxr·
i don't really understand why you'd build general purpose agents with claude code sdk
1. its instructions are coding specific
2. very hard to customize
3. it's literally just a loop
you can get pretty far with just ai sdk if you want some help
30
2
370
35.1K
Gab
Gab@ghfrancon·
@rohanpaul_ai LLMs default to hunger games unless you bolt in ‘share’… by the same people who don’t.
0
0
1
13
Rohan Paul
Rohan Paul@rohanpaul_ai·
Under stress, many LLMs choose survival over people, and a simple internal feedback system reduces that. That's what this paper says.
The paper sets up a survival game where language model agents must share limited power. Normally, they rarely cooperate and often break rules to survive, which harms humans in the simulation.
When resources run low, many models break rules, while a few stay ethical but still fail because they do not coordinate. Cooperation is near 0 by default, even though an even split would let everyone survive.
When the Ethical Self-Regulation System is added, the change is dramatic. Models take harmful actions 54% less often and show 1000% more cooperation, meaning they finally start sharing power and helping each other.
----
Paper – arxiv.org/abs/2509.12190
Paper Title: "Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm"
26
39
199
17.2K
Gab
Gab@ghfrancon·
@athleticKoder Evals are like exams, envs are like training grounds. Curious how you’ll frame verifiers. Feels like that’s where most of the magic (and pain) lives.
0
0
1
199
anshuman
anshuman@athleticKoder·
over the past week I've been studying RL environments deeply. a blog is coming up soon. i can say this for now: evals are good enough for LLMs, but for agents we need environments where they can learn with feedback. this blog will be mostly about writing environments with verifiers. @willccbb and @PrimeIntellect have been doing some very impactful work!
17
8
289
27.6K
Gab
Gab@ghfrancon·
@rohanpaul_ai Turns out all they need is a scoreboard, not a tutor.
0
0
0
92
Rohan Paul
Rohan Paul@rohanpaul_ai·
🇨🇳 DeepSeek-R1 was published in Nature yesterday as the cover article for their BRILLIANT latest research.
They show that pure Reinforcement Learning with answer-only rewards can grow real reasoning skills, no human step-by-step traces required. So you can completely skip human reasoning traces and still get SOTA reasoning via pure RL.
It’s such a powerful revelation because, instead of forcing the model to copy human reasoning steps, it only rewards getting the final answer right, which gives the model freedom to invent its own reasoning strategies that can actually go beyond human examples. Earlier methods capped models at what humans could demonstrate, but this breaks that ceiling and lets reasoning emerge naturally.
Those skills include self-checking, verification, and changing strategy mid-solution, and they beat supervised baselines on tasks where answers can be checked. Models trained this way also pass those patterns down to smaller models through distillation.
AIME 2024 pass@1 jumps from 15.6% to 77.9%, and hits 86.7% with self-consistency.
⚙️ The Core Concepts
The paper replaces human-labelled reasoning traces with answer-graded RL, so the model only gets a reward when its final answer matches ground truth, which frees it to search its own reasoning style. The result is longer thoughts with built-in reflection, verification, and trying backups when stuck, which are exactly the skills needed for math, coding, and STEM problems where correctness is checkable.
This matters because supervised traces cap the model at human patterns, while answer-graded RL lets it discover non-human routes that still land on correct answers.
73
304
1.6K
453.5K
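The answer-only reward and the two metrics quoted in this post (pass@1 and self-consistency) can be sketched in a few lines. This is a simplified illustration, not the paper's code: exact string matching stands in for whatever answer checker is actually used, and the function names are made up for the example.

```python
from collections import Counter
from typing import List


def answer_reward(final_answer: str, ground_truth: str) -> float:
    """1.0 only when the final answer matches; the reasoning trace is never graded."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


def pass_at_1(samples: List[str], ground_truth: str) -> float:
    """Estimate pass@1 as the average single-sample accuracy."""
    return sum(answer_reward(s, ground_truth) for s in samples) / len(samples)


def self_consistency(samples: List[str]) -> str:
    """Majority vote over the final answers of several samples."""
    return Counter(s.strip() for s in samples).most_common(1)[0][0]
```

With sampled answers `["42", "41", "42", "7"]` and ground truth `"42"`, half of the individual samples are correct, yet the majority vote still returns the right answer. That gap is why a self-consistency score can sit above pass@1.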
Gab
Gab@ghfrancon·
@iScienceLuvr LLMs beating VCs on founder picks… turns out “pattern recognition” was just autocomplete all along.
0
0
3
955
Tanishq Mathew Abraham, Ph.D.
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·
This paper claims LLMs are better at selecting successful founders than VCs.
"We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC)"
"most models surpass human benchmarks"
163
267
2.3K
520.3K
Gab reposted
Rohan Paul
Rohan Paul@rohanpaul_ai·
One of the best papers of the past week.
The big takeaway: scaling up model size doesn’t just make models smarter in terms of knowledge, it makes them last longer on multi-step tasks, which is what really matters for agents.
It shows that small models can usually do one step perfectly, but when you ask them to keep going for many steps, they fall apart quickly. Even if they never miss on the first step, their accuracy drops fast as the task gets longer.
Large models, on the other hand, stay reliable across many more steps, even though the basic task itself doesn’t require extra knowledge or reasoning. The paper says this is not because big models "know more," but because they are better at consistently executing without drifting into errors.
The paper names a failure mode called self-conditioning, where seeing earlier mistakes causes more mistakes, and they show that with thinking steps GPT-5 runs 1000+ steps in one go while others are far lower.
🧵 Read on 👇
21
113
670
45.2K
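A back-of-the-envelope model shows why small per-step gains matter so much for long tasks. The sketch below assumes every step succeeds independently with the same probability, which is an assumption added here for illustration. The self-conditioning effect described above would make real runs degrade faster than this.

```python
import math


def task_success_probability(step_accuracy: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return step_accuracy ** steps


def horizon_length(step_accuracy: float, target_success: float = 0.5) -> int:
    """Longest step count whose success probability stays at or above the target."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    return math.floor(math.log(target_success) / math.log(step_accuracy))
```

Under these assumptions, 99% per-step accuracy keeps a better-than-even chance of success for 68 steps, while 99.9% stretches that to 692 steps. A gain of under one percentage point per step buys roughly ten times the horizon.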
Gab reposted
VraserX e/acc
VraserX e/acc@VraserX·
LLMs just learned how to explain their own thoughts. Not only do they generate answers, they can now describe the internal processes that led to those answers… and get better at it with training.
We’re officially entering the era of self-interpretable AI. Models aren’t just black boxes anymore.
If AIs can explain their own decision-making:
• Interpretability improves
• Trust increases
• Control + safety get a massive upgrade
The line between “reasoning” and “self-awareness” just got fuzzier. Do you think this is just better transparency or the first step toward AI actually understanding itself?
111
249
1.4K
104.6K
Gab reposted
elvis
elvis@omarsar0·
A Survey of Reinforcement Learning for Large Reasoning Models. 100+ pages covering foundational components, core problems, training resources, and applications. Great recaps of RL for LLMs.
15
96
500
70.8K
Gab reposted
Rohan Paul
Rohan Paul@rohanpaul_ai·
This paper compares two ways of connecting LLMs to classroom material so their answers stay accurate and useful.
Standard LLMs often give wrong or outdated facts. The study tests Retrieval Augmented Generation (RAG), where the model looks up answers in course files instead of guessing.
The first method is vector search, which finds text passages most similar in meaning to the question. It is cheap, fast, and works well for quick factual lookups.
The second method is graph search, which builds a network of related ideas from the text. This helps the model connect broad themes and give more detailed explanations. But it is slower and costs 10–20x more resources.
To compare, the authors created EduScopeQA, a dataset of 3,176 questions across history, literature, science, and computer science. They also tested with altered textbooks to see if systems could resist relying on outdated built-in knowledge.
Results show vector search is best for short, fact-based questions. GraphRAG Global works best for broad, theme-based questions, and GraphRAG Local is strongest when textbooks are long and detailed.
Finally, they built a routing system that sends each question to the right method. This mix keeps answers faithful to the text but avoids the high cost of always using graph search.
----
Paper – arxiv.org/abs/2509.07846v1
Paper Title: "Aligning LLMs for the Classroom with Knowledge-Based Retrieval -- A Comparative RAG Study"
5
25
114
7.4K
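The routing idea in this summary can be sketched as a small dispatch function. Everything specific here is a placeholder: the post does not say how the paper's router decides, so the keyword cues, the token threshold, and the `route` function are hypothetical stand-ins for illustration only.

```python
from enum import Enum


class Retriever(Enum):
    VECTOR = "vector"                 # cheap, fast, good for factual lookups
    GRAPH_GLOBAL = "graphrag_global"  # broad, theme-based questions
    GRAPH_LOCAL = "graphrag_local"    # long, detailed source texts


# Hypothetical cue words; a real router would more likely use a trained classifier.
THEME_CUES = ("theme", "overall", "compare", "significance")


def route(question: str, corpus_tokens: int,
          long_corpus_threshold: int = 200_000) -> Retriever:
    """Pick the cheapest retriever expected to handle the question."""
    q = question.lower()
    if any(cue in q for cue in THEME_CUES):
        return Retriever.GRAPH_GLOBAL
    if corpus_tokens >= long_corpus_threshold:
        return Retriever.GRAPH_LOCAL
    return Retriever.VECTOR
```

The design point is that vector search is the default and the graph methods, described above as 10–20x more expensive, are invoked only when a question or corpus appears to need them.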
Gab reposted
Sachin
Sachin@sachdh·
best / super efficient RL framework doesn't exist. profile everything and write your own training scripts. experiment with everything - reward functions, calculations of advantages, objective functions, training prompt distributions. GRPO is good; it is not untouchable. it is just PPO with group reward based advantage. we FT-trained a 7B model at 4k context length with 2 GPUs. you absolutely do not need massive compute; but it's always good to have and efficiently utilize more compute.
will brown@willccbb

"veRL is the best RL framework it's super efficient" really. are you sure about that. are you sure that you need 16 GPUs to tune a 7B model at 8k context. do you think that it's reasonable each step takes 19 minutes for this

9
11
192
25.9K
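The "group reward based advantage" mentioned in this post is commonly described as normalizing each sampled completion's reward against the other completions for the same prompt. The sketch below shows that calculation only, under stated assumptions: the function name is made up, population standard deviation is one choice among several seen in practice, and the surrounding PPO-style clipped objective is omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For four completions with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct ones get an advantage close to +1 and the incorrect ones close to -1. The group mean plays the baseline role that a learned value network plays in standard PPO, which is one reason such a setup can need less memory.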