Debstep

2.6K posts

@xdebstep

💻@sublime_sec | Prev @nvidia @tesla @visa @amazon also a @kp_fellows

San Francisco, CA · Joined September 2014
2.6K Following · 805 Followers
Debstep retweeted
Akshay 🚀 @akshay_pachaar
8 RAG architectures for AI Engineers (explained with usage):

1) Naive RAG
- Retrieves documents purely on vector similarity between the query embedding and stored embeddings.
- Works best for simple, fact-based queries where direct semantic matching suffices.

2) Multimodal RAG
- Handles multiple data types (text, images, audio, etc.) by embedding and retrieving across modalities.
- Ideal for cross-modal retrieval tasks, such as answering a text query with both text and image context.

3) HyDE (Hypothetical Document Embeddings)
- Addresses the case where queries are not semantically similar to the documents that answer them.
- Generates a hypothetical answer document from the query before retrieval.
- Uses the generated document's embedding to find more relevant real documents.

4) Corrective RAG
- Validates retrieved results by comparing them against trusted sources (e.g., web search).
- Filters or corrects retrieved content before passing it to the LLM, keeping information accurate and up to date.

5) Graph RAG
- Converts retrieved content into a knowledge graph that captures entities and relationships.
- Enhances reasoning by providing structured context alongside raw text to the LLM.

6) Hybrid RAG
- Combines dense vector retrieval with graph-based retrieval in a single pipeline.
- Useful when a task needs both unstructured text and structured relational data for richer answers.

7) Adaptive RAG
- Dynamically decides whether a query needs a simple direct retrieval or a multi-step reasoning chain.
- Breaks complex queries into smaller sub-queries for better coverage and accuracy.

8) Agentic RAG
- Uses AI agents with planning, reasoning (ReAct, CoT), and memory to orchestrate retrieval from multiple sources.
- Best suited for complex workflows that require tool use, external APIs, or combining multiple RAG techniques.

👉 Over to you: Which RAG architecture do you use the most?
Share this with your network if you found it insightful ♻️
Find me → @akshay_pachaar for more insights and tutorials on LLMs, AI Agents, and Machine Learning!
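The difference between Naive RAG (1) and HyDE (3) can be sketched in a few lines. This is a minimal illustration, not production code: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the hypothetical answer is hard-coded where a real HyDE pipeline would generate it with an LLM.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_text: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank stored documents by similarity to the query embedding.
    q = embed(query_text)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "the capital of france is paris",
    "python is a programming language",
]

# Naive RAG: retrieve directly on the query embedding.
print(retrieve("what is the capital of france", docs))

# HyDE: first draft a hypothetical answer (hard-coded here; generated by
# an LLM in practice), then retrieve with *its* embedding instead.
hypothetical = "the capital of france is paris"
print(retrieve(hypothetical, docs))
```

Because the hypothetical answer is phrased like a document rather than a question, its embedding tends to land closer to the real answer documents, which is the whole point of HyDE.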
Debstep retweeted
Cameron R. Wolfe, Ph.D. @cwolferesearch
New blog post coming out tomorrow morning on LLM benchmarking. The best way to understand how LLM benchmarks are created, and how we can create a useful benchmark for our own task of interest, is to study the details of the most popular and effective LLM benchmarks. This post will examine the following properties of a wide variety of benchmarks:

1. How the data is sourced
2. How data quality is ensured
3. How model performance is measured
4. How each benchmark has evolved as models have improved

Although many LLM benchmarks exist, the most successful ones share a number of common properties that can easily be adopted as a set of best practices:

- Creating a domain taxonomy so the benchmark is structured and guaranteed to be diverse.
- Leveraging human expertise (for sourcing data, verification, and more).
- Using a model-in-the-loop approach to make data collection more efficient and ensure difficulty.
- Putting strict data quality checks in place.
- Making sure the benchmark is realistic and matches real-world usage of the LLM.
- Evolving the benchmark over time to increase difficulty and capture new dimensions of performance.
Debstep retweeted
Khushi Patil @KhushiPatil25
85 AI Terms explained well 📘📚
Debstep retweeted
Nainsi Dwivedi @NainsiDwiv50980
🚨BREAKING: The CEO behind Claude just dropped a 38-page memo most people completely misunderstood.

Everyone focused on "AI will replace jobs." But buried in the middle is something far more powerful: a reasoning system for thinking with AI, not competing against it.

These prompts aren't shortcuts. They're thinking frameworks. And the people using them will look 10x smarter than everyone else.

Here are 9 advanced Claude prompts inspired by Amodei's methodology that upgrade how you think, decide, and build in the AI era:
Debstep retweeted
Al Tach @MindsetF28912
[image-only tweet]
Debstep retweeted
Nav Toor @heynavtoor
🚨 An AI just wrote a scientific paper. Came up with the hypothesis. Designed the experiments. Ran the code. Analyzed the data. Created the figures. Wrote every word. Then it passed peer review at a top machine learning conference.

No human touched it. Not one word. Not one edit. This is not a demo. This actually happened, at ICLR 2025.

It's called AI Scientist v2: an open-source system that does the entire scientific research process autonomously, end to end, from idea to published paper.

Here's what this system does on its own:
→ Generates research hypotheses from a broad topic you provide
→ Searches existing literature to check if the idea is novel
→ Designs experiments to test the hypothesis
→ Writes and debugs its own experiment code
→ Runs the experiments on GPUs
→ Analyzes the results with statistical methods
→ Creates publication-ready figures and visualizations
→ Writes the entire manuscript, title to references, LaTeX formatted
→ Reviews its own paper and improves it before submission

Here's the wildest part: they submitted 3 fully AI-generated papers to an ICLR workshop. Reviewers were told some papers might be AI-generated, but not which ones. One paper scored 6, 7, and 6 from three reviewers. That put it in the top 45% of all submissions, above the average human paper.

The AI outscored most human researchers. At a real conference. Through blind peer review.

PhD programs cost $50,000 to $80,000 per year. Research takes 5 to 7 years. Postdocs earn $55,000 for more years of the same grind.

2.2K GitHub stars. Published research paper. Apache 2.0 License. 100% open source.
Debstep retweeted
Yuexing Hao @YuexingHao
Hey world! I am looking for a Research Scientist (L4) team match at @GoogleDeepMind and @Google 🚀 I build things with RLHF, Model Evaluation, and Computer Using Agents — from agentic pipelines to RL systems. If you're recruiting, let's talk! My DM is OPEN
Debstep retweeted
Dharmik Harinkhede @Dharmikpawar31
🚨Breaking: Anthropic engineers revealed a simple trick they use internally: Claude agents can now remember how to improve themselves.

A file called AGENT_LEARNINGS.md. The AI updates this file whenever it makes a mistake. Inside it:
• mistakes it made
• patterns to avoid
• better approaches

Before starting new tasks, the agent reads the file. Result? The agent gets smarter over time without retraining the model. This is called external memory scaffolding. Expect this pattern to show up in every serious AI agent system.

♻️ Repost to share with your audience. ✔️ You can follow @Dharmikpawar31 for more internal updates.
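The loop described here (write a lesson after each mistake, read the file before each new task) can be sketched in a few lines. The filename comes from the post; the helper names and entry format below are hypothetical, and a real agent would prepend the output of `load_learnings()` to its system prompt.

```python
import os
import tempfile

def record_learning(path: str, mistake: str, better: str) -> None:
    # Append a lesson to the external memory file after a failed attempt.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"- Mistake: {mistake}\n  Better: {better}\n")

def load_learnings(path: str) -> str:
    # Read the whole file before a new task so past lessons can be
    # injected into the agent's prompt.
    if not os.path.exists(path):
        return ""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Demo in a temp dir; the post names the file AGENT_LEARNINGS.md.
path = os.path.join(tempfile.mkdtemp(), "AGENT_LEARNINGS.md")
record_learning(path, "ran tests before installing deps", "install deps first")
print(load_learnings(path))
```

The key property is that the memory lives outside the model weights, so it persists across sessions and improves behavior without any retraining.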
Debstep retweeted
Chidanand Tripathi @thetripathi58
🚨 Cambridge researchers just tested what happens when you overload an AI's memory with irrelevant data. They found a complete collapse of modern RAG systems. Not a minor hallucination: a total failure of the exact retrieval architecture that every enterprise AI relies on to access private data. The models simply drowned in the noise.

The researchers tested standard Retrieval-Augmented Generation (RAG) and filtering models like Self-RAG. They fed them information while slowly increasing the ratio of distracting, low-quality documents. Here is what they found.

Current read-time filtering failed completely. When the ratio of distractors hit 8:1, the accuracy of standard RAG systems plummeted to 0%. The AI lost the ability to find the truth.

It exposed a massive architectural flaw. We currently store every single document an AI reads, regardless of quality, and force the model to sort through the garbage at query time. It is highly inefficient and fundamentally broken.

The biological fix: the researchers built a new system called "Write-Time Gating," modeled after the human hippocampus. Instead of saving everything, it evaluates novelty, reliability, and source reputation before the data is even stored.

And then there is the finding that changes how we build AI: hierarchical archiving. When beliefs update, the system does not delete the old data. It deprioritizes it, maintaining a version history just like the human brain.

The result? The write-gated system maintained 100% accuracy even at massive distractor scales, all while costing one-ninth the compute of current systems.

The researchers made it clear: when you dump raw, unfiltered data into a database and expect the LLM to figure it out later, you are building a system designed to fail at scale. No reliable retrieval. No cost control. No accuracy guarantees. Nothing.

Right now, companies are building massive vector databases, throwing every piece of corporate documentation into them, and assuming the AI will magically find the signal in the noise. Stop treating AI memory like a hard drive. Start treating it like a biological filter. Build the gate at the entrance, not the exit.
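The write-time-gating pattern (score at write time, archive rather than delete on belief updates) can be sketched as follows. This is a toy illustration of the pattern, not the Cambridge system: the reliability scores, the `key` field, and the threshold are all hypothetical.

```python
def write_gate(doc: dict, store: list[dict],
               min_reliability: float = 0.5) -> bool:
    """Decide at write time whether a document enters memory at all."""
    if doc["reliability"] < min_reliability:   # untrusted source: reject
        return False
    for old in store:
        if old["key"] == doc["key"]:           # belief update: don't delete,
            old["archived"] = True             # deprioritize the old version
    doc["archived"] = False
    store.append(doc)
    return True

store: list[dict] = []
write_gate({"key": "ceo", "text": "CEO is Alice", "reliability": 0.9}, store)
write_gate({"key": "ceo", "text": "CEO is Bob", "reliability": 0.95}, store)
write_gate({"key": "spam", "text": "buy now!!!", "reliability": 0.1}, store)

# Only the current belief is retrievable by default; the old version
# survives as archived history instead of polluting query-time results.
active = [d["text"] for d in store if not d["archived"]]
print(active)
```

The low-reliability document never enters the store at all, so query time never has to filter it out: the gate sits at the entrance, not the exit.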
Debstep retweeted
alphaXiv @askalphaxiv
"Self-Distillation of Hidden Layers for Self-Supervised Representation Learning"

This paper shows that self-supervised vision models work much better when they predict a teacher's hidden layers across the entire visual hierarchy; learning from raw pixels or the final embeddings alone is not enough. The method helps the model learn both low-level grounding and high-level concepts at once, yielding more stable training and roughly a 10% gain over I-JEPA!
Debstep retweeted
alphaXiv @askalphaxiv
"GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent"

Instead of giving LLMs huge KV caches or one-shot compressed summaries, GradMem shows that you can let a frozen model take a few test-time gradient steps to write a long context into a small memory. This lets the model recover far more information later, and turns some inference compute into long-context compression.
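A toy version of the write-by-gradient-descent idea can be sketched without an LLM: compress several context vectors into one small memory slot by taking gradient steps on a reconstruction loss. GradMem does this against a frozen language model; here the loss is a simple mean-squared error, so this only illustrates the mechanics.

```python
def write_memory(context: list[list[float]], steps: int = 200,
                 lr: float = 0.1) -> list[float]:
    """Compress many context vectors into one memory vector by
    test-time gradient descent on sum_i ||mem - x_i||^2 (toy loss)."""
    dim = len(context[0])
    mem = [0.0] * dim
    for _ in range(steps):
        # Gradient of the loss w.r.t. mem is 2 * sum_i (mem - x_i).
        grad = [2 * sum(mem[d] - x[d] for x in context) for d in range(dim)]
        mem = [mem[d] - lr * grad[d] / len(context) for d in range(dim)]
    return mem

# Three 2-d "context" vectors compressed into a single memory slot.
ctx = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
print(write_memory(ctx))  # converges toward the loss minimizer [2.0, 1.0]
```

With a mean-squared loss the optimum is just the mean of the context, so the toy is trivially solvable; the interesting part in GradMem is that the same write-by-gradient mechanism works when the "loss" is the frozen model's ability to recover the context.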
Debstep retweeted
Jason Weston @jaseweston
🌐Unified Post-Training via On-Policy-Trained LM-as-RM🔧

RLLM = RL + LM-as-RM:
- A post-training framework that unifies RL across easy-to-verify, hard-to-verify, and non-verifiable tasks.
- Trains the LM-as-RM reward model on-policy from the policy's own outputs, then uses those generative rewards to optimize the policy. 🔗📈
- Uses the LLM's reasoning and instruction-following for higher-quality rewards, boosting performance on all task types. 🚀🤖🏆

Read more in the blog post: facebookresearch.github.io/RAM/blogs/rllm/
Debstep retweeted
Nainsi Dwivedi @NainsiDwiv50980
Most developers treat Claude Code like a smarter autocomplete. That's the wrong mental model. It's actually a 4-layer engineering system:

1️⃣ CLAUDE.md → persistent project memory (architecture, rules, team conventions)
2️⃣ Skills → auto-invoked knowledge packs (testing patterns, code review, deploy workflows)
3️⃣ Hooks → deterministic guardrails (security checks, formatting, automation)
4️⃣ Agents → specialized sub-agents (break complex tasks into parallel workflows)

Once these are configured properly, Claude stops behaving like a chatbot and starts behaving like a senior engineer on your team. Most people never reach this level because they skip the setup. The gap between average AI output and production-level results isn't the model; it's the infrastructure around it.

Here's the full breakdown on exactly how to build this 👇
Debstep retweeted
Priyanka Vergadia @pvergadia
Andrej Karpathy defines Context Engineering as a real AI mastery skill. Here's why!

→ Prompt engineering = how you phrase the ask
→ Context engineering = the full information environment the AI sees
→ Memory, tools, RAG, examples, history: all of it
→ Your prompt is often <1% of what the model processes

Here's why this changes everything: LLMs don't think. They continue. The quality of what they continue from IS the skill. Prompt engineering helps the model walk. Context engineering helps it walk in the right direction.

@systemdesignone and I put this visual breakdown together; if you like it, follow us on Substack for more 👇
Debstep retweeted
alphaXiv @askalphaxiv
Given how promising Evolution Strategies (ES) is as an RL alternative, we just completed our own research comparing and evaluating it against GRPO! Our major finding lines up with the ES paper: Evolution Strategies can beat GRPO even when you have only a little fine-tuning data. However, GRPO becomes the better choice once data grows or you start from base models. Hopefully the research we've conducted can serve as an additional rule of thumb if you want to work with ES. Check out our blog below 👇
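For readers unfamiliar with the method being compared, a minimal vanilla Evolution Strategies update looks roughly like this. The toy objective and hyperparameters are illustrative, and this omits the antithetic sampling and reward normalization that practical ES implementations use.

```python
import random

def es_step(theta: list[float], reward, pop: int = 50,
            sigma: float = 0.1, lr: float = 0.05) -> list[float]:
    """One ES update: perturb parameters with Gaussian noise, score each
    perturbation, then move along the reward-weighted average noise."""
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(pop):
        eps = [random.gauss(0, 1) for _ in range(dim)]
        r = reward([theta[d] + sigma * eps[d] for d in range(dim)])
        for d in range(dim):
            grad[d] += r * eps[d]
    # Estimated ascent direction: (1 / (pop * sigma)) * sum_i r_i * eps_i.
    return [theta[d] + lr * grad[d] / (pop * sigma) for d in range(dim)]

# Toy objective: maximize -(x - 3)^2, whose optimum is x = 3.
random.seed(0)
theta = [0.0]
for _ in range(300):
    theta = es_step(theta, lambda t: -(t[0] - 3.0) ** 2)
print(theta)  # approaches 3.0
```

Note that ES never needs gradients of the reward itself, only reward evaluations of noisy parameter perturbations, which is why it is attractive when backpropagating through the objective is awkward.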
Debstep retweeted
Jenny Zhang @jennyzhangzt
Introducing Hyperagents: an AI system that not only improves at solving tasks, but also improves how it improves itself.

The Darwin Gödel Machine (DGM) demonstrated that open-ended self-improvement is possible by iteratively generating and evaluating improved agents, yet it relies on a key assumption: that improvements in task performance (e.g., coding ability) translate into improvements in the self-improvement process itself. This alignment holds in coding, where both evaluation and modification are expressed in the same domain, but breaks down more generally. As a result, prior systems remain constrained by fixed, handcrafted meta-level procedures that do not themselves evolve.

We introduce Hyperagents: self-referential agents that can modify both their task-solving behavior and the process that generates future improvements. This enables what we call metacognitive self-modification: learning not just to perform better, but to improve at improving. We instantiate this framework as DGM-Hyperagents (DGM-H), an extension of the DGM in which both task-solving behavior and the self-improvement procedure are editable and subject to evolution.

Across diverse domains (coding, paper review, robotics reward design, and Olympiad-level math solution grading), hyperagents enable continuous performance improvements over time and outperform baselines without self-improvement or open-ended exploration, as well as prior self-improving systems (including DGM). DGM-H also improves the process by which new agents are generated (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs.

This work was done during my internship at Meta (@AIatMeta), in collaboration with Bingchen Zhao (@BingchenZhao), Wannan Yang (@winnieyangwn), Jakob Foerster (@j_foerst), Jeff Clune (@jeffclune), Minqi Jiang (@MinqiJiang), Sam Devlin (@smdvln), and Tatiana Shavrina (@rybolos).