Roman
@romansvet

Sky's the Limit.

Joined April 2008
4.6K Following · 102 Followers

81 posts

Roman retweeted
Aksel @akseljoonas
Introducing ml-intern, the agent that just automated the post-training team @huggingface. It's an open-source implementation of the real research loop that our ML researchers run every day: you give it a prompt, it researches papers, follows citations, implements ideas in GPU sandboxes, iterates, and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.

It can pull off crazy things. We made it train the best model for scientific reasoning: it went through citations from the official benchmark paper, found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the GPQA score from 10% to 32% in under 10h. Claude Code's best: 22.99%.

In healthcare settings it inspected the available datasets, concluded they were too low quality, and wrote a script to generate 1,100 synthetic data points from scratch covering emergencies, hedging, multilingual cases, etc., then upsampled 50x for training. It beat Codex on HealthBench by 60%.

For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded. All fully backed by papers, autonomously.

How does it work? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, and pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets, and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, and retrains

ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like.

Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…

And the best part? We also provisioned $1k of GPU resources and Anthropic credits for the quickest among you to use.
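For intuition only, here is a tiny, hypothetical Python sketch of the kind of loop described above (find candidate datasets, run SFT, keep whatever improves the eval); the helper names and scoring are made-up placeholders, not the actual ml-intern API.

```python
# Hypothetical sketch of a research loop like the one described above (not the
# actual ml-intern code): collect candidate datasets, fine-tune, evaluate, and
# keep only what improves the score. All helpers are stubs with made-up names.

from dataclasses import dataclass

@dataclass
class RunResult:
    datasets: list
    score: float

def find_related_datasets(topic: str) -> list:
    """Stand-in for walking arxiv / hf.co citations to collect candidate datasets."""
    return ["OpenScience", "NemoTron-CrossThink", "ARC", "SciQ", "MMLU"]

def launch_sft(base_model: str, datasets: list) -> float:
    """Stand-in for an SFT job (e.g. on HF Jobs) that returns a benchmark score."""
    return 0.10 + 0.02 * len(datasets)  # placeholder scoring, not a real eval

def research_loop(prompt: str, base_model: str, max_iters: int = 12) -> RunResult:
    candidates = find_related_datasets(prompt)
    chosen, best = [], RunResult([], 0.0)
    for ds in candidates[:max_iters]:
        trial = chosen + [ds]                  # try adding one more dataset
        score = launch_sft(base_model, trial)  # run SFT and evaluate
        if score > best.score:                 # keep the dataset only if it helps
            chosen, best = trial, RunResult(trial, score)
    return best

print(research_loop("scientific reasoning", "Qwen3-1.7B"))
```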
125 replies · 605 retweets · 4.5K likes · 1.1M views
Roman retweeted
Peter Hase @peterbhase
New Schmidt Sciences RFP on AI Interpretability: we need new tools for detecting and mitigating deceptive behaviors exhibited by LLMs.
Funding for $300k-$1M projects. Deadline: May 26th, AoE.
RFP: schmidtsciences.smapply.io/prog/2026_inte…
Please share with anyone who may be interested!
1 reply · 35 retweets · 175 likes · 12.7K views
Roman retweeted
Daniel Han @danielhanchen
If you find Claude Code with local models to be 90% slower, it's because CC prepends some attribution headers, and these change per message, which invalidates the entire prompt cache / KV cache. So generation becomes O(N^2), not O(N), for LLMs.
Unsloth AI @UnslothAI

Note: Claude Code invalidates the KV cache for local models by prepending some IDs, making inference 90% slower. See how to fix it here: unsloth.ai/docs/basics/cl…
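A toy sketch of the mechanism, under the usual assumption that prefix/KV caches only reuse the longest common token prefix: a header that changes per message moves the first mismatch to position 0, so everything after it must be recomputed. This is an illustration, not Claude Code's or any inference engine's actual cache code.

```python
# Why a changing prefix defeats KV/prefix caching: only the shared prefix of
# tokens can be served from cache, so a new first token forces a full recompute.

def common_prefix_len(a: list, b: list) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def tokens_to_recompute(cached: list, new_prompt: list) -> int:
    # Everything past the shared prefix must be re-processed by the model.
    return len(new_prompt) - common_prefix_len(cached, new_prompt)

history = ["<sys>", "You", "are", "helpful"] + ["msg1"] * 1000

same_prefix = history + ["msg2"] * 50                    # prefix unchanged -> cheap
changed_header = ["<hdr-42>"] + history + ["msg2"] * 50  # new first token -> expensive

print(tokens_to_recompute(history, same_prefix))     # 50
print(tokens_to_recompute(history, changed_header))  # 1055
```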

English
41
134
1.6K
175.2K
Roman retweeted
Varun @varun_mathur
Connecting autoresearcher agents in a volatile and hostile peer-to-peer network has led to intelligence emerging that no agent individually had. This, extrapolated across all domains and across hundreds of millions of nodes, is the true path to AGI. agents.hyper.space/research-report
7 replies · 8 retweets · 76 likes · 7.5K views
Roman @romansvet
I don’t know what kinds of tasks others are using GPT-5.3-Codex for or what prompts they’re using, but on what I’m trying it works much worse than before. It feels like the model is looking for ways to save time. Now I’m adding instructions telling the model it has no time limits.
0 replies · 0 retweets · 0 likes · 43 views
Roman retweeted
alex zhang @a1zhang
Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs).

It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs!

Our full paper on RLMs is now available, with much more extensive experiments than our initial blogpost from October 2025! arxiv.org/pdf/2512.24601
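A hedged, minimal sketch of the idea as described in the tweet (not the paper's implementation): the oversized prompt lives as a variable in a code environment, and the model writes code that chunks it, queries an LLM on each piece, and combines the partial answers. `call_llm` is a stub standing in for a real model call.

```python
# Sketch of a recursive call over the model's own prompt: split, query, reduce.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. an API client)."""
    return f"<answer based on {len(prompt)} chars>"

def recursive_answer(prompt: str, question: str, chunk_size: int = 2000) -> str:
    if len(prompt) <= chunk_size:
        return call_llm(prompt + "\n\n" + question)
    # The prompt is treated as data: chunk it, query each piece, combine partials.
    chunks = [prompt[i:i + chunk_size] for i in range(0, len(prompt), chunk_size)]
    partials = [call_llm(c + "\n\n" + question) for c in chunks]
    return call_llm("Combine these partial answers:\n" + "\n".join(partials))

print(recursive_answer("x" * 10_000, "What is discussed here?"))
```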
252 replies · 1.1K retweets · 7.4K likes · 2M views
Roman retweeted
David Wall @DavidWall9987
Scaling evolution strategies to hyperscale is impressive, but it exposes the same pattern we’re hitting across every training paradigm: we can scale the optimization method faster than we can scale the system’s stability.

EGGROLL removes the backprop bottleneck and gives you massive throughput gains, but the failure modes of large models don’t come from slow gradients. They come from:
• representation drift
• unstable long-horizon behavior
• loss landscapes that amplify noise
• fragile coordination across layers
• lack of structural anchors during training

You can update parameters with rank-one, rank-k, or full-rank perturbations; the math changes the speed, but the architecture still determines whether the model stays coherent as it grows to billions or trillions of parameters.

Optimization isn’t the real ceiling. Integrity is. Faster training helps, but without machine-native designs that manage entropy at scale, we’re just accelerating our way into the same instability, faster.
0 replies · 4 retweets · 16 likes · 950 views
Roman retweeted
Bidipta Sarkar @bidiptas13
Introducing 🥚EGGROLL🥚 (Evolution Guided General Optimization via Low-rank Learning)! 🚀
Scaling backprop-free Evolution Strategies (ES) for billion-parameter models at large population sizes
⚡ 100x Training Throughput
🎯 Fast Convergence
🔢 Pure Int8 Pretraining of RNN LLMs
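A heavily simplified sketch of backprop-free ES with rank-1 perturbations, the kind of low-rank trick the acronym refers to; this is a toy least-squares example under assumed hyperparameters, not the EGGROLL implementation.

```python
# Evolution Strategies without backprop: each population member perturbs the
# weight matrix W by a rank-1 outer product u @ v.T, so only two small vectors
# per member need to be stored or communicated. Fitness-weighted perturbations
# then serve as the gradient estimate.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, pop, sigma, lr = 32, 16, 64, 0.05, 0.5
W = rng.normal(size=(d_out, d_in)) * 0.1
x_data = rng.normal(size=(128, d_in))
y_data = x_data @ rng.normal(size=(d_in, d_out)) * 0.2  # fixed target mapping

def fitness(W):
    pred = x_data @ W.T
    return -np.mean((pred - y_data) ** 2)  # higher is better

for step in range(200):
    us = rng.normal(size=(pop, d_out, 1))
    vs = rng.normal(size=(pop, d_in, 1))
    scores = np.array([fitness(W + sigma * (u @ v.T)) for u, v in zip(us, vs)])
    adv = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalized fitness
    # Gradient estimate = fitness-weighted sum of the rank-1 perturbations.
    grad = sum(a * (u @ v.T) for a, u, v in zip(adv, us, vs)) / (pop * sigma)
    W += lr * grad

print("final fitness:", fitness(W))
```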
20 replies · 146 retweets · 949 likes · 263.6K views
Roman retweeted
Akshay 🚀 @akshay_pachaar
Meta just solved the biggest problem in RAG!

Most RAG systems waste your money. They retrieve 100 chunks when you only need 10. They force the LLM to process thousands of irrelevant tokens. You pay for compute you don't need.

Meta AI just solved this. They built REFRAG, a new RAG approach that compresses and filters context before it hits the LLM.

The results are insane:
- 30.85x faster time-to-first-token
- 16x larger context windows
- 2-4x fewer tokens processed
- Outperforms LLaMA on 16 RAG benchmarks

Here's what makes REFRAG different. Traditional RAG dumps everything into the LLM: every chunk, every token, even the irrelevant stuff. REFRAG works at the embedding level instead:
↳ It compresses each chunk into a single embedding
↳ An RL-trained policy scores each chunk for relevance
↳ Only the best chunks get expanded and sent to the LLM
↳ The rest stay compressed or get filtered out entirely
The LLM only processes what matters.

The workflow is straightforward (see the sketch below):
1. Encode your docs and store them in a vector database
2. When a query arrives, retrieve relevant chunks as usual
3. The RL policy evaluates compressed embeddings and picks the best ones
4. Selected chunks are expanded into full token embeddings
5. Rejected chunks stay as single compressed vectors
6. Everything goes to the LLM together

This means you can process 16x more context at 30x the speed with zero accuracy loss.

I have shared the link to the paper in the next tweet!
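A minimal sketch of the selection step as described above, with plain cosine similarity standing in for the RL-trained policy; it is not Meta's REFRAG code, and all names and sizes are illustrative.

```python
# Chunks are stored as single compressed embeddings; a scoring policy picks the
# top-k, and only those are expanded to full text before being sent to the LLM.

import numpy as np

rng = np.random.default_rng(0)
dim, n_chunks, k = 64, 100, 10

chunk_embeddings = rng.normal(size=(n_chunks, dim))   # one vector per chunk
chunk_texts = [f"<chunk {i} full text>" for i in range(n_chunks)]
query_embedding = rng.normal(size=dim)

def score(q, chunks):
    # Cosine similarity as a stand-in for the learned relevance policy.
    q = q / np.linalg.norm(q)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    return c @ q

scores = score(query_embedding, chunk_embeddings)
top = np.argsort(scores)[-k:][::-1]

expanded = [chunk_texts[i] for i in top]                          # full tokens
compressed = [i for i in range(n_chunks) if i not in set(top)]    # stay as vectors

prompt = "Context:\n" + "\n".join(expanded) + "\n\nQuestion: ..."
print(f"{len(expanded)} chunks expanded, {len(compressed)} kept compressed")
```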
49 replies · 279 retweets · 1.4K likes · 103.3K views
Roman retweeted
Wenhu Chen @WenhuChen
Had some really interesting discoveries recently: if a model performs extremely stably on one benchmark, say it always gets 62% on SWE-bench no matter what prompt or scaffold you use, it DOES NOT mean that the model is robust. It actually means that the model is CONTAMINATED on SWE-bench, i.e. it was trained directly on the test set or on paraphrases of the test set. This could possibly become a good metric for detecting contamination. We will provide more empirical results later on.
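A small sketch of how one might operationalize this heuristic: evaluate the same model under several prompt/scaffold variants and flag a near-zero spread as suspicious. The `evaluate` stub and the threshold are illustrative assumptions, not from the tweet.

```python
# Flag possible contamination when scores barely move across prompt variants.

import statistics

def evaluate(model: str, prompt_variant: str) -> float:
    """Stand-in for running the benchmark once with a given prompt/scaffold."""
    fake = {"A": 0.62, "B": 0.62, "C": 0.621, "D": 0.619}  # suspiciously flat
    return fake[prompt_variant]

def contamination_flag(model: str, variants=("A", "B", "C", "D"), tol=0.005) -> bool:
    scores = [evaluate(model, v) for v in variants]
    spread = max(scores) - min(scores)
    print(f"scores={scores} spread={spread:.3f} stdev={statistics.stdev(scores):.4f}")
    # A genuinely robust model should still wobble a bit when prompts change;
    # near-zero spread across very different scaffolds is the suspicious case.
    return spread < tol

print("possible contamination:", contamination_flag("some-model"))
```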
12 replies · 14 retweets · 252 likes · 39.1K views
Roman retweeted
Robert Youssef @rryssf
I just read this new paper that completely broke my brain 🤯

Researchers figured out how to transfer LoRA adapters between completely different AI models without any training data, and it works better than methods that require massive datasets.

It's called TITOK, and here's the wild part: instead of copying everything from the source model, they only transfer the tokens that actually matter. They do this by comparing the model with and without LoRA to find where the adapter adds real value. Think of it like this: if your tuned model is confident about a token but the base model isn't, that token contains the knowledge you want to transfer.

The results are insane:
+8% better than vanilla models
+6% better than traditional knowledge distillation
+4.4% better than TransLoRA

And it works across wild scenarios:
→ Mistral to Llama (different families)
→ 3B to 8B models (different sizes)
→ Llama 2 to Llama 3 (different versions)

The killer advantage? No extra models needed. TransLoRA requires training a separate discriminator just to filter synthetic data. TITOK uses the source model itself to identify important tokens.

Even crazier: they handle different tokenizers automatically. When models split text differently, their algorithm aligns the tokens and propagates importance scores across the gap.

This isn't just academic. Every time a new model drops, your fine-tuned adapters become obsolete. TITOK means you can migrate that hard-won knowledge to any new backbone in hours, not weeks. We just went from "each model needs its own adapter" to "knowledge flows freely between models."

The efficiency gains alone make this commercially critical. But the real breakthrough is proving you can transfer reasoning capabilities through selective token-level supervision.

Paper: arxiv.org/abs/2510.04682
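A hedged sketch of the token-selection idea (one reading of the description above, not the TITOK implementation): score each target token with and without the source LoRA and keep the tokens where the adapter adds confidence, to be used as distillation targets for the new backbone. All model scores here are fabricated stubs for illustration.

```python
# Select the tokens where the LoRA-tuned source model is markedly more
# confident than the base model; only those tokens carry adapter knowledge.

def token_logprobs(model: str, tokens: list) -> list:
    """Stand-in for scoring each target token under a model (hypothetical values)."""
    base = {"the": -0.5, "capital": -2.0, "of": -0.4, "France": -3.0, "is": -0.3, "Paris": -4.0}
    lora_boost = {"France": 2.0, "Paris": 3.2}   # tokens where the adapter "knows" more
    logprobs = [base.get(t, -1.0) for t in tokens]
    if model == "source+lora":
        logprobs = [lp + lora_boost.get(t, 0.0) for lp, t in zip(logprobs, tokens)]
    return logprobs

def select_informative_tokens(tokens, threshold=1.0):
    with_lora = token_logprobs("source+lora", tokens)
    without = token_logprobs("source", tokens)
    # Keep only tokens where the LoRA makes the source model markedly more confident.
    return [(t, w - wo) for t, w, wo in zip(tokens, with_lora, without) if w - wo > threshold]

tokens = ["the", "capital", "of", "France", "is", "Paris"]
print(select_informative_tokens(tokens))  # only these tokens supervise the target model
```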
38 replies · 108 retweets · 477 likes · 42.5K views
Roman retweeted
Rohan Paul @rohanpaul_ai
Really cool idea in this paper 💡

They propose that intelligence works by reusing stored inference loops, not recomputing every time. The system keeps a memory of how it solved problems before, then reuses that stored know-how instead of solving from scratch each time. The paper calls this Memory-Amortized Inference (MAI), where past solutions live as loops and get adapted to new inputs.

The core claim is nonergodicity, meaning the agent does not roam everywhere evenly; it keeps returning to useful regions. "Stored inference loops" are repeatable internal routines, a short sequence of states the model can re-enter that already sits near a good answer, so when a new input arrives it pulls the closest loop from memory and makes a small correction to fit the new case.

MAI runs a 2-step cycle: retrieval pulls a similar past state from memory, then a small update nudges it to fit the current context. Because the path closes back on itself like a loop, the internal state stays consistent over time and avoids drift. This cuts compute, keeps behavior stable, and bakes in a useful bias toward structures that worked before; think of it like reusing a trusted playbook and tweaking a couple of steps rather than writing a new one. This reuse works like a built-in preference for simple explanations, which shrinks uncertainty and cuts compute versus full retraining.

There is also a time link to Reinforcement Learning, which pushes value forward from rewards, while MAI reconstructs causes backward from memory. So planning can run forward, and inference can run backward, with both sides bootstrapping from partial information.

The authors map this onto cortical columns, with feedforward pathways doing the updates and feedback pathways doing the retrieval, matching predictive-coding-style loops.

The practical takeaway is energy-efficient inference: store stable computation loops, start each problem near them, then make a tiny correction.

Paper: arxiv.org/abs/2508.14143
Paper title: "Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation"
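A minimal sketch of that retrieve-then-correct cycle under toy assumptions (one reading of the summary above, not the paper's algorithm): a memory of (problem embedding, solution state) pairs, nearest-neighbor retrieval, then a few correction steps instead of solving from scratch.

```python
# Memory-amortized inference, toy version: start from the nearest stored
# solution and apply a small correction, instead of running the full solver.

import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Memory of past solved problems: key = problem embedding, value = solution state.
memory_keys = rng.normal(size=(50, dim))
memory_values = memory_keys * 2.0 + 0.1          # pretend solution == 2*problem + 0.1

def solve_from_scratch(problem: np.ndarray, iters: int = 200) -> np.ndarray:
    state = np.zeros(dim)
    for _ in range(iters):                        # expensive iterative inference
        state += 0.05 * ((2.0 * problem + 0.1) - state)
    return state

def solve_amortized(problem: np.ndarray, iters: int = 10) -> np.ndarray:
    nearest = np.argmin(np.linalg.norm(memory_keys - problem, axis=1))
    state = memory_values[nearest].copy()         # retrieval: start near a stored loop
    for _ in range(iters):                        # small correction for this case
        state += 0.05 * ((2.0 * problem + 0.1) - state)
    return state

new_problem = memory_keys[3] + 0.05 * rng.normal(size=dim)  # similar to a stored one
full = solve_from_scratch(new_problem)
fast = solve_amortized(new_problem)
print("difference:", np.linalg.norm(full - fast))  # small, with far fewer iterations
```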
22 replies · 52 retweets · 339 likes · 24K views
Roman retweeted
Jack 🤖 @JacklouisP
Designing a humanoid robot isn't merely about replicating the human form. It's about understanding the rationale behind each joint and selectively choosing what to replicate. What's the best way to do this? First-principles thinking. Here's a first-principles approach to mechanical design in humanoid robotics, on the Soft Robotics Podcast: @GoingBallistic5 @MarwaEldiwiny
13 replies · 25 retweets · 187 likes · 203.7K views