erenup

27 posts

@erenup1

Keep running

Joined July 2019
198 Following · 12 Followers
erenup @erenup1
@steipete @openclaw Anti-bot pass rate is another dimension for real-world tasks. LOL
0
0
0
19
erenup @erenup1
@steipete @openclaw We recently built ClawBench and found that GPT-5.4 still has a long way to go before it can complete realistic web tasks. huggingface.co/papers/2604.08… Failed GPT-5.4 traces, recordings, and screenshots are also visible at claw-bench.com/#trace/001-daily-life-food-uber-eats-gpt-5.4-2026-03-05-20260329-110905
1
0
2
280
Peter Steinberger 🦞 @steipete
Two experiments in the next @openclaw to address some "GPT is lazy" issues:
1) Strict mode: agents.defaults.embeddedPi.executionContract = "strict-agentic"
This tells GPT-5.x to keep working: read more code, call tools, make changes, or return a real blocker instead of stopping at “here’s the plan.”
docs.openclaw.ai/providers/open…
184
132
2.4K
399.6K
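The setting in the post above is written as a dotted key path. A minimal sketch of how such a path could map onto a nested configuration object, assuming each dot denotes one level of nesting; the actual OpenClaw config file format is not shown in the post, and `set_by_path` is a hypothetical helper, not part of OpenClaw:

```python
from typing import Any


def set_by_path(config: dict[str, Any], dotted_key: str, value: Any) -> dict[str, Any]:
    """Set a value in a nested dict, creating intermediate dicts as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Descend one level, creating an empty dict if the key is missing.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Key path and value are taken from the post; the nesting is an assumption.
config: dict[str, Any] = {}
set_by_path(config, "agents.defaults.embeddedPi.executionContract", "strict-agentic")
print(config)
```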
erenup retweeted
Aran Komatsuzaki @arankomatsuzaki
ClawBench: Can AI Agents Complete Everyday Online Tasks? A real-world benchmark for AI agents: 153 everyday online tasks across live websites (shopping, booking, job apps). Even top models struggle, dropping from ~70% on sandbox benchmarks to as low as 6.5% here.
9
16
105
45.8K
erenup retweeted
Zhuofeng Li @zhuofengli96475
🚀 The OpenResearcher paper is finally released! 🔥 We explore how to synthesize long-horizon research trajectories for deep-research agents: fully offline, scalable, and low-cost, without relying on live web APIs.
📄 huggingface.co/papers/2603.20…
🧩 Two key ideas:
Offline Corpus: one-time bootstrapping seeds 10K gold passages + a 15M-doc FineWeb corpus. 📚
Explicit Browsing Primitives: just 3 ops, search / open / find. The agent learns not just what to retrieve, but how to inspect docs and localize evidence at multiple scales. 🔎
📊 Results: 54.8% on BrowseComp-Plus with our 30B-A3B, #1 among open-source models under the same search engine setup, beating much larger models like GPT-4.1, Claude-Opus-4, Gemini-2.5-Pro, and DeepSeek-R1.
💡 Insights: Beyond accuracy, we dissect deep-research pipeline design, from data filtering and agent configuration to retrieval accuracy dynamics (RQ1-RQ5).
Try it yourself:
🛠️ Code: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT
Quoting Dongfu Jiang @DongfuJiang: the OpenResearcher announcement, which appears in full in the next retweet.
11
60
309
46.2K
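The three browsing primitives named in the post (search / open / find) can be illustrated with a toy, in-memory sketch. This is not OpenResearcher's implementation: the corpus contents, the word-overlap ranking, and the function names `search`, `open_doc`, and `find` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str


# Tiny in-memory stand-in for an offline corpus (hypothetical contents).
CORPUS = {
    "d1": Doc("d1", "The Eiffel Tower is in Paris.\nIt was completed in 1889."),
    "d2": Doc("d2", "Mount Fuji is the tallest mountain in Japan."),
}


def search(query: str, k: int = 2) -> list[str]:
    """Rank documents by how many query words they contain; return doc ids."""
    words = set(query.lower().split())
    scored = []
    for doc in CORPUS.values():
        overlap = sum(1 for w in words if w in doc.text.lower())
        if overlap:
            scored.append((overlap, doc.doc_id))
    # Highest overlap first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def open_doc(doc_id: str) -> str:
    """Return the full text of one document."""
    return CORPUS[doc_id].text


def find(doc_id: str, pattern: str) -> list[str]:
    """Return the lines of a document that mention the pattern."""
    return [
        line
        for line in open_doc(doc_id).splitlines()
        if pattern.lower() in line.lower()
    ]


hits = search("Eiffel Tower completed")
evidence = find(hits[0], "completed")
print(hits, evidence)
```

The point of separating the three steps is that the agent first narrows to candidate documents, then reads one, then localizes the supporting line, rather than answering from retrieved snippets alone.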
erenup retweeted
Dongfu Jiang @DongfuJiang
🚀 Introducing OpenResearcher: a fully offline pipeline for synthesizing 100+ turn deep-research trajectories, with no search/scrape APIs, no rate limits, and no nondeterminism.
💡 We use GPT-OSS-120B + a local retriever + a 10T-token corpus to generate long-horizon tool-use traces (search → open → find) that look like real browsing, but are free + reproducible.
📈 The payoff: SFT on these trajectories turns Nemotron-3-Nano-30B-A3B from 20.8% → 54.8% accuracy on BrowseComp-Plus (+34.0).
🧩 What makes it work?
🔎 Offline corpus = 15M FineWeb docs + 10K “gold” passages (bootstrapped once)
🧰 Explicit browsing primitives = better evidence-finding than “retrieve-and-read”
🎯 Rejection sampling = keep only successful long-horizon traces
🧵 And we’re releasing everything:
✅ code + search engine + corpus recipe
✅ 96K-ish trajectories + eval logs
✅ trained models + live demo
👨‍💻 GitHub: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
🔎 Eval logs: huggingface.co/datasets/OpenR…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT
30
207
1.3K
145.3K
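The filtering step described in the post (keep only successful traces) might look roughly like the sketch below. The post does not say how success is judged, so the case-insensitive exact match against a gold answer is an assumption, and `Trajectory` and `rejection_sample` are hypothetical names. The last line simply rechecks the reported gain.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    question: str
    steps: list[str]  # tool calls such as "search", "open", "find"
    final_answer: str


def rejection_sample(trajectories, gold_answers):
    """Keep only trajectories whose final answer matches the gold answer."""
    kept = []
    for t in trajectories:
        gold = gold_answers.get(t.question)
        # Assumed success criterion: case-insensitive exact match.
        if gold is not None and t.final_answer.strip().lower() == gold.strip().lower():
            kept.append(t)
    return kept


# Hypothetical sample data, not taken from the released trajectories.
trajectories = [
    Trajectory("capital of France?", ["search", "open", "find"], "Paris"),
    Trajectory("capital of France?", ["search"], "Lyon"),
    Trajectory("tallest mountain in Japan?", ["search", "open"], "mount fuji"),
]
gold_answers = {
    "capital of France?": "Paris",
    "tallest mountain in Japan?": "Mount Fuji",
}

kept = rejection_sample(trajectories, gold_answers)

# Reported accuracy gain on BrowseComp-Plus, in percentage points.
improvement = round(54.8 - 20.8, 1)
print(len(kept), improvement)
```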
erenup @erenup1
@srush_nlp Dear Professor Rush, this is a great retrieval model! I’d like to support training at scale with a few GPU machines. Please contact me if I can help. Thank you.
0
0
2
221
erenup @erenup1
@winglian I am testing DBRX on some benchmarks and finding it really good. I may be able to share some compute from my side. Is one node of 8x 80GB H100 or A100 GPUs without NVLink enough for this?
0
0
1
95
Wing Lian (caseus) @winglian
Basic Qwen2 MoE LoRA support including multipack is now in Axolotl. Can you all take a break from releasing new models please? 😅 Or at least coordinate and stagger them a bit further apart. DBRX is next on the list, but it's large enough that a compute sponsor would be nice. 🙏🏽
6
2
60
5.5K
erenup @erenup1
@virattt Could you please share the same experiments using the open-source command-r model on Hugging Face, along with open-source embedding and reranking models? Thank you!
0
0
0
440
Virat Singh @virattt
Financial RAG Evaluation 🕵️
I added reranking to the pipeline today. As expected, command-r performed even better.
Main takeaways:
• command-r excels at RAG
• cohere reranking is seriously fast
• gpt-3.5 slow at reranking, fine without
Experiment setup:
• included reranking
• improved prompts
• evaluated answer correctness
• measured RAG pipeline speed
For answer correctness, I used ragas. The final score is avg of gpt-4 and opus scores.
For speed, I used avg time of end-to-end RAG pipeline including vector DB query, reranking, answering.
Given that scoring is done by LLMs, the output is probabilistic. Trend and range more important than specific number.
Dataset details:
• 100 questions on Airbnb 2023 10-K
• synthetically generated using ragas
I will release public version of dataset soon. For now, can generate using my colab code.
Upcoming experiments:
• haiku RAG pipeline
• mistral RAG pipeline
• multiple 10-Ks
• chunk optimization
• function calling
• what else?
Really cool to see that command-r is both faster and better than its counterpart at RAG. Nice work @cohere
9
30
241
43.7K
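The two aggregate numbers described in the post above (a correctness score averaged over two LLM judges, and a mean end-to-end latency) can be sketched as follows. The function names, stage keys, and all sample values are hypothetical; this does not call ragas and is not the author's colab code.

```python
from statistics import mean


def final_correctness(gpt4_score: float, opus_score: float) -> float:
    """Average the correctness scores assigned by the two judge models."""
    return (gpt4_score + opus_score) / 2


def average_latency(timings: list[dict[str, float]]) -> float:
    """Mean end-to-end time per question: vector DB query + reranking + answering."""
    return mean(t["vector_query"] + t["rerank"] + t["answer"] for t in timings)


# Hypothetical per-question stage timings in seconds.
timings = [
    {"vector_query": 0.25, "rerank": 0.5, "answer": 1.25},
    {"vector_query": 0.5, "rerank": 0.5, "answer": 2.0},
]

score = final_correctness(0.5, 0.75)
latency = average_latency(timings)
print(score, latency)
```

Because both judges are LLMs, repeated runs can give slightly different scores, which is why the post stresses trend and range over any single number.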
erenup @erenup1
@AkariAsai Thank you very much for this reference.
0
0
0
33
erenup @erenup1
@bclavie How about end-to-end examples going from LangChain to a reranker to command-r with JSON mode? Answers with grounded references in a structured format?
0
0
0
86
Ben Clavié @bclavie
Document reranking is powerful, but daunting to get started with. Moreover, trying a new approach requires modifying your pipeline, even though it does the same thing! Introducing 🔧rerankers: a lightweight library to provide a unified way to use various reranking methods🧵1/?
18
59
420
91.9K
erenup @erenup1
@carrigmat @huggingface Do you have a link to your IFEval scores? Thank you. IFEval has 4 settings, and I’d like to know which one he is using.
1
0
1
147
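The "4 settings" mentioned above most likely refer to IFEval's four reported accuracies: prompt-level and instruction-level, each under strict and loose checking. A minimal sketch of how those four numbers relate, assuming the per-instruction pass/fail flags have already been computed; the input layout and `ifeval_metrics` are hypothetical, not the official evaluation script:

```python
def ifeval_metrics(results: list[dict[str, list[bool]]]) -> dict[str, float]:
    """Compute prompt-level and instruction-level accuracy for strict and loose checks.

    Each entry in `results` is one prompt, holding per-instruction pass flags
    under "strict" and "loose" checking.
    """
    metrics = {}
    for mode in ("strict", "loose"):
        per_prompt = [r[mode] for r in results]
        total_instructions = sum(len(flags) for flags in per_prompt)
        # Prompt-level: a prompt counts only if every instruction in it was followed.
        metrics[f"prompt_level_{mode}"] = sum(all(flags) for flags in per_prompt) / len(per_prompt)
        # Instruction-level: fraction of all individual instructions followed.
        metrics[f"instruction_level_{mode}"] = sum(sum(flags) for flags in per_prompt) / total_instructions
    return metrics


# Hypothetical results for two prompts (three instructions in total).
results = [
    {"strict": [True, False], "loose": [True, True]},
    {"strict": [True], "loose": [True]},
]
metrics = ifeval_metrics(results)
print(metrics)
```

The four numbers can differ noticeably for the same model, so a score is hard to compare without knowing which setting produced it.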
Matthew Carrigan @carrigmat
Hey! Are you using chat models on @huggingface like:
- LLaMA
- Mi(s/x)tral
- Falcon
- Zephyr
- Phi
Do you want massive performance gains? Then you should be using chat templates! The guide is here: huggingface.co/docs/transform… (Thanks to Daniel Furman for the table)
7
49
340
48.2K
erenup @erenup1
@PMinervini I have tried some tricky multilingual cases. It seems to understand them!
0
0
1
42
erenup @erenup1
2. I have also implemented a simple Triton backend to help you adopt it quickly. See this PR for more details: github.com/triton-inferen…
0
0
0
122
erenup @erenup1
Super excited to contribute some old-style (BERT/RoBERTa) model features to NVIDIA TensorRT-LLM. You can leverage TensorRT-LLM to accelerate your NLP pipelines if you are using XLM-RoBERTa, RoBERTa, BERT, etc.
2
0
0
168