erenup (@erenup1) - Twitter Profile

erenup@erenup1·12 Apr

@steipete @openclaw Anti-bot pass rate is another dimension for real-world tasks. LOL

0

19

erenup@erenup1·12 Apr

@steipete @openclaw We recently built ClawBench and found that GPT-5.4 still has a long way to go before it can complete realistic web tasks. huggingface.co/papers/2604.08… Failed GPT-5.4 traces, recordings, and screenshots are also visible at claw-bench.com/#trace/001-daily-life-food-uber-eats-gpt-5.4-2026-03-05-20260329-110905

1

0

2

280

Peter Steinberger 🦞@steipete·12 Apr

Two experiments in the next @openclaw to address some "GPT is lazy" issues: 1) Strict mode: agents.defaults.embeddedPi.executionContract = "strict-agentic" This tells GPT-5.x to keep working: read more code, call tools, make changes, or return a real blocker instead of stopping at “here’s the plan.” docs.openclaw.ai/providers/open…

184

132

2.4K

399.6K

erenup retweeted

Aran Komatsuzaki@arankomatsuzaki·10 Apr

ClawBench: Can AI Agents Complete Everyday Online Tasks? A real-world benchmark for AI agents: 153 everyday online tasks across live websites (shopping, booking, job apps). Even top models struggle—dropping from ~70% on sandbox benchmarks to as low as 6.5% here.

9

16

105

45.8K

erenup retweeted

Wenhu Chen@WenhuChen·10 Apr

Super excited to share our ClawBench to test real-world tasks. Check out our website at claw-bench.com

1

7

63

22.7K

erenup retweeted

Zhuofeng Li@zhuofengli96475·24 Mar

🚀 OpenResearcher paper is finally released! 🔥 We explore how to synthesize long-horizon research trajectories for deep-research agents: fully offline, scalable, and low-cost, without relying on live web APIs.
📄 huggingface.co/papers/2603.20…
🧩 Two key ideas:
Offline Corpus: One-time bootstrapping seeds 10K gold passages + 15M-doc FineWeb corpus. 📚
Explicit Browsing Primitives: Just 3 ops: search / open / find. The agent learns not just what to retrieve, but how to inspect docs and localize evidence at multiple scales. 🔎
📊 Results: 54.8% on BrowseComp-Plus with our 30B-A3B, #1 open-source under the same search engine setup. Beating much larger models like GPT-4.1, Claude-Opus-4, Gemini-2.5-Pro, and DeepSeek-R1.
💡 Insights: Beyond accuracy, we dissect deep research pipeline design, from data filtering and agent configuration to retrieval accuracy dynamics (RQ1-RQ5).
Try it yourself:
🛠️ Code: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT

11

60

309

46.2K

erenup retweeted

Dongfu Jiang@DongfuJiang·9 Feb

🚀 Introducing OpenResearcher: a fully offline pipeline for synthesizing 100+ turn deep-research trajectories, with no search/scrape APIs, no rate limits, no nondeterminism.
💡 We use GPT-OSS-120B + a local retriever + a 10T-token corpus to generate long-horizon tool-use traces (search → open → find) that look like real browsing, but are free + reproducible.
📈 The payoff: SFT on these trajectories turns Nemotron-3-Nano-30B-A3B from 20.8% → 54.8% accuracy on BrowseComp-Plus (+34.0).
🧩 What makes it work?
🔎 Offline corpus = 15M FineWeb docs + 10K “gold” passages (bootstrapped once)
🧰 Explicit browsing primitives = better evidence-finding than “retrieve-and-read”
🎯 Reject sampling = keep only successful long-horizon traces
🧵 And we’re releasing everything:
✅ code + search engine + corpus recipe
✅ 96K-ish trajectories + eval logs
✅ trained models + live demo
👨‍💻 GitHub: github.com/TIGER-AI-Lab/O…
🤗 Models & data: huggingface.co/collections/TI…
🚀 Demo: huggingface.co/spaces/OpenRes…
🔎 Eval logs: huggingface.co/datasets/OpenR…
#llms #agentic #deepresearch #tooluse #opensource #retrieval #SFT

30

207

1.3K

145.3K

erenup@erenup1·10 Nov

@PMinervini @emnlpmeeting @yuzhaouoe @devoto_alessio @_joestacey_ 🎉🎉🎉🥳

0

41

Pasquale Minervini@PMinervini·10 Nov

My amazing collaborators will be presenting three papers next week at EMNLP 2024! (@emnlpmeeting ) -- I wrote a blog post about our EMNLP papers and some of the other projects we're brewing 🚀🙂 neuralnoise.com/2024/nov-resea…, with @yuzhaouoe @devoto_alessio @_joestacey_

2

8

61

4.1K

erenup@erenup1·5 Oct

@PMinervini @emnlpmeeting @devoto_alessio @yuzhaouoe @s_scardapane @_joestacey_ @oanacamb @MarekRei Congrats on all the amazing work!

0

4

98

Pasquale Minervini@PMinervini·20 Sep

My amazing collaborators will be presenting 3/3 papers at EMNLP 2024! 🚀🚀🚀 @emnlpmeeting
Crazy simple KV Cache Compression: arxiv.org/abs/2406.11430 @devoto_alessio @yuzhaouoe @s_scardapane
Explainable Neuro-Symbolic NLI: arxiv.org/abs/2305.13214 @_joestacey_ @oanacamb @MarekRei
Mixtures of Retrieval Augmented Generators: [COMING SOON!] @erenup1

4

12

82

9.4K

erenup@erenup1·5 Oct

@srush_nlp Dear Professor Rush, this is a great retrieval model! I’d like to support training at scale with a few GPU machines. Please contact me if I can help. Thank you.

0

2

221

Sasha Rush@srush_nlp·4 Oct

Jack got obsessed with what a neural version of TF/IDF would be. He came up with an elegant solution. (If you have GPUs we would love to try it at scale.)

Jack Morris@jxmnop

We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world. today, we're releasing the model on HuggingFace, along with the paper on ArXiv. I think our release marks a paradigm shift for text retrieval. let me tell you why👇

6

42

535

62.7K

erenup@erenup1·30 Mar

@winglian I am testing DBRX on some benchmarks and finding it really good. I may be able to share some compute from my side. Is one node of 8x80GB H100 or A100 without NVLink enough for this?

0

1

95

Wing Lian (caseus)@winglian·29 Mar

Basic Qwen2 MoE LoRA support including multipack is now in Axolotl. Can you all take a break from releasing new models please? 😅 Or at least coordinate and stagger them a bit further apart. DBRX is next on the list, but it's large enough that a compute sponsor would be nice. 🙏🏽

6

2

60

5.5K

erenup@erenup1·21 Mar

@virattt Could you please share the same experiments with the open-source command-r model on Hugging Face, plus open-source embedding and reranking models? Thank you!

0

440

Virat Singh@virattt·21 Mar

Financial RAG Evaluation 🕵️ I added reranking to the pipeline today. As expected, command-r performed even better.
Main takeaways:
• command-r excels at RAG
• cohere reranking is seriously fast
• gpt-3.5 slow at reranking, fine without
Experiment setup:
• included reranking
• improved prompts
• evaluated answer correctness
• measured RAG pipeline speed
For answer correctness, I used ragas. The final score is avg of gpt-4 and opus scores. For speed, I used avg time of end-to-end RAG pipeline including vector DB query, reranking, answering. Given that scoring is done by LLMs, the output is probabilistic. Trend and range more important than specific number.
Dataset details:
• 100 questions on Airbnb 2023 10-K
• synthetically generated using ragas
I will release public version of dataset soon. For now, can generate using my colab code.
Upcoming experiments:
• haiku RAG pipeline
• mistral RAG pipeline
• multiple 10-Ks
• chunk optimization
• function calling
• what else?
Really cool to see that command-r is both faster and better than its counterpart at RAG. Nice work @cohere

9

30

241

43.7K

erenup@erenup1·19 Mar

@AkariAsai thank you very much for this reference.

0

33

Akari Asai@AkariAsai·19 Mar

@erenup1 We discuss recent work conducting "retrieval-aware" instruction-tuning in the paper as one of the promising directions to further advance RAG! This line includes other work such as RA-DIT arxiv.org/abs/2310.01352, SAIL arxiv.org/abs/2305.15225 and Self-RAG arxiv.org/abs/2310.11511

1

2

210

Akari Asai@AkariAsai·18 Mar

Recently I gave a lecture about retrieval-augmented LMs like RAG, covering their advantages, an overview of diverse methods, and current limitations & opportunities, based on this position paper. akariasai.github.io/assets/pdf/aka… video: shorturl.at/ahmq8 Feedback is welcomed :)

Akari Asai@AkariAsai

𝗛𝗼𝘄 𝗰𝗮𝗻 𝘄𝗲 𝗯𝘂𝗶𝗹𝗱 𝗺𝗼𝗿𝗲 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗟𝗠-𝗯𝗮𝘀𝗲𝗱 𝘀𝘆𝘀𝘁𝗲𝗺𝘀? Our new position paper advocates for retrieval-augmented LMs (RALMs) as the next gen. of LMs, exploring the promises, limitations, and a roadmap for wider adoption. arxiv.org/abs/2403.03187 🧵

6

51

281

37K

erenup@erenup1·15 Mar

@bclavie How about end-to-end examples, from LangChain to a reranker to Command R and JSON mode? Answers with grounded references in a structured way?

0

86

Ben Clavié@bclavie·14 Mar

Document reranking is powerful, but daunting to get started with. Moreover, trying a new approach requires modifying your pipeline, even though it does the same thing! Introducing 🔧rerankers: a lightweight library to provide a unified way to use various reranking methods🧵1/?

18

59

420

91.9K

erenup@erenup1·23 Feb

@osanseviero @huggingface How can I speed this up using GPU resources on HF? Thank you.

0

25

Omar Sanseviero@osanseviero·22 Feb

Gemma with @huggingface serverless API and using OpenAI Messages API. No need to format your prompts anymore! gist.github.com/osanseviero/a1…

4

26

136

13K

erenup@erenup1·21 Feb

@carrigmat @huggingface Do you have a link to your IFEval score? Thank you. IFEval has 4 settings; I’d like to know which one is being used.

1

0

1

147

Matthew Carrigan@carrigmat·20 Feb

Hey! Are you using chat models on @huggingface like:
- LLaMA
- Mi(s/x)tral
- Falcon
- Zephyr
- Phi
Do you want massive performance gains? Then you should be using chat templates! The guide is here: huggingface.co/docs/transform… (Thanks to Daniel Furman for the table)

7

49

340

48.2K

erenup@erenup1·9 Feb

@PMinervini I have tried some tricky multilingual cases. It seems it can understand!

0

1

42

Pasquale Minervini@PMinervini·9 Feb

finally a truly responsible and provably safe AI model! goody2.ai

5

0

18

1.8K

erenup@erenup1·31 Jan

2. I have also implemented a simple Triton backend for quick adoption. See this PR for more details: github.com/triton-inferen…

0

122

erenup@erenup1·31 Jan

Super excited to contribute some old-style (BERT/RoBERTa) model features to NVIDIA TensorRT-LLM. You can leverage TensorRT-LLM to accelerate your NLP pipelines if you are using XLM-RoBERTa, RoBERTa, BERT, etc.

2

0

168

erenup@erenup1·31 Jan

1. The original PR is here, so you can learn more about the difference between RoBERTa and BERT. github.com/NVIDIA/TensorR…

0

49
