Thinh

@thinhphp_vt · 26 posts

PhD student @VT_CS, supervised by @tuvllms. Interested in search-augmented LLMs. Ex AI resident @VinAI_Research

Blacksburg, VA · Joined July 2023
598 Following · 94 Followers
Thinh @thinhphp_vt
🔥Our paper "SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models" has been accepted to #ICLR 2026!! 🎉🎉🎉 Huge thanks to my supervisor @tuvllms and the other co-authors for all your hard work! See you in Brazil ✈️
3 replies · 5 reposts · 33 likes · 1.8K views
Thinh retweeted

alex zhang @a1zhang
Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs). It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs! Our full paper on RLMs is now available—with much more expansive experiments compared to our initial blogpost from October 2025! arxiv.org/pdf/2512.24601
253 replies · 1.1K reposts · 7.4K likes · 2M views
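The core RLM idea in the tweet — a model treating its own prompt as an object it manipulates by writing code that invokes LLMs — can be sketched as a toy recursion. This is my own minimal illustration, not the paper's method; `call_llm` is a hypothetical stand-in for a real LLM API call.

```python
# Toy sketch of the Recursive Language Model (RLM) idea: a huge prompt is
# treated as an external object, split into pieces, and each piece is
# handled by a sub-call to an LLM; partial results are then combined.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    # For illustration we "summarize" by keeping the first 40 characters.
    return prompt[:40]

def recursive_answer(prompt: str, max_chunk: int = 100) -> str:
    # Base case: the prompt is small enough to process directly.
    if len(prompt) <= max_chunk:
        return call_llm(prompt)
    # Recursive case: split the prompt, process each half with a
    # sub-call, then combine the partial results with a final call.
    mid = len(prompt) // 2
    left = recursive_answer(prompt[:mid], max_chunk)
    right = recursive_answer(prompt[mid:], max_chunk)
    return call_llm(left + " " + right)
```

The point of the sketch is only the control flow: the outer "model" never has to fit the whole prompt into one context window, because every `call_llm` invocation sees at most a bounded slice.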
Thinh retweeted

Kimi.ai @Kimi_Moonshot
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here.
🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window
Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns.
K2 Thinking is now live on kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API.
🔌 API is live: platform.moonshot.ai
🔗 Tech blog: moonshotai.github.io/Kimi-K2/thinki…
🔗 Weights & code: huggingface.co/moonshotai
581 replies · 1.5K reposts · 9.7K likes · 4.8M views
Thinh retweeted

Sentient @SentientAGI
Announcing ROMA (Recursive Open Meta Agent): our new multi-agent framework that sets SOTA in reasoning + search.
Seal-0: 45.6%
FRAMES: 81.7%
SimpleQA: 93.9%
🧵 Read more about how recursive coordination lets agents tackle complex queries.
764 replies · 598 reposts · 2.4K likes · 505K views
Thinh retweeted

Rohan Paul @rohanpaul_ai
OpenAI released a new paper: "Why language models hallucinate".
Short answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty. The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest "I don't know" responses.
The fix is to grade differently: give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing.
OpenAI shows that 52% abstention gives substantially fewer wrong answers than 1% abstention, demonstrating that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower.
Abstention means the model refuses to answer when it is unsure and simply says something like "I don't know" instead of making up a guess.
Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers.
🧵 Read on 👇
96 replies · 327 reposts · 2.4K likes · 371.5K views
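The grading change the tweet describes — credit for abstaining, a penalty for confident wrong answers — is easy to see with toy numbers. This sketch uses my own illustrative scoring weights and made-up model behavior (two hypothetical models over 100 questions, each correct on 40 of the questions it answers), not the paper's exact metric.

```python
# Minimal sketch of abstention-aware grading: +1 for a correct answer,
# 0 for abstaining ("I don't know"), -1 for a confident wrong guess.
# Under this rubric, blind guessing no longer dominates honest abstention.

def grade(answers):
    # `answers` is a list of (correct, abstained) pairs per question.
    score = 0.0
    for correct, abstained in answers:
        if abstained:
            score += 0.0          # abstention: no credit, no penalty
        elif correct:
            score += 1.0          # correct answer: full credit
        else:
            score -= 1.0          # confident wrong answer: penalized
    return score

# Hypothetical models over 100 questions, both correct on 40:
# - guesser: abstains on 1 question, guesses wrong on 59
# - cautious: abstains on 52 questions, wrong on only 8
guesser  = [(True, False)] * 40 + [(False, True)] * 1  + [(False, False)] * 59
cautious = [(True, False)] * 40 + [(False, True)] * 52 + [(False, False)] * 8

print(grade(guesser))   # -19.0: heavily penalized for 59 wrong guesses
print(grade(cautious))  #  32.0: abstaining avoids most of the penalty
```

Note the cautious model wins despite a lower raw accuracy (40/100 answered correctly either way): the rubric, not the model, is what changes the incentive.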
Thinh retweeted

Ken Liu @kenziyuliu
New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
15 replies · 76 reposts · 370 likes · 67.1K views
Thinh retweeted

basvanopheusden @basvanopheusden
A few weeks ago, I started a new job at @OpenAI. I wrote a document about my interview process and recommendations for anyone on the job market for AI research positions. I hope it's helpful! docs.google.com/document/d/1ZV…
64 replies · 343 reposts · 4.1K likes · 335.7K views
Thinh retweeted

Sheryl Hsu @SherylHsu02
1/n I’m thrilled to share that our @OpenAI reasoning system scored high enough to achieve gold 🥇🥇 in one of the world’s top programming competitions - the 2025 International Olympiad in Informatics (IOI) - placing first among AI participants! 👨‍💻👨‍💻
198 replies · 288 reposts · 2.7K likes · 2.5M views
Thinh retweeted

Intelligent Internet @ii_posts
Most search models need the cloud. II-Search-4B doesn't: a 4B model tuned for reasoning with search tools, built for local use, matching the performance of models 10× its size. Search that is small, smart, and open.
21 replies · 106 reposts · 647 likes · 499.9K views
Thinh retweeted

Peter H. Diamandis, MD @PeterDiamandis
.@EMostaque came back on the show to chat about:
-- how we can't compete against AI agents
-- his solution for a POSITIVE AI world
-- why UBI won't work but UBAI might
-- we need to be focused on incentivizing the right outcomes
-- nations need sovereign AI stacks or they'll be left behind by the mega-models
24 replies · 63 reposts · 244 likes · 27.5K views
Thinh retweeted

Jasper Dekoninck @j_dekoninck
We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but is still unlikely to achieve the bronze medal with its 31% score (13/42). 🧵(1/4)
13 replies · 39 reposts · 218 likes · 37.3K views
Thinh retweeted

Sukjun (June) Hwang @sukjun_hwang
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
95 replies · 737 reposts · 4.7K likes · 786.6K views
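The dynamic-chunking idea in the H-Net tweet — replacing a fixed tokenizer with boundaries decided inside the model — can be illustrated with a toy splitter. This is not the actual H-Net: `boundary_score` here is a hypothetical placeholder heuristic standing in for a learned boundary predictor.

```python
# Toy sketch of dynamic chunking: instead of a fixed tokenizer, a scorer
# decides where byte-level chunk boundaries go, so the model operates
# over data-dependent units rather than a static vocabulary.

def boundary_score(prev: int, cur: int) -> float:
    # Placeholder heuristic: put a boundary right after whitespace bytes.
    # In H-Net this score would come from a learned network.
    return 1.0 if chr(prev).isspace() else 0.0

def dynamic_chunks(data: bytes, threshold: float = 0.5) -> list[bytes]:
    chunks, start = [], 0
    for i in range(1, len(data)):
        # Cut the stream wherever the scorer exceeds the threshold.
        if boundary_score(data[i - 1], data[i]) > threshold:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # final chunk up to end of stream
    return chunks

print(dynamic_chunks(b"hello world foo"))
# With the whitespace heuristic: [b'hello ', b'world ', b'foo']
```

With a learned scorer in place of the heuristic, the same loop discovers units that depend on the data itself, which is the end-to-end property the tweet highlights.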