Joschka Strueber @ Tuebingen AI Center🇪🇺

36 posts

@JoschkaStrueber

PhD student in https://t.co/U3NlKbAHiM at @uni_tue and @MPI_IS (IMPRS-IS). LLM multi-turn post-training and evaluations.

Joined February 2025
90 Following · 28 Followers
Pinned Tweet
Joschka Strueber @ Tuebingen AI Center🇪🇺
🚨🚨🚨 The second full paper of my PhD is done. We introduce a scalable approach for credit assignment in multi-turn information seeking called ΔBelief-RL. More stable training and better generalization than outcome-based GRPO. 🚨🚨🚨
Ilze Amanda Auzina @AmandaIlze

How can agents learn in long, open-ended tasks where success is rare and rewards are sparse? 👀 🚨 Enter ∆Belief-RL: we show how to use the agent's own belief updates as a dense reward for turn-level credit assignment. The result? Surprisingly strong generalization. (1/8) 🧵⬇️
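The idea in the thread can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: assume that after every turn we can read off the agent's belief, i.e. the probability it assigns to the correct answer, and credit each turn by how much it moved that belief.

```python
# Hypothetical sketch of belief-delta rewards for turn-level credit assignment,
# in the spirit of the thread (not the authors' actual implementation).
# Assumption: after each turn we can query the agent's probability of the
# correct answer (its "belief"); the dense reward for turn t is the belief gain.

def belief_delta_rewards(beliefs):
    """beliefs: list of P(correct answer) after turns 0..T (beliefs[0] is the prior)."""
    return [beliefs[t + 1] - beliefs[t] for t in range(len(beliefs) - 1)]

rewards = belief_delta_rewards([0.10, 0.15, 0.60, 0.55, 0.95])
# Turns that raise the agent's belief get positive credit; the fourth turn,
# which lowered the belief, is penalized.
```

A nice property of this shaping is that the turn rewards telescope: they sum to the total belief gain over the episode, so the dense signal stays consistent with the sparse outcome.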

Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Patrik Reizinger @rpatrik96
Unsupervised skill discovery methods have achieved remarkable improvement in the last few years. Why is that the case? In our @iclr_conf paper, we use identifiability theory to explain it, by proving that Contrastive Successor Features learns the ground truth states. A thread.
Joschka Strueber @ Tuebingen AI Center🇪🇺
We're running a short (~10 min) survey to map these problems more systematically. Which archives do you use? What languages do your sources appear in? Where does discovery actually get stuck? Your answers directly shape what we build next. 🔗 tally.so/r/ja7qN9
Joschka Strueber @ Tuebingen AI Center🇪🇺
If you've ever spent days searching five archives in three languages only to come up empty: we want to understand that better. We're researchers at the Tübingen AI Center studying how historians find primary sources and where the process breaks down. 🔗 archivum.umso.co
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Babak Rahmani @babakRmni
🧵 Debugging Code World Models. A few months ago we started studying CWMs. The plan was post-training an LLM on code execution traces. Two weeks in, we realised a paper by Meta had already done much of this: arxiv.org/pdf/2510.02387. However, we identified what's wrong with them!
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
hallerite @hallerite
Happy to finally share what I have been working on for some time now. Introducing »Ludic« – an LLM-RL library for the era of experience. While there are now a lot of LLM-RL codebases, even many good ones, I want to share my very idiosyncratic way to think about LLM-RL.
Fanqing Meng @FanqingMengAI
I think it is not new... In DeepSeek 3.2, they use expert RL -> joint SFT -> joint RL. In LongCat, they use expert RL -> model mr -> joint RL. MiMo replaces SFT with OPD. Everyone knows OPD is better than SFT :)
Joschka Strueber @ Tuebingen AI Center🇪🇺
@dhtikna @FanqingMengAI If the model you train is smaller than the model you distill from, OPD is usually much cheaper than SFT. The cost of sampling off-policy data from the bigger model should be much higher than using that same bigger model to compute logprobs on on-policy data sampled by the cheaper student.
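The cost argument can be made concrete. In on-policy distillation the cheap student samples the sequences and the big teacher only runs a forward pass to score them; the per-token training signal is a divergence between the two next-token distributions. A toy sketch, assuming a shared tokenizer and using reverse KL as one common choice of loss (illustrative, not any specific paper's objective):

```python
# Toy per-token OPD signal: KL(student || teacher) over the vocabulary.
# The student SAMPLES the sequence (cheap); the teacher only SCORES it
# (one forward pass of the big model, no expensive teacher-side sampling).
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position over the vocab."""
    p = softmax(student_logits)   # student's next-token distribution
    q = softmax(teacher_logits)   # teacher's next-token distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Minimizing this per token pushes the student toward the teacher exactly on states the student itself visits, which is the on-policy part of the argument above.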
Ankith 🐋/acc @dhtikna
@FanqingMengAI I think one counterpoint is that on-policy distillation just takes more compute and doesn't produce better results than DeepSeek's SFT + loss-recovering RL, which may be much cheaper (as the SFT part is off-policy). So it's not worth the engineering complexity plus the extra compute.
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Wieland Brendel @wielandbr
🚀 @ELLISInst_Tue will lead the AI development for Germany's new open-source nationwide Adaptive Intelligent System (AIS) learning platform for schools – and we're #hiring!

We are part of a national consortium led by Assecor and @KImachtSchule, and mandated by the FWU. Join us to push the boundaries of #AI in education and build the core intelligence that will power adaptive learning for millions of teachers and pupils in real classrooms. We're developing cutting-edge AI tutoring models, knowledge tracing, and recommendation systems to provide high-quality, personalised learning for all students, regardless of background.

🔹 Several full-time ML roles (2–3 years)
🔹 Top-tier AI research environment
🔹 From prototypes to production-ready solutions for national-scale impact
🔹 Open to on-site, hybrid & remote

👉 Apply now (quick & easy): forms.gle/XmLkwEDD45fY5c… (or via ais@tue.ellis.eu)

We work closely with our amazing collaborators at Assecor, @KImachtSchule (@stes_io, @SchulzAuguste), Tübingen AI Center (@MatthiasBethge), the Hector Research Institute of Education Sciences and Psychology (Ulrich Trautwein), the Tübingen Center for Digital Education (Andreas Lachner) and the @IWMtue (@ucress). Let's shape the future of AI-powered education together. 🔥

#AI #MachineLearning #EdTech #hiring #OpenSource #MLJobs
Joschka Strueber @ Tuebingen AI Center🇪🇺
@ShashwatGoel7 Reminds me a lot of the motivation for speculative decoding: most tokens follow naturally from the context and can be easily predicted by a small draft model. But every once in a while the model is at a crossroads and needs the full/RLVR-trained model to predict the best token.
Shashwat Goel @ ICLR'26 @ShashwatGoel7
Was wondering whether GRPO-style RL is "only a few tokens deep"... Intuitively, we take a next-token predictor and slightly upweigh some tokens, s.t. NTP leads to success. Found this interesting ICLR submission preliminarily indicating this hypothesis is true: openreview.net/forum?id=8vWIX…
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Ofir Press @OfirPress
We'll present AlgoTune today 4:30-7:30PM at Hall C,D,E #2514. It's a benchmark where LMs optimize the runtime of programs like gzip compression, AES encryption, ... Current LMs achieve 1% of their future potential on this; we're super excited to see how the competition unfolds!
Joschka Strueber @ Tuebingen AI Center🇪🇺
@ShashwatGoel7 @lateinteraction Great timing for this, one day before DSMath-V2 😄 Promising pipeline with a model that both generates and self-verifies solutions, in conjunction with a separate meta-verifier. Extending this seems possible, but getting domain-specific cold-start data and verifiers will be tricky?
Omar Khattab @lateinteraction
I still maintain that reflective learning is the future of learning algorithms. This is related to but quite a bit richer than thinking about making value functions that work.
Omar Khattab @lateinteraction

ok but does GEPA do it? Reflective prompt optimizers have somewhat solved the problem that "if the model can't guess the right answer, no learning ever happens". The model should be able to LOOK at its failures and be like "ok yeah, these were stupid guesses, gotta try x in hindsight"

Joschka Strueber @ Tuebingen AI Center🇪🇺
Lightning-fast work by my lab mates and my old supervisor Shyam 🔥 Barely two weeks after Cambrian-S came out, they show that the proposed benchmarks can be solved almost perfectly by a simple, non-video CLIP baseline.
Vishaal Udandarao @vishaal_urao

🚀 New paper! arxiv.org/abs/2511.16655 Recently, Cambrian-S released models & two benchmarks (VSR & VSC) for “spatial supersensing” in video! We found: 1️⃣ Simple no-frame baseline (NoSense) ~perfectly solves VSR! 2️⃣ Tiny sanity check collapses Cambrian-S perf to 0% on VSC! 🧵👇

Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Hardik Bhatnagar @hrdkbhatnagar
🚨 Breaking @WeiboLLM's VibeThinker 1.5B leads the Sober Reasoning leaderboard for its size Punching way above its weight -- outperforming even 32B models 🔥 Outstanding work, @WeiboLLM team!
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Shashwat Goel @ ICLR'26 @ShashwatGoel7
Presenting today at #ICML2025. To learn how to measure language model similarity, and its effects on LLM-as-a-judge and weak-to-strong distillation, join our poster session: today 11 am - 1:30 pm, East Exhibition Hall A-B, E-2411, w/ @AmyPrb @JoschkaStrueber @AmandaIlze
Shashwat Goel @ ICLR'26 @ShashwatGoel7

🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity (1) LLM-as-a-judge favor more similar models🤥 (2) Complementary knowledge benefits Weak-to-Strong Generalization☯️ (3) More capable models have more correlated failures 📈🙀 🧵👇

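The similarity notion in the quoted paper can be illustrated with a Cohen's-kappa-style sketch: agreement between two models beyond what their accuracies alone would predict. This simplified 0/1 version only conveys the idea; the paper's actual metric differs in detail.

```python
# Kappa-style "models think alike" score: observed per-question agreement,
# adjusted for the agreement expected if the two models erred independently.
# Illustrative sketch, not the paper's exact similarity metric.

def error_agreement_beyond_chance(correct_a, correct_b):
    """correct_a/correct_b: lists of 0/1 per question (1 = answered correctly)."""
    n = len(correct_a)
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # chance agreement under independent errors
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    return (observed - expected) / (1 - expected)
```

The score is 1 for models that succeed and fail on exactly the same questions, and near 0 when their errors are uncorrelated, which is what makes it usable as a similarity measure between judges and judged models.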
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Shashwat Goel @ ICLR'26 @ShashwatGoel7
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜 New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA etc., and even "multimodal" benchmarks like MMMU-Pro, which can be solved without even looking at the image ⁉️ Such choice-only shortcuts are hard to fix. We find prior attempts at fixing them (GoldenSwag for HellaSwag, and TruthfulQA v2) ended up worsening the problem.

MCQs are inherently a discriminative task, only requiring picking the correct choice among a few given options. Instead, we should evaluate language models for the generative capabilities they are used for. We show discrimination is easier than even verification, let alone generation.

🤔 But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers... We show a scalable alternative, Answer Matching, works surprisingly well. It's simple: get generative responses to existing benchmark questions that are specific enough to have a semantically unique answer, without showing choices. Then use an LM to match the response against the ground-truth answer.

👨‍🔬 We conduct a meta-evaluation by comparing to ground-truth verification on MATH, and human grading on MMLU-Pro and GPQA-Diamond questions. Answer Matching outcomes give near-perfect alignment, even with small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference answer, the model is tasked with verification, which is harder than what answer matching requires (paraphrase detection), a skill modern language models have aced 💡 Let's shift the benchmarking ecosystem from MCQs to Answer Matching.

Impacts:
Leaderboards: We show model rankings can change and accuracies go down, making benchmarks seem less saturated.
Benchmark creation: Instead of creating harder MCQs, we should focus our efforts on creating questions for answer matching, much like SimpleQA, GAIA etc.
🤑 Cost: Finally, to our great surprise, answer matching evals are cheaper to run than MCQs!

See our paper for more, it's packed with insights. 🧵 has the paper and more result figures.
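The Answer Matching protocol described above is mechanically simple. A minimal sketch, where the matcher model interface and prompt wording are illustrative assumptions, not taken from the paper:

```python
# Sketch of Answer Matching: grade a free-form response by asking a matcher LM
# whether it expresses the same answer as the ground truth. No choices are
# shown to either model. Prompt and interface are illustrative assumptions.

MATCH_PROMPT = """Question: {question}
Ground-truth answer: {reference}
Model response: {response}

Does the response express the same answer as the ground truth? Reply Yes or No."""

def answer_matches(question, reference, response, lm):
    """lm: fn(prompt) -> str. Returns True if the matcher judges the response
    to be a paraphrase of the reference answer."""
    verdict = lm(MATCH_PROMPT.format(question=question,
                                     reference=reference,
                                     response=response))
    return verdict.strip().lower().startswith("yes")
```

Note the division of labor the thread argues for: the matcher only does paraphrase detection against a known reference, which is an easier task than the open-ended verification an LLM-as-a-judge must perform without one.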