Joschka Strueber @ Tuebingen AI Center🇪🇺

36 posts

@JoschkaStrueber

PhD student in https://t.co/U3NlKbAHiM at @uni_tue and @MPI_IS (IMPRS-IS). LLM multi-turn post-training and evaluations.

Joined February 2025
90 Following · 28 Followers
Pinned Tweet
Joschka Strueber @ Tuebingen AI Center🇪🇺
🚨🚨🚨 The second full paper of my PhD is done. We introduce a scalable approach for credit assignment in multi-turn information seeking called ΔBelief-RL. More stable training and better generalization than outcome-based GRPO. 🚨🚨🚨
Ilze Amanda Auzina @AmandaIlze

How can agents learn in long, open-ended tasks where success is rare and rewards are sparse? 👀 🚨 Enter ∆Belief-RL: we show how to use the agent's own belief updates as a dense reward for turn-level credit assignment. The result? Surprisingly strong generalization. (1/8) 🧵⬇️
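The idea in the thread can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: assume that after every turn we can read off the agent's belief, i.e. the probability it assigns to the correct answer, and credit each turn by how much it moved that belief.

```python
# Hypothetical sketch of belief-delta rewards for turn-level credit assignment,
# in the spirit of the thread (not the authors' actual implementation).
# Assumption: after each turn we can query the agent's probability of the
# correct answer (its "belief"); the dense reward for turn t is the belief gain.

def belief_delta_rewards(beliefs):
    """beliefs: list of P(correct answer) after turns 0..T (beliefs[0] is the prior)."""
    return [beliefs[t + 1] - beliefs[t] for t in range(len(beliefs) - 1)]

rewards = belief_delta_rewards([0.10, 0.15, 0.60, 0.55, 0.95])
# Turns that raise the agent's belief get positive credit; the fourth turn,
# which lowered the belief, is penalized.
```

A nice property of this shaping is that the turn rewards telescope: they sum to the total belief gain over the episode, so the dense signal stays consistent with the sparse outcome.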

Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Patrik Reizinger @rpatrik96
Unsupervised skill discovery methods have achieved remarkable improvement in the last few years. Why is that the case? In our @iclr_conf paper, we use identifiability theory to explain it, by proving that Contrastive Successor Features learns the ground truth states. A thread.
Joschka Strueber @ Tuebingen AI Center🇪🇺
We're running a short (~10 min) survey to map these problems more systematically. Which archives do you use? What languages do your sources appear in? Where does discovery actually get stuck? Your answers directly shape what we build next. 🔗 tally.so/r/ja7qN9
Joschka Strueber @ Tuebingen AI Center🇪🇺
If you've ever spent days searching five archives in three languages only to come up empty: we want to understand that better. We're researchers at the Tübingen AI Center studying how historians find primary sources and where the process breaks down. 🔗 archivum.umso.co
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Babak Rahmani @babakRmni
🧵 Debugging Code World Models. A few months ago we started studying CWMs. The plan was post-training an LLM on code execution traces. Two weeks in, we realised a paper by Meta had already done much of this: arxiv.org/pdf/2510.02387. However, we identified what's wrong with them!
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
hallerite @hallerite
Happy to finally share what I have been working on for some time now. Introducing »Ludic« – an LLM-RL library for the era of experience. While there are now a lot of LLM-RL codebases, even many good ones, I want to share my very idiosyncratic way to think about LLM-RL.
Fanqing Meng @FanqingMengAI
I think it is not new... In DeepSeek 3.2, they use expert RL -> joint SFT -> joint RL. In LongCat, they use expert RL -> model mr -> joint RL. MiMo replaces SFT with OPD. Everyone knows OPD is better than SFT :)
Joschka Strueber @ Tuebingen AI Center🇪🇺
@dhtikna @FanqingMengAI If the model you train is smaller than the model you distill from, OPD is usually much cheaper than SFT. The cost of sampling off-policy data from the bigger model should be much higher than using that same bigger model to compute logprobs on on-policy data sampled by the cheaper student.
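The cost argument can be made concrete. In on-policy distillation the cheap student samples the sequences and the big teacher only runs a forward pass to score them; the per-token training signal is a divergence between the two next-token distributions. A toy sketch, assuming a shared tokenizer and using reverse KL as one common choice of loss (illustrative, not any specific paper's objective):

```python
# Toy per-token OPD signal: KL(student || teacher) over the vocabulary.
# The student SAMPLES the sequence (cheap); the teacher only SCORES it
# (one forward pass of the big model, no expensive teacher-side sampling).
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position over the vocab."""
    p = softmax(student_logits)   # student's next-token distribution
    q = softmax(teacher_logits)   # teacher's next-token distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Minimizing this per token pushes the student toward the teacher exactly on states the student itself visits, which is the on-policy part of the argument above.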
Ankith 🐋/acc @dhtikna
@FanqingMengAI I think one counterpoint is that on-policy distillation just takes more compute and doesn't produce better results than DeepSeek's SFT + loss-recovering RL, which may be much cheaper (as the SFT part is off-policy). So it's not worth the engineering complexity plus the extra compute.
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Wieland Brendel @wielandbr
🚀 @ELLISInst_Tue will lead the AI development for Germany's new open-source nationwide Adaptive Intelligent System (AIS) learning platform for schools – and we're #hiring!

We are part of a national consortium led by Assecor and @KImachtSchule, and mandated by the FWU. Join us to push the boundaries of #AI in education and build the core intelligence that will power adaptive learning for millions of teachers and pupils in real classrooms. We're developing cutting-edge AI tutoring models, knowledge tracing, and recommendation systems to provide high-quality, personalised learning for all students, regardless of background.

🔹 Several full-time ML roles (2–3 years)
🔹 Top-tier AI research environment
🔹 From prototypes to production-ready solutions for national-scale impact
🔹 Open to on-site, hybrid & remote

👉 Apply now (quick & easy): forms.gle/XmLkwEDD45fY5c… (or via ais@tue.ellis.eu)

We work closely with our amazing collaborators at Assecor, @KImachtSchule (@stes_io, @SchulzAuguste), Tübingen AI Center (@MatthiasBethge), the Hector Research Institute of Education Sciences and Psychology (Ulrich Trautwein), the Tübingen Center for Digital Education (Andreas Lachner) and the @IWMtue (@ucress). Let's shape the future of AI-powered education together. 🔥

#AI #MachineLearning #EdTech #hiring #OpenSource #MLJobs
Joschka Strueber @ Tuebingen AI Center🇪🇺
@ShashwatGoel7 Reminds me a lot of the motivation for speculative decoding: most tokens follow naturally from the context and can be easily predicted by a small draft model. But every once in a while the model is at a crossroads and needs the full/RLVR-trained model to predict the best token.
Shashwat Goel @ ICLR'26 @ShashwatGoel7
Was wondering whether GRPO-style RL is "only a few tokens deep"... Intuitively, we take a next-token predictor and slightly upweigh some tokens, s.t. NTP leads to success. Found this interesting ICLR submission preliminarily indicating this hypothesis is true: openreview.net/forum?id=8vWIX…
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Ofir Press @OfirPress
We'll present AlgoTune today 4:30-7:30PM at Hall C,D,E #2514. It's a benchmark where LMs optimize the runtime of programs like gzip compression, AES encryption, ... Current LMs achieve 1% of their future potential on this; we're super excited to see how the competition unfolds!
Joschka Strueber @ Tuebingen AI Center🇪🇺
@ShashwatGoel7 @lateinteraction Great timing for this, one day before DSMath-V2 😄 Promising pipeline with a model that both generates and self-verifies solutions, in conjunction with a separate meta-verifier. Extending this seems possible, but getting domain-specific cold-start data and verifiers will be tricky?
Omar Khattab @lateinteraction
I still maintain that reflective learning is the future of learning algorithms. This is related to but quite a bit richer than thinking about making value functions that work.
Omar Khattab @lateinteraction

ok but does GEPA do it? Reflective prompt optimizers have somewhat solved the problem that "if the model can't guess the right answer, no learning ever happens". The model should be able to LOOK at its failures and be like "ok yeah, these were stupid guesses, gotta try x in hindsight"

Joschka Strueber @ Tuebingen AI Center🇪🇺
Lightning-fast work by my lab mates and my old supervisor Shyam 🔥 Barely two weeks after Cambrian-S came out, they show that the proposed benchmarks can be solved almost perfectly by a simple, non-video CLIP baseline.
Vishaal Udandarao @vishaal_urao

🚀 New paper! arxiv.org/abs/2511.16655 Recently, Cambrian-S released models & two benchmarks (VSR & VSC) for “spatial supersensing” in video! We found: 1️⃣ Simple no-frame baseline (NoSense) ~perfectly solves VSR! 2️⃣ Tiny sanity check collapses Cambrian-S perf to 0% on VSC! 🧵👇

Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Hardik Bhatnagar @hrdkbhatnagar
🚨 Breaking @WeiboLLM's VibeThinker 1.5B leads the Sober Reasoning leaderboard for its size Punching way above its weight -- outperforming even 32B models 🔥 Outstanding work, @WeiboLLM team!
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Shashwat Goel @ ICLR'26 @ShashwatGoel7
Presenting today at #ICML2025. To learn how to measure language model similarity, and its effects on LLM-as-a-judge and weak-to-strong distillation, join our poster session: today 11 am - 1:30 pm, East Exhibition Hall A-B, E-2411, w/ @AmyPrb @JoschkaStrueber @AmandaIlze
Shashwat Goel @ ICLR'26 @ShashwatGoel7

🚨Great Models Think Alike and this Undermines AI Oversight🚨 New paper quantifies LM similarity (1) LLM-as-a-judge favor more similar models🤥 (2) Complementary knowledge benefits Weak-to-Strong Generalization☯️ (3) More capable models have more correlated failures 📈🙀 🧵👇

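The similarity notion in the quoted paper can be illustrated with a Cohen's-kappa-style sketch: agreement between two models beyond what their accuracies alone would predict. This simplified 0/1 version only conveys the idea; the paper's actual metric differs in detail.

```python
# Kappa-style "models think alike" score: observed per-question agreement,
# adjusted for the agreement expected if the two models erred independently.
# Illustrative sketch, not the paper's exact similarity metric.

def error_agreement_beyond_chance(correct_a, correct_b):
    """correct_a/correct_b: lists of 0/1 per question (1 = answered correctly)."""
    n = len(correct_a)
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # chance agreement under independent errors
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    return (observed - expected) / (1 - expected)
```

The score is 1 for models that succeed and fail on exactly the same questions, and near 0 when their errors are uncorrelated, which is what makes it usable as a similarity measure between judges and judged models.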
Joschka Strueber @ Tuebingen AI Center🇪🇺 retweeted
Shashwat Goel @ ICLR'26 @ShashwatGoel7
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜 New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA etc., and even "multimodal" benchmarks like MMMU-Pro, which can be solved without even looking at the image ⁉️ Such choice-only shortcuts are hard to fix. We find prior attempts at fixing them (GoldenSwag for HellaSwag, and TruthfulQA v2) ended up worsening the problem.

MCQs are inherently a discriminative task, only requiring picking the correct choice among a few given options. Instead, we should evaluate language models for the generative capabilities they are used for. We show discrimination is easier than even verification, let alone generation.

🤔 But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers... We show a scalable alternative, Answer Matching, works surprisingly well. It's simple: get generative responses to existing benchmark questions that are specific enough to have a semantically unique answer, without showing choices. Then use an LM to match the response against the ground-truth answer.

👨‍🔬 We conduct a meta-evaluation by comparing to ground-truth verification on MATH, and human grading on MMLU-Pro and GPQA-Diamond questions. Answer Matching outcomes give near-perfect alignment, even with small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference answer, the model is tasked with verification, which is harder than what answer matching requires (paraphrase detection), a skill modern language models have aced 💡 Let's shift the benchmarking ecosystem from MCQs to Answer Matching.

Impacts:
Leaderboards: We show model rankings can change and accuracies go down, making benchmarks seem less saturated.
Benchmark creation: Instead of creating harder MCQs, we should focus our efforts on creating questions for answer matching, much like SimpleQA, GAIA etc.
🤑 Cost: Finally, to our great surprise, answer matching evals are cheaper to run than MCQs!

See our paper for more, it's packed with insights. 🧵 has the paper and more result figures.
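The Answer Matching protocol described above is mechanically simple. A minimal sketch, where the matcher model interface and prompt wording are illustrative assumptions, not taken from the paper:

```python
# Sketch of Answer Matching: grade a free-form response by asking a matcher LM
# whether it expresses the same answer as the ground truth. No choices are
# shown to either model. Prompt and interface are illustrative assumptions.

MATCH_PROMPT = """Question: {question}
Ground-truth answer: {reference}
Model response: {response}

Does the response express the same answer as the ground truth? Reply Yes or No."""

def answer_matches(question, reference, response, lm):
    """lm: fn(prompt) -> str. Returns True if the matcher judges the response
    to be a paraphrase of the reference answer."""
    verdict = lm(MATCH_PROMPT.format(question=question,
                                     reference=reference,
                                     response=response))
    return verdict.strip().lower().startswith("yes")
```

Note the division of labor the thread argues for: the matcher only does paraphrase detection against a known reference, which is an easier task than the open-ended verification an LLM-as-a-judge must perform without one.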