taesiri

619 posts

@taesiri

Research Scientist @ EA Sports, VLMs, Evals, All opinions are my own.

Planet Mars · Joined February 2017
5.5K Following · 964 Followers
taesiri reposted
Tin (Kevin) Nguyen (@tin_ng_qn)
After a long journey of submissions, rebuttals, and revisions, I am excited to share that our paper, Highlighted Chain of Thought (HoT), has been accepted to the Transactions on Machine Learning Research (TMLR) 🥳🎉 In this work, we study how grounding reasoning traces with inputs can improve human verification and LLM accuracy. Project page: highlightedchainofthought.github.io Huge thanks to my collaborators and everyone who provided feedback throughout the process!
0
2
7
195
taesiri reposted
Logan Bolton (@septisum)
Hood popped. Photo taken. “Hey ChatGPT, how do I check my car’s oil level?” And it returns a giant block of text… Instead, a human would point to the oil cap and draw on the photo to answer! We explore how to unlock VLMs to do that, i.e., annotating on the image to guide users through answers visually: sketchvlm.github.io 1/n 🧵
4
5
10
967
Logan Kilpatrick (@OfficialLoganK)
We are hiring a bunch of Members of the Technical Staff for @GoogleAIStudio who can blend PM, design, eng, and more. If this is you, pls DM me, we will move fast for the best people.
239
149
3.6K
576.3K
Logan Bolton (@septisum)
I had Claude draft an email for me and it assumed my last name is Nguyen lol
2
0
4
155
taesiri (@taesiri)
@omarsar0 Why not buy two or even three accounts?
0
0
1
227
elvis (@omarsar0)
After a weekend of building intensively with gpt-5.4, I had to upgrade to Pro again. It's too good! Keep an open mind to different coding agents. Love to learn them all. I use both Claude Code and Codex now. They have their own unique strengths.
31
10
173
14.4K
Wenhao Chai (@wenhaocha1)
It’s over. Gemini 3 Deep Think achieved a 3300 rating on LiveCodeBench Pro, almost surpassing all humans (99.99%), and is leading GPT-5.2 by a massive margin of 1000 points. Gemini is insanely strong! Link: livecodebenchpro.com/projects/livec…
64
91
1.1K
142.5K
taesiri (@taesiri)
@gdb The limits are too tight; please raise them
0
0
1
208
taesiri (@taesiri)
Somebody made OpenClaw that runs on a potato 🥔 Installed it on a Milk-V DUO S I had lying around. First try had some issues with Gemini 3.0 API calls. Then I used Antigravity + Gemini 3.0 Flash to fix PicoClaw, compiled it, and installed it on the Milk-V. Now works like a charm 🚀 PicoClaw: github.com/sipeed/picoclaw Milk-V DUO S: milkv.io/duo-s
0
1
10
1.4K
taesiri (@taesiri)
MOTIVE (MOTIon attribution for Video gEneration)
So, which training clips make your generated videos move realistically?
- High-influence clips show clear, physically grounded dynamics (rolling objects, floating motion)
- Negative-influence clips tend to be static footage, camera-only motion, or cartoons with simplified kinematics
- Motion attribution is not simply selecting "motion-rich" clips; the top 10% selected videos have only 4.3% higher mean motion magnitude than the bottom 10%
ArXiv: arxiv.org/abs/2601.08828
0
0
4
322
taesiri (@taesiri)
DART - teaching LLMs to spontaneously use Python during long chain-of-thought reasoning. Build rollout trees during RL training, inject tool hints at high-entropy (uncertain) positions, then credit sub-trajectories where code actually helped. No annotated data needed. ArXiv: arxiv.org/abs/2601.08274
0
0
2
196
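To make the "inject tool hints at high-entropy (uncertain) positions" step from the DART post above more concrete, here is a minimal sketch. It is not the paper's implementation: the helper names (`token_entropy`, `hint_positions`), the fixed threshold, and the toy distributions are all assumptions made for illustration. It only shows how the Shannon entropy of a next-token distribution can flag uncertain decoding steps.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def hint_positions(step_distributions, threshold):
    """Return indices of decoding steps whose entropy exceeds `threshold`.

    Hypothetical helper: a fixed threshold is an assumption, not DART's rule.
    """
    return [i for i, probs in enumerate(step_distributions)
            if token_entropy(probs) > threshold]

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],  # fairly confident step
    [0.40, 0.30, 0.20, 0.10],  # uncertain step
]
positions = hint_positions(steps, threshold=1.0)
print(positions)
```

With these toy numbers, only the second and fourth steps exceed the 1.0-nat threshold, so they would be the candidate positions for a tool hint under this assumed rule.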
taesiri (@taesiri)
An efficient DiT designed for high-fidelity text-to-image generation on mobile and edge devices.
- Uses TinyCLIP and Gemma3-4b-it as text encoders
- Knowledge distillation from Qwen-Image (20B) teacher
- 4-step generation achieves near-lossless quality compared to 28-step baseline
ArXiv: arxiv.org/abs/2601.08303
0
0
1
140
taesiri (@taesiri)
JudgeRLVR, a two-stage training paradigm for making LLM reasoning more efficient. Key insight: Train LLMs to judge solutions before training them to generate. The judging stage teaches what good reasoning looks like, so the model stops wasting tokens on trial-and-error. ArXiv: arxiv.org/abs/2601.08468
0
0
1
91
taesiri (@taesiri)
Currently, inter-model communication happens through text tokens, which is bandwidth-limited. So, why not just let models directly read and write to each other's key-value (K-V) cache latent spaces? ArXiv: arxiv.org/abs/2601.06123
0
0
1
155
taesiri (@taesiri)
Linear attention's reliance on a single global KV summary creates two problems:
- Rank limitation: The attention matrix rank is capped at d (head dimension) regardless of sequence length N, severely limiting representational capacity when N >> d
- Loss of sparsity: As sequence length grows, attention weights become increasingly uniform (high entropy), losing the ability to selectively focus on relevant tokens
MHLA fixes this by:
- Splitting tokens into blocks with local KV summaries
- Learning query-specific mixtures of these summaries
ArXiv: arxiv.org/abs/2601.07832
0
0
2
77
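The rank limitation mentioned in the MHLA post above follows from a basic linear-algebra fact: an N×N score matrix built as φ(Q)φ(K)ᵀ from d-dimensional features cannot have rank above d. The sketch below checks this on toy data. It is not MHLA code; the sizes, the random positive integer "features", and the helper names are assumptions chosen so the rank can be computed exactly with the standard library.

```python
from fractions import Fraction
import random

def matmul_transpose(q, k):
    """Return q @ k^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def rank(matrix):
    """Exact matrix rank via Gaussian elimination over rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            if factor != 0:
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

random.seed(0)
N, d = 12, 3  # sequence length and head dimension (toy values)
# Positive entries stand in for features after a positive feature map.
Q = [[random.randint(1, 9) for _ in range(d)] for _ in range(N)]
K = [[random.randint(1, 9) for _ in range(d)] for _ in range(N)]

scores = matmul_transpose(Q, K)  # N x N unnormalized attention
# Row normalization rescales each row, so it cannot raise the rank.
weights = [[Fraction(x, sum(row)) for x in row] for row in scores]

print(rank(scores), rank(weights), "<=", d)
```

Increasing `N` while keeping `d` fixed leaves the bound unchanged, which is the "capped at d regardless of sequence length" point in the post.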
taesiri (@taesiri)
SOTA reasoning models catastrophically fail when given noisy or distracting context, something that can happen in real-world deployments.
- Agentic workflows make it WORSE
- Random noise triggers misalignment without adversarial intent
- More thinking = worse results in noisy settings
Fix: RARE: reward models for finding helpful info in noise, not just final answers.
ArXiv: arxiv.org/abs/2601.07226
0
0
2
63
taesiri (@taesiri)
Dr. Zero: Search agents that teach themselves; no training data needed! Proposer-solver co-evolution: one model generates increasingly hard questions, the other learns to answer them using web search. Key trick: HRPO (Hop-Grouped Relative Policy Optimization) groups questions by reasoning complexity for 4× more efficient RL training. Result: Matches/beats supervised methods on QA benchmarks while using zero human-curated data. ArXiv: arxiv.org/abs/2601.07055
0
0
2
80
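The Dr. Zero post above says HRPO "groups questions by reasoning complexity" but gives no formula. As a rough illustration only, the sketch below assumes a GRPO-style scheme in which rewards are standardized within each group of questions that share a hop count. The function name, the `(hop_count, reward)` input format, and the toy rewards are all hypothetical; the paper's actual objective may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hop_grouped_advantages(samples, eps=1e-8):
    """Standardize rewards within each hop-count group.

    samples: list of (hop_count, reward) pairs.
    Returns one advantage per sample, in input order.
    Assumption: per-group mean/std normalization, not the paper's exact rule.
    """
    groups = defaultdict(list)
    for hops, reward in samples:
        groups[hops].append(reward)
    stats = {h: (mean(rs), pstdev(rs)) for h, rs in groups.items()}
    return [(reward - stats[hops][0]) / (stats[hops][1] + eps)
            for hops, reward in samples]

# Made-up rollouts: (number of reasoning hops, reward for the answer).
samples = [(1, 1.0), (1, 0.0), (2, 1.0), (2, 1.0), (2, 0.0), (3, 0.0)]
advantages = hop_grouped_advantages(samples)
print([round(a, 3) for a in advantages])
```

Under this assumption, a correct answer is compared only against other questions of similar difficulty, so easy one-hop successes do not dominate the baseline for harder multi-hop questions.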
taesiri (@taesiri)
Long CoT reasoning in LLMs exhibits stable "molecular-like" structures formed by three types of reasoning behaviors that function analogously to chemical bonds:
- Deep Reasoning = covalent bonds (logical backbone)
- Self-Reflection = hydrogen bonds (error correction)
- Self-Exploration = van der Waals (hypothesis branching)
Models learn reasoning structure, not keywords. Mixing incompatible structures causes chaos, which explains why combining diverse CoT data often fails.
Introduces Mole-Syn: synthesize effective Long CoT from scratch using only instruction LLMs by transferring behavioral transition graphs.
ArXiv: arxiv.org/abs/2601.06002
0
0
2
75
taesiri (@taesiri)
"Over-searching" problem: LLMs keep using the search tool even when queries are unanswerable, wasting compute & causing hallucinations.
- Reasoning models are worse at this
- Retrieval quality matters: noisy retrieval causes 3.6× more searching
- Snowball effect in conversations: over-searching compounds across multi-turn conversations
ArXiv: arxiv.org/abs/2601.05503
0
1
3
94
taesiri (@taesiri)
GenCtrl: a formal control-theoretic framework to answer a fundamental question that current AI research largely ignores: are generative models actually controllable in the first place? Spoiler: often not. Controllability is fragile & task-dependent. ArXiv: arxiv.org/abs/2601.05637
0
0
2
101