taesiri

619 posts

@taesiri

Research Scientist @ EA Sports, VLMs, Evals, All opinions are my own.

Planet Mars · Joined February 2017

5.5K Following · 964 Followers

taesiri retweeted

Logan Bolton@septisum·6d

Very cool to see @giffmana talk about a benchmark I helped create 😃

Kosta Derpanis (sabbatical in Zurich)@CSProfKGD

taesiri retweeted

Tin (Kevin) Nguyen@tin_ng_qn·9 May

After a long journey of submissions, rebuttals, and revisions, I am excited to share that our paper, Highlighted Chain of Thought (HoT), has been accepted to the Transactions on Machine Learning Research (TMLR) 🥳🎉 In this work, we study how grounding reasoning traces with inputs can improve human verification and LLM accuracy. Project page: highlightedchainofthought.github.io Huge thanks to my collaborators and everyone who provided feedback throughout the process!

taesiri retweeted

Logan Bolton@septisum·29 Apr

Hood popped. Photo taken. “Hey ChatGPT, how do I check my car’s oil level?” And it returns a giant block of text… Instead, a human would point to the oil cap and draw on the photo to answer! We explore how to unlock VLMs to do that, i.e., annotating on the image to guide users through answers visually: sketchvlm.github.io 1/n 🧵

taesiri@taesiri·24 Apr

@OfficialLoganK @GoogleAIStudio Can we use Claude Code? 😬

Logan Kilpatrick@OfficialLoganK·24 Apr

We are hiring a bunch of Members of the Technical Staff for @GoogleAIStudio who can blend PM, design, eng, and more. If this is you, pls DM me, we will move fast for the best people.

taesiri@taesiri·18 Mar

@septisum 😂😂

Logan Bolton@septisum·18 Mar

I had Claude draft an email for me and it assumed my last name is Nguyen lol

taesiri@taesiri·10 Mar

@omarsar0 Why not buy two or even three accounts?

elvis@omarsar0·10 Mar

After a weekend of building intensively with gpt-5.4, I had to upgrade to Pro again. It's too good! Keep an open mind to different coding agents. Love to learn them all. I use both Claude Code and Codex now. They have their own unique strengths.

taesiri@taesiri·18 Feb

@wenhaocha1 Where is Opus 4.6?

Wenhao Chai@wenhaocha1·17 Feb

It’s over. Gemini 3 Deep Think achieved a 3300 rating on LiveCodeBench Pro, almost surpassing all humans (99.99%), and is leading GPT-5.2 by a massive margin of 1000 points. Gemini is insanely strong! Link: livecodebenchpro.com/projects/livec…

taesiri@taesiri·16 Feb

@gdb The limits are too tight; please raise them

Greg Brockman@gdb·16 Feb

codex momentum is strong, and many people are feeling just how big of a leap 5.3 is. if your organization hasn't tried codex yet, it's worth revisiting.

Sam Altman@sama

Codex weekly users have more than tripled since the beginning of the year!

taesiri@taesiri·16 Feb

Somebody made OpenClaw that runs on a potato 🥔
Installed it on a Milk-V DUO S I had lying around. First try had some issues with Gemini 3.0 API calls. Then I used Antigravity + Gemini 3.0 Flash to fix PicoClaw, compiled it, and installed it on the Milk-V. Now works like a charm 🚀
PicoClaw: github.com/sipeed/picoclaw
Milk-V DUO S: milkv.io/duo-s

taesiri@taesiri·14 Jan

MOTIVE (MOTIon attribution for Video gEneration)
So, which training clips make your generated videos move realistically?
- High-influence clips show clear, physically grounded dynamics (rolling objects, floating motion)
- Negative-influence clips tend to be static footage, camera-only motion, or cartoons with simplified kinematics
- Motion attribution is not simply selecting "motion-rich" clips; the top 10% selected videos have only 4.3% higher mean motion magnitude than the bottom 10%
ArXiv: arxiv.org/abs/2601.08828

taesiri@taesiri·14 Jan

DART - teaching LLMs to spontaneously use Python during long chain-of-thought reasoning. Build rollout trees during RL training, inject tool hints at high-entropy (uncertain) positions, then credit sub-trajectories where code actually helped. No annotated data needed. ArXiv: arxiv.org/abs/2601.08274
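The "high-entropy (uncertain) positions" idea can be illustrated with a toy sketch. This is not the DART implementation: the function names, the made-up probabilities, and the fixed threshold are all assumptions, and the sketch only shows how per-step Shannon entropy could flag candidate positions for a tool hint.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_probs, threshold):
    """Indices of decoding steps whose entropy exceeds the threshold.

    Hypothetical helper: a real system would read these distributions
    from the model's logits rather than from a hand-written list.
    """
    return [
        i for i, probs in enumerate(step_probs)
        if token_entropy(probs) > threshold
    ]


# Made-up next-token distributions for four decoding steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over 4 tokens
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]

# The threshold of 1.0 nat is an arbitrary illustrative choice.
candidates = high_entropy_positions(steps, threshold=1.0)
print(candidates)  # [1, 3]
```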

taesiri@taesiri·14 Jan

An efficient DiT designed for high-fidelity text-to-image generation on mobile and edge devices.
- Uses TinyCLIP and Gemma3-4b-it as text encoders
- Knowledge distillation from Qwen-Image (20B) teacher
- 4-step generation achieves near-lossless quality compared to 28-step baseline
ArXiv: arxiv.org/abs/2601.08303

taesiri@taesiri·14 Jan

JudgeRLVR, a two-stage training paradigm for making LLM reasoning more efficient. Key insight: Train LLMs to judge solutions before training them to generate. The judging stage teaches what good reasoning looks like, so the model stops wasting tokens on trial-and-error. ArXiv: arxiv.org/abs/2601.08468

taesiri@taesiri·13 Jan

Currently, inter-model communication happens through text tokens, which is bandwidth-limited. So, why not just let models directly read and write to each other's key-value (K-V) cache latent spaces? ArXiv: arxiv.org/abs/2601.06123

taesiri@taesiri·13 Jan

Linear attention's reliance on a single global KV summary creates two problems:
- Rank limitation: The attention matrix rank is capped at d (head dimension) regardless of sequence length N, severely limiting representational capacity when N >> d
- Loss of sparsity: As sequence length grows, attention weights become increasingly uniform (high entropy), losing the ability to selectively focus on relevant tokens
MHLA fixes this by:
- Splitting tokens into blocks with local KV summaries
- Learning query-specific mixtures of these summaries
ArXiv: arxiv.org/abs/2601.07832
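The rank limitation described above can be checked on a toy example. This is a minimal sketch and not MHLA itself: it assumes the (unnormalized) linear-attention score matrix factors as an N×d query-feature matrix times the transpose of an N×d key-feature matrix, uses random positive integers as stand-in features, and computes the rank exactly with a small hand-written elimination routine.

```python
from fractions import Fraction
import random


def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def rank(matrix):
    """Exact matrix rank via Gaussian elimination over Fractions."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


random.seed(0)
N, d = 12, 3  # sequence length much larger than head dimension

# Stand-in positive feature vectors phi(q_i), phi(k_j); values are arbitrary.
Q = [[random.randint(1, 9) for _ in range(d)] for _ in range(N)]
K = [[random.randint(1, 9) for _ in range(d)] for _ in range(N)]

scores = matmul_t(Q, K)   # N x N score matrix
attn_rank = rank(scores)  # can never exceed d, however large N is

# Row normalization only rescales each row, so it does not change the rank.
print(N, d, attn_rank)
```

Increasing `N` in this sketch leaves the rank bounded by `d`, which is the capacity limit the post refers to.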

taesiri@taesiri·13 Jan

SOTA reasoning models catastrophically fail when given noisy or distracting context, something that can happen in real-world deployments.
- Agentic workflows make it WORSE
- Random noise triggers misalignment without adversarial intent
- More thinking = worse results in noisy settings
Fix: RARE: reward models for finding helpful info in noise, not just final answers.
ArXiv: arxiv.org/abs/2601.07226

taesiri@taesiri·13 Jan

Dr. Zero: Search agents that teach themselves; no training data needed!
Proposer-solver co-evolution: one model generates increasingly hard questions, the other learns to answer them using web search.
Key trick: HRPO (Hop-Grouped Relative Policy Optimization) groups questions by reasoning complexity for 4× more efficient RL training.
Result: Matches/beats supervised methods on QA benchmarks while using zero human-curated data.
ArXiv: arxiv.org/abs/2601.07055

taesiri@taesiri·12 Jan

Long CoT reasoning in LLMs exhibits stable "molecular-like" structures formed by three types of reasoning behaviors that function analogously to chemical bonds:
- Deep Reasoning = covalent bonds (logical backbone)
- Self-Reflection = hydrogen bonds (error correction)
- Self-Exploration = van der Waals (hypothesis branching)
Models learn reasoning structure, not keywords. Mixing incompatible structures causes chaos, which explains why combining diverse CoT data often fails.
Introduces Mole-Syn: synthesize effective Long CoT from scratch using only instruction LLMs by transferring behavioral transition graphs.
ArXiv: arxiv.org/abs/2601.06002

taesiri@taesiri·12 Jan

"Over-searching" problem: LLMs keep using the search tool even when queries are unanswerable, wasting compute & causing hallucinations.
- Reasoning models are worse at this
- Retrieval quality matters: noisy retrieval causes 3.6× more searching
- Snowball effect in conversations: over-searching compounds across multi-turn conversations
ArXiv: arxiv.org/abs/2601.05503

taesiri@taesiri·12 Jan

GenCtrl: a formal control-theoretic framework to answer a fundamental question that current AI research largely ignores: are generative models actually controllable in the first place? Spoiler: often not. Controllability is fragile & task-dependent. ArXiv: arxiv.org/abs/2601.05637
