SamT

1.1K posts

@st_bomba

Making AI stuff

Joined November 2021

4.2K Following · 2.4K Followers

SamT reposted

Shane Gu@shaneguML·13 Mar

I had a chance to chat with Andrej when he visited Tokyo in 2022, right after he wrapped up his 5-year work with Elon (he told me he was "recovering"). It was clear that his true passions are education and empowerment. You can see this reflected on his X feed—there is no hidden agenda, just a genuine desire to share what excites him. In a way, Andrej represents the ultimate nightmare for organizations like OpenAI. His goal is to empower developers outside of closed labs to train their own models, making AI knowledge accessible far beyond a small, elite group. If you empower enough people, closed labs will eventually run out of the capital needed to acquire all that distributed talent, and the monopoly on AI power will dissolve. This is just my speculation, but looking at it through this lens, it makes complete sense why he left OpenAI after just a year.

Noam Brown@polynoamial

@saranormous @karpathy @NoPriorsPod Why is he not at a frontier AI lab at the most pivotal time in human history since at least the industrial revolution?


SamT reposted

Alan Karthikesalingam@alan_karthi·11 Mar

Delighted to share: Prospective Study of Conversational Diagnostic AI in Ambulatory Primary Care @GoogleDeepMind @GoogleResearch @BIDMChealth – arxiv.org/pdf/2603.08448 research.google/blog/exploring… Like real medical training, we @GoogleDeepMind took a series of milestones on the way to performing these real-world clinical encounters: from passing medical licensing-style exams (rdcu.be/dgGfJ) to mirroring PCP skills in simulated consultation OSCEs (nature.com/articles/s4158…). Now, in partnership with @BIDMChealth, 100 patients engaged in a text-chat with our AI (AMIE) up to 5 days before their in-person PCP visit. AMIE took a comprehensive history and drafted a summary & differential diagnosis (DDx) for the PCP. It was carefully monitored by a physician ready to intervene based on strict safety criteria.


SamT@st_bomba·1 Mar

@neural_avb Would you recommend any specific frameworks or do you code your agents from scratch?


AVB@neural_avb·28 Feb

x.com/i/article/2027…


SamT reposted

Oriol Vinyals@OriolVinyalsML·18 Nov

The secret behind Gemini 3? Simple: Improving pre-training & post-training 🤯 Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS '25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we've ever seen. No walls in sight! Post-training: Still a total greenfield. There's lots of room for algorithmic progress and improvement, and 3.0 hasn't been an exception, thanks to our stellar team. Congratulations to the whole team 💙💙💙


SamT reposted

Google DeepMind@GoogleDeepMind·18 Nov

Our first release is Gemini 3 Pro, which is rolling out globally starting today. It significantly outperforms 2.5 Pro across the board:
🥇 Tops LMArena and WebDev @arena leaderboards
🧠 PhD-level reasoning on Humanity’s Last Exam
📋 Leads long-horizon planning on Vending-Bench 2


SamT reposted

Artificial Analysis@ArtificialAnlys·18 Nov

Gemini 3 Pro is the new leader in AI. Google has the leading language model for the first time, with Gemini 3 Pro debuting +3 points above GPT-5.1 in our Artificial Analysis Intelligence Index.

@GoogleDeepMind gave us pre-release access to Gemini 3 Pro Preview. The model outperforms all other models in Artificial Analysis Intelligence Index. It demonstrates strength across the board, coming in first in 5 of the 10 evaluations that make up Intelligence Index. Despite these intelligence gains, Gemini 3 Pro Preview shows improved token efficiency from Gemini 2.5 Pro, using significantly fewer tokens on the Intelligence Index than other leading models such as Kimi K2 Thinking and Grok 4. However, given its premium pricing ($2/$12 per million input/output tokens for <200K context), Gemini 3 Pro is among the most expensive models to run our Intelligence Index evaluations.

Key takeaways:

📖 Leading intelligence: Gemini 3 Pro Preview is the leading model in 5 of 10 evals in the Artificial Analysis Intelligence Index, including GPQA Diamond, MMLU-Pro, HLE, LiveCodeBench and SciCode. Its score of 37% on Humanity’s Last Exam is particularly impressive, improving on the previous best model by more than 10 percentage points. It also is leading in AA-Omniscience, Artificial Analysis’ new knowledge and hallucination evaluation, coming first in both Omniscience Index (our lead metric that takes off points for incorrect answers) and Omniscience Accuracy (percentage correct). Given that factual recall correlates closely with model size, this may point to Gemini 3 Pro being a much larger model than its competitors.

💻 Advanced coding and agentic capabilities: Gemini 3 Pro Preview leads two of the three coding evaluations in the Artificial Analysis Intelligence Index, including an impressive 56% in SciCode, an improvement of over 10 percentage points from the previous highest score. It is also strong in agentic contexts, achieving the second highest score in Terminal-Bench Hard and Tau2-Bench Telecom.

🖼️ Multimodal capabilities: Gemini 3 Pro Preview is a multi-modal model, with the ability to take text, images, video and audio as input. It scores the highest of any model on MMMU-Pro, a benchmark that tests reasoning abilities with image inputs. Google now occupies the first, third and fourth position in our MMMU-Pro leaderboard (with GPT-5.1 taking out second place just last week).

💲 Premium Pricing: To measure cost, we report Cost to Run the Artificial Analysis Intelligence Index, which combines input and output token prices with token efficiency to reflect true usage cost. Despite the improvement in token efficiency from Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing of $2/$12 USD per million input/output tokens (≤200k token context) results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index compared to its predecessor, and the model is among the most expensive to run on our Intelligence Index. Google also continues to price long context workloads higher than lower context workloads, charging $4/$18 per million input/output tokens for ≥200k token context.

⚡ Speed: Gemini 3 Pro Preview has comparable speeds to Gemini 2.5 Pro, with 128 output tokens per second. This places it ahead of other frontier models including GPT-5.1 (high), Kimi K2 Thinking and Grok 4. This is potentially supported by Google’s first-party TPU accelerators.

Other details: Gemini 3 Pro Preview has a 1 million token context window, and includes support for tool calling, structured outputs, and JSON mode.

See below for further analysis
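To make the pricing figures above concrete, here is a rough sketch of how a per-request cost could be estimated from the quoted rates ($2/$12 per million input/output tokens, and $4/$18 for long context). The function name `estimate_cost_usd` is hypothetical, and two things are assumptions rather than facts from the post: that the tier is chosen by the input token count, and that the long-context rate applies strictly above 200k tokens (the post gives both "≤200k" and "≥200k", so the exact boundary is ambiguous).

```python
# Rates quoted in the post, in USD per million tokens: (input, output).
STANDARD_RATES = (2.0, 12.0)      # standard context
LONG_CONTEXT_RATES = (4.0, 18.0)  # long context
LONG_CONTEXT_THRESHOLD = 200_000  # assumption: tier is chosen by input size


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts.

    Assumption: the long-context rate applies when the input is strictly
    larger than 200k tokens; the post leaves the boundary ambiguous.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_rate, out_rate = LONG_CONTEXT_RATES
    else:
        in_rate, out_rate = STANDARD_RATES
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# 100k input + 10k output at the standard rate.
short_request = estimate_cost_usd(100_000, 10_000)
# 300k input + 10k output at the long-context rate.
long_request = estimate_cost_usd(300_000, 10_000)
print(f"short: ${short_request:.2f}, long: ${long_request:.2f}")
```

Under these assumptions, output tokens dominate the bill for short prompts, which is why the post pairs token prices with token efficiency when reporting the cost to run the index.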


SamT@st_bomba·9 Nov

@akshay_pachaar Do you have a good example or repo for summary indexing?


Akshay 🚀@akshay_pachaar·9 Nov

Here's a common misconception about RAG!

Most people think RAG works like this: index a document → retrieve that same document. But indexing ≠ retrieval. What you index doesn't have to be what you feed the LLM. Once you understand this, you can build RAG systems that actually work.

Here are 4 indexing strategies that separate good RAG from great RAG:

1) Chunk Indexing
↳ This is the standard approach. Split documents into chunks, embed them, store in a vector database, and retrieve the closest matches.
↳ Simple and effective, but large or noisy chunks will hurt your precision.

2) Sub-chunk Indexing
↳ Break your chunks into smaller sub-chunks for indexing, but retrieve the full chunk for context.
↳ This is powerful when a single section covers multiple concepts. You get better query matching without losing the surrounding context your LLM needs.

3) Query Indexing
↳ Instead of indexing raw text, generate hypothetical questions the chunk could answer. Index those questions instead.
↳ User queries naturally align better with questions than raw document text. This closes the semantic gap between what users ask and what you've stored.
↳ Perfect for QA systems.

4) Summary Indexing
↳ Use an LLM to summarize each chunk. Index the summary, retrieve the full chunk.
↳ This shines with dense, structured data like CSVs and tables where raw text embeddings fall flat.

The bottom line: You don't need to retrieve exactly what you indexed. Match your indexing strategy to your data, and your RAG system will perform significantly better.

What indexing strategies have worked best for you?
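The idea that what is indexed can differ from what is retrieved can be sketched in a few lines. This is a minimal illustration, not anyone's production design: `DecoupledIndex`, `embed`, and `sub_chunks` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a real embedding model, and the summary is hand-written where a real system would ask an LLM to produce it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class DecoupledIndex:
    """Index short keys (sub-chunks, questions, summaries), return parent chunks."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str]] = []  # (key vector, parent chunk)

    def add(self, chunk: str, keys: list[str]) -> None:
        for key in keys:
            self.entries.append((embed(key), chunk))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        results: list[str] = []
        for _, chunk in ranked:
            if chunk not in results:  # several keys may point at the same chunk
                results.append(chunk)
            if len(results) == k:
                break
        return results


def sub_chunks(chunk: str) -> list[str]:
    """Sub-chunk indexing: use each sentence of the chunk as an index key."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]


refund_chunk = "Refunds are issued within 14 days. Shipping is free for orders over 50 dollars."
table_chunk = "region,q1,q2\nnorth,10,12\nsouth,7,9"

index = DecoupledIndex()
# Strategy 2: index the sentences, retrieve the whole chunk.
index.add(refund_chunk, sub_chunks(refund_chunk))
# Strategy 4: index a summary (hand-written here), retrieve the raw table.
index.add(table_chunk, ["Quarterly sales figures by region for the first two quarters."])

print(index.retrieve("How long do refunds take?"))
print(index.retrieve("sales by region"))
```

Query indexing (strategy 3) fits the same structure: pass generated questions as the `keys` for a chunk. The raw CSV shares no words with "sales by region", so only the summary key lets that query reach the table.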


SamT reposted

Crystal@crystalsssup·6 Nov

I'm so proud!! The open-source trillion-parameter reasoning model <3
> SOTA on HLE (44.9%) and BrowseComp (60.2%)

Kimi.ai@Kimi_Moonshot

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here.
🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window

Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling: scaling both thinking tokens and tool-calling turns. K2 Thinking is now live on kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API.

🔌 API is live: platform.moonshot.ai
🔗 Tech blog: moonshotai.github.io/Kimi-K2/thinki…
🔗 Weights & code: huggingface.co/moonshotai


SamT reposted

Shubham Saboo@Saboo_Shubham_·23 Oct

Vibe coding with AI Studio is so addictive! I built an AI Hair Stylist Agent that:
> finds the best haircut by looking at my face
> generates the after images using nano banana
> lets me edit it live with the Live API
> finds nearby salons using Google Maps
Built in just 5 mins!


SamT@st_bomba·10 Oct

@NVIDIAGeForce GeForce Day


NVIDIA GeForce@NVIDIAGeForce·10 Oct

🟢 GEFORCE DAY IS BACK 🟢 To celebrate, we're giving away TWO GeForce RTX 5080 Founders Edition GPUs, signed by NVIDIA CEO Jensen Huang. Want one? Comment "GeForce Day" for a chance to WIN & stay tuned for more!


SamT reposted

Ross Taylor@rosstaylor90·10 Oct

RL is not enough. It only reaches its potential when combined with other ideas. The most famous example is AlphaZero. RL was combined with self-play which created an implicit task curriculum that evolved through training. This is very different from many RL datasets for LLMs which have a fixed set of tasks. Even where the task set is fixed, RL still needs to be combined with other ideas to show signs of life. Thinking models only come to life through RL when we remove length penalisation and have enough prior knowledge in the training data. Looking ahead, long horizon tasks will require much more exploration and “going off-piste”. But current RL methods induce policy entropy collapse. Diverse mid-training before RL could help, but fundamentally current RL objectives don’t reward “interestingness” and deviating from the “current (high reward) thing”. And yet, discovery is all about deviating from the current thing - and the best ideas come from the wilderness. Deep learning is no exception. It was interesting long before it reached its true potential. We didn’t wait to introduce backprop and SGD until compute and data came online :). This is a long way of saying: crude RL maximalism is overhyped. The magic comes from the interplay of RL with other things, and the angels are in the details.


SamT@st_bomba·8 Oct

@akshay_pachaar Good stuff! What would you recommend for multi agent fine tuning?


Akshay 🚀@akshay_pachaar·8 Oct

LLM fine-tuning techniques I'd learn if I were to customize them: Bookmark this. 1. LoRA 2. QLoRA 3. Prefix Tuning 4. Adapter Tuning 5. Instruction Tuning 6. P-Tuning 7. BitFit 8. Soft Prompts 9. RLHF 10. RLAIF 11. DPO (Direct Preference Optimization) 12. GRPO (Group Relative Policy Optimization) 13. RLAIF (RL with AI Feedback) 14. Multi-Task Fine-Tuning 15. Federated Fine-Tuning My favourite is GRPO for building reasoning models. What about you? I've shared my full tutorial on GRPO in the replies.


SamT reposted

Nathan Lambert@natolambert·22 Sep

Thinking, Searching, and Acting
A reflection on reasoning models.

It's easy to fixate on the "thinking" that gave reasoning models their name, but just over a year out from o1-preview's release by OpenAI, the core primitives that make up models today have expanded. Searching and executing tools make up for their deficiencies as probabilistic tools with outdated information in their parameters. Together, these three actions will act as the foundation of the systems we use for years, and the engineering aspects of them matter just as much as getting precisely the right model weights.

Interconnects@interconnectsai

Thinking, Searching, and Acting A reflection on reasoning models. interconnects.ai/p/thinking-sea…


SamT@st_bomba·21 Sep

@Shruti_0810 MIT

SamT reposted

Shruti Codes@Shruti_0810·20 Sep

"Mathematics for Computer Science" (MIT)
This 1048-page book is now FREE. A MUST for all beginners.
To get it:
1. Follow me (so that I can DM you)
2. Repost
3. Comment "MIT"


SamT reposted

Lakshya A Agrawal@LakshyAAAgrawal·17 Sep

In this context, GEPA works as a prompt optimizer, so the end result is a prompt (or multiple prompts for a multi-agent system, one for each component). However, one aspect that does not get highlighted enough is that GEPA is a text evolution engine: given a target metric, GEPA can efficiently search/evolve the right text to improve that metric. What the text represents is up to the user. For example, in this notebook (github.com/gepa-ai/gepa/b…), we use text to represent full agent code, and GEPA ends up discovering a very sophisticated agent (that can perform self-reflection and iterative refinement on code) for ARC-AGI, improving Gemini-2.5-Pro's score by +5.5% on ARC-AGI-1. In the paper, we also explore generating fast code kernels for hardware like GPUs/NPUs.
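The general shape of "evolve a piece of text against a target metric" can be shown with a toy loop. To be clear, this is not GEPA's algorithm or its API: it is a plain greedy search, and `evolve_text`, `propose`, `metric`, and the `HINTS` pool are all hypothetical names made up for illustration. A real system would use an LLM to propose edits and a task benchmark as the metric.

```python
from typing import Callable, Iterable


def evolve_text(
    seed: str,
    propose: Callable[[str], Iterable[str]],
    metric: Callable[[str], float],
    rounds: int = 10,
) -> tuple[str, float]:
    """Greedy text search: keep a candidate only if it improves the metric."""
    best, best_score = seed, metric(seed)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best):
            score = metric(candidate)
            if score > best_score:
                best, best_score = candidate, score
                improved = True
                break  # propose again, starting from the new best text
        if not improved:
            break  # no proposal helps any more
    return best, best_score


# Hypothetical edit pool; an LLM would normally generate these proposals.
HINTS = ["Think step by step.", "Be polite.", "Check your answer before replying."]


def propose(text: str) -> list[str]:
    """Propose variants of the text by appending one unused hint."""
    return [f"{text} {hint}" for hint in HINTS if hint not in text]


def metric(text: str) -> float:
    """Toy metric: reward two target phrases, lightly penalise length."""
    wanted = ("step by step", "check your answer")
    return sum(phrase in text.lower() for phrase in wanted) - 0.001 * len(text)


best_prompt, best_score = evolve_text("Solve the puzzle.", propose, metric)
print(best_prompt)
```

Because the loop only sees the text and a score, the same structure applies whether the text is a prompt, agent code, or a kernel; only `propose` and `metric` change.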


SamT@st_bomba·19 Sep

@akshay_pachaar 👍🏻👍🏻👍🏻


Akshay 🚀@akshay_pachaar·19 Sep

are you interested in learning about robotics?


SamT reposted

Shubham Saboo@Saboo_Shubham_·17 Sep

China's Alibaba just dropped an open-source 30B agentic LLM that outperforms Claude 4 Sonnet, DeepSeek v3.1, and Kimi k2 on a range of agentic search benchmarks. Only ~3B parameters are activated per token. 100% open-source.


SamT reposted

David Sinclair@davidasinclair·17 Sep

Announcing “K-Dense”, a multi-agent AI scientist that has already made a new discovery in aging research 🧵 @ashwingop & @BioStateAI tinyurl.com/3dmraa5k


SamT reposted

swyx@swyx·15 Sep

this is the most important chart on the new gpt-5-codex model. We are just beginning to exploit the potential of good routing and variable thinking: easy responses are now >15x faster, but for the hard stuff, 5-codex now thinks 102% more than 5. Same model, same paradigm, but bending the curve to fit the nonlinearity of coding problems and LLM use cases.

OpenAI@OpenAI

We’re releasing GPT-5-Codex, a version of GPT-5 further optimized for agentic coding in Codex. Available in the Codex CLI, IDE Extension, web, mobile, and for code reviews in GitHub. openai.com/index/introduc…
