Haseeb Raja

21.4K posts

@rh__147

Is it just me or is it getting crazier out there?

Seoul, Republic of Korea · Joined July 2015
99 Following · 1K Followers
Pinned Tweet
Haseeb Raja
Haseeb Raja@rh__147·
Just to make this my pinned tweet kek
Haseeb Raja tweet media
English
0
0
0
0
Thariq
Thariq@trq212·
a prompt I've been using a lot recently: implement <SPEC> and while you do, keep a running implementation-notes.html file (or markdown) with decisions you had to make that weren't in the spec, things you had to change, tradeoffs you had to make or anything else I should know
Thariq tweet media
English
341
580
9.7K
813.8K
Haseeb Raja
Haseeb Raja@rh__147·
@rasbt Pretty satisfying to see everyone taking part in the open-source push, especially Qwen. Currently playing around with the 9B on my Mac for reflection over a toy instruction dataset. I wish Meta was still there and had kept focusing on mid-size open models instead of making the leap to big ones.
English
1
0
6
1.6K
Sebastian Raschka
Sebastian Raschka@rasbt·
While waiting for DeepSeek V4 we got two very strong open-weight LLMs from India yesterday. There are two size flavors, Sarvam 30B and Sarvam 105B (both reasoning models).
Interestingly, the smaller 30B model uses “classic” Grouped Query Attention (GQA), whereas the larger 105B variant switched to DeepSeek-style Multi-Head Latent Attention (MLA). As I wrote about in my analyses before, both are popular attention variants to reduce KV cache size (the longer the context, the more you save compared to regular attention). MLA is more complicated to implement, but it can give you better modeling performance if we go by the ablation studies in the 2024 DeepSeek V2 paper (as far as I know, this is still the most recent apples-to-apples comparison).
Speaking of modeling performance, the 105B model is on par with LLMs of similar size: gpt-oss 120B and Qwen3-Next (80B). Sarvam is better on some tasks and worse on others, but roughly the same on average. It’s not the strongest coder in SWE-Bench Verified terms, but it is surprisingly good at agentic reasoning and task completion (Tau2). It’s even better than DeepSeek R1 0528.
As for the smaller Sarvam 30B, perhaps the most comparable model is Nemotron 3 Nano 30B, which is slightly ahead in coding per SWE-Bench Verified and agentic reasoning (Tau2) but slightly worse in some other aspects (Live Code Bench v6, BrowseComp). Unfortunately, Qwen3-30B-A3B is missing in the benchmarks, which, as far as I know, is the most popular model of that size class. Interestingly, though, the Sarvam team compared their 30B model to Qwen3-30B-A3B on a computational performance analysis, where they found that Sarvam gets 20-40% more tokens/sec throughput compared to Qwen3 due to code and kernel optimizations.
Anyway, one thing that is not captured by the benchmarks above is Sarvam’s good performance on Indian languages. According to a judge model, the Sarvam team found that their model is preferred 90% of the time compared to others when it comes to Indian texts. (Since they built and trained the tokenizer from scratch as well, Sarvam also comes with a 4 times higher token efficiency on Indian languages.)
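To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch. All dimensions below are made-up illustration values, not the actual Sarvam or DeepSeek configurations. It assumes standard multi-head attention caches one key and one value vector per head, GQA does the same for a smaller number of KV heads, and MLA caches one compressed latent plus a small decoupled positional key per token per layer (as described in the DeepSeek V2 paper).

```python
def kv_cache_bytes(elems_per_token_per_layer, n_layers, context_len, bytes_per_elem=2):
    """Total KV-cache size in bytes for one sequence (2 bytes per element ~ fp16/bf16)."""
    return elems_per_token_per_layer * n_layers * context_len * bytes_per_elem

# Hypothetical model shape, chosen only for round numbers.
n_layers, n_heads, head_dim = 32, 32, 128
n_kv_heads = 8                    # GQA: several query heads share one KV head
latent_dim, rope_dim = 512, 64    # MLA: compressed latent + decoupled positional key

mha_elems = 2 * n_heads * head_dim       # keys + values for every head
gqa_elems = 2 * n_kv_heads * head_dim    # keys + values for KV heads only
mla_elems = latent_dim + rope_dim        # one latent vector + one small key

for context_len in (4_096, 131_072):
    for name, elems in (("MHA", mha_elems), ("GQA", gqa_elems), ("MLA", mla_elems)):
        gib = kv_cache_bytes(elems, n_layers, context_len) / 1024**3
        print(f"{name} @ {context_len:>7} tokens: {gib:8.2f} GiB")
```

With these assumed numbers GQA needs a quarter of the MHA cache and MLA roughly 7% of it. The ratio is constant, so the absolute saving grows linearly with context length.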
Sebastian Raschka tweet media
Pratyush Kumar@pratykumar

📢 Open-sourcing the Sarvam 30B and 105B models! Trained from scratch with all data, model research and inference optimisation done in-house, these models punch above their weight in most global benchmarks plus excel in Indian languages. Get the weights at Hugging Face and AIKosh. Thanks to the good folks at SGLang for day 0 support, vLLM support coming soon. Links, benchmark scores, examples, and more in our blog - sarvam.ai/blogs/sarvam-3…

English
45
683
4.1K
254.7K
Self
Self@SelfInfinity·
@rh__147 @AndrewYNg I pay for the service they offer, but I also use @syncthing (open source) to sync the Obsidian vault between devices (e.g., between macOS and Linux)
English
1
0
0
50
Andrew Ng
Andrew Ng@AndrewYNg·
AI agents are getting better at looking at different types of data in businesses to spot patterns and create value. This is making data silos increasingly painful. This is why I increasingly try to select software that lets me control my own data, so I can make it available to my AI agents. Because of AI’s growing capabilities, the value you can now create from “connecting the dots” between different pieces of data is higher than ever. For example, if an email click is logged in one vendor’s system and a subsequent online purchase is logged in a different one, then it is valuable to build agents that can access both of these data sources to see how they correlate to make better decisions. Unfortunately, many SaaS vendors try to create a data silo in their customer’s business. By making it hard for you to extract your data, they create high switching costs. This also allows them to steer you to buy their AI agent services — sometimes at high expense and/or of low quality — rather than build your own or buy from a different vendor. Unfortunately, some SaaS vendors are seeing AI agents coming for this data and working to make it harder for you (and your AI agents) to efficiently access it. One of my teams just told me that a SaaS vendor we have been using to store our customer data wants to charge over $20,000 for an API key to get at our data. This high cost — no doubt intentionally designed to make it hard for customers to get their data out — is adding a barrier to implementing agentic workflows that take advantage of that data. Through AI Aspire (an AI advisory firm), I advise a number of businesses on their AI strategies. When it comes to buying SaaS, I often advise them to try to control their own data (which, sadly, some vendors mightily resist). This way, you can hire a SaaS vendor to record and operate on your data, but ultimately you decide how to route it to the appropriate human or AI system for processing. 
Over the past decade, a lot of work has gone into organizing businesses’ structured data. Because AI can now process unstructured data much better than before, the value of organizing your unstructured data (including PDF files, which LandingAI’s Agentic Document Extraction specializes in!) is higher than ever before. In the era of generative AI, businesses and individuals have important work ahead to organize their data to be AI-ready. P.S. As an individual, my favorite note-taking app is Obsidian. I am happy to “hire” Obsidian to operate on my notes files. And, all my notes are saved as Markdown files in my file system, and I have built AI agents that read from or write to my Obsidian files. This is a small example of how controlling my own notes data lets me do more with AI agents! [Original text: deeplearning.ai/the-batch/issu… ]
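The email-click and purchase example above can be sketched as a small join across two exports. This is only an illustration: the record layouts, field names (`customer`, `clicked_at`, `bought_at`, `amount`) and the one-day attribution window are all assumptions, and real vendor exports will look different.

```python
from datetime import datetime, timedelta

def purchases_after_click(clicks, purchases, window=timedelta(days=1)):
    """Return purchases made by a customer within `window` after one of their email clicks."""
    clicks_by_customer = {}
    for click in clicks:
        clicks_by_customer.setdefault(click["customer"], []).append(click["clicked_at"])

    matched = []
    for purchase in purchases:
        for clicked_at in clicks_by_customer.get(purchase["customer"], []):
            if timedelta(0) <= purchase["bought_at"] - clicked_at <= window:
                matched.append(purchase)
                break  # count each purchase once, even if several clicks qualify
    return matched

# Hypothetical export from the email vendor.
clicks = [
    {"customer": "c1", "clicked_at": datetime(2024, 1, 1, 9, 0)},
    {"customer": "c2", "clicked_at": datetime(2024, 1, 1, 9, 0)},
]
# Hypothetical export from the e-commerce vendor.
purchases = [
    {"customer": "c1", "bought_at": datetime(2024, 1, 1, 10, 30), "amount": 40.0},
    {"customer": "c2", "bought_at": datetime(2024, 1, 5, 9, 0), "amount": 15.0},
    {"customer": "c3", "bought_at": datetime(2024, 1, 1, 9, 5), "amount": 99.0},
]

matched = purchases_after_click(clicks, purchases)
print([p["customer"] for p in matched])
```

The join is only possible when both exports are reachable and share a customer identifier, which is the practical cost of a data silo.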
English
130
335
2K
168.3K
Self
Self@SelfInfinity·
@AndrewYNg 👌 for Obsidian. Open source, Markdown, great sync. No need for any SaaS note-taking app.
English
2
1
9
3.7K
Surya
Surya@sdand·
What if next-token prediction wasn't a single forward pass, but a tiny optimization problem? Introducing nanoEBM: a tiny transformer that learns to think harder by doing gradient descent on its own predictions. You can start training on your Mac now; it comes in at < 400 lines
GIF
English
22
73
718
69.5K
Haseeb Raja
Haseeb Raja@rh__147·
@bryan_johnson AI engineer here. Eating meat or not is a subjective truth. However, don’t go around claiming that AI will be dominant “very soon”. First, we are not even remotely close. Second, we currently lack the most important ingredient, i.e., EQ.
English
0
0
0
7
Bryan Johnson
Bryan Johnson@bryan_johnson·
I don’t eat meat for two reasons. First, scientific evidence paints a path to optimal health without it. Second, very soon, AI may be as powerful and dominant to us as we are to animals. It’s prudent to keep this in mind.
English
1.6K
264
4.5K
1M
Haseeb Raja
Haseeb Raja@rh__147·
@fchollet Finding new algorithms is not an LLM's job. That discipline is RL! We keep improving the data and its level of detail so that our models learn better. Explicit is better than implicit. Sure, LMs learn patterns, but they won’t if we feed them garbage.
English
0
0
0
14
Haseeb Raja
Haseeb Raja@rh__147·
@fchollet I think you are missing the point. Your model is only as good as its data. In the past we used raw text and images. Then we started preprocessing the data, i.e., making the data good. Then came planning and reasoning, and now we are improving the data for those.
English
1
0
1
96
François Chollet
François Chollet@fchollet·
You can teach a Transformer to execute a simple algorithm if you provide the exact step by step algorithm during training via CoT tokens. This is interesting, but the point of machine learning should be to *find* the algorithm during training, from input/output pairs only -- not just memorize an externally provided algorithm. Pretty trivial program synthesis techniques can achieve just that in the case of multiplication. Because if you already have the algorithm, you can just write it down and execute it instead of training a Transformer to inefficiently encode it.
Rohan Paul@rohanpaul_ai

A beautiful paper from MIT + Harvard + @GoogleDeepMind 👏 Explains why Transformers miss multi-digit multiplication and shows a simple bias that fixes it. The researchers trained two small Transformer models on 4-digit-by-4-digit multiplication. One used a special training method called implicit chain-of-thought (ICoT), where the model first sees every intermediate reasoning step, and then those steps are slowly removed as training continues. This forces the model to “think” internally rather than rely on the visible steps. That model learned the task perfectly: it produced the right answer for every example (100% accuracy). The other model was trained the normal way, called standard fine-tuning, where it only saw the input numbers and the final answer, not the reasoning steps. That model almost completely failed: it only got about 1% of the answers correct. In short, the model trained with ICoT gets 100% on 4x4 multiplication, while standard fine-tuning could not learn it at all
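The "steps are slowly removed" idea can be sketched as a tiny curriculum function. This is a simplified illustration, not the paper's code: the linear schedule, removing steps from the front, and the name `icot_target` are all assumptions, and the toy example uses 2-digit numbers for brevity.

```python
def icot_target(steps, answer, epoch, steps_removed_per_epoch=1):
    """Build the training target for a given epoch.

    Early epochs keep every intermediate step; each later epoch drops more
    steps from the front until only the final answer remains.
    """
    n_removed = min(len(steps), epoch * steps_removed_per_epoch)
    return steps[n_removed:] + [answer]

# 12 x 34 written out as partial products (hypothetical tokenization).
steps = ["12*4=48", "12*30=360", "48+360=408"]
answer = "408"

for epoch in range(5):
    print(epoch, icot_target(steps, answer, epoch))
```

At epoch 0 the target contains the full chain of thought, and from epoch 3 onward it is just the answer, which matches what standard fine-tuning sees from the start.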

English
83
195
2.1K
281.9K
ThePrimeagen
ThePrimeagen@ThePrimeagen·
i am a failure
ThePrimeagen tweet media
English
189
17
1.8K
110.7K
“paula”
“paula”@paularambles·
> lunch with european friend visiting sf > we split the bill, it's $60 each > apple pay prompts 25%, 20%, 15% > hoping he tips at least 20% > he selects “other” > i think wow okay, generous > he tips $5
English
999
93
13.6K
1.8M
Google Gemini
Google Gemini@GeminiApp·
We're upgrading our AI subscription plans to Google AI Pro and Google AI Ultra. Google AI Ultra is perfect for our Gemini app power users, the trailblazers who want the highest rate limits and early access to our most capable models and features, including Veo 3.
Google Gemini tweet media
English
198
307
3.1K
437.7K
Nate Brown
Nate Brown@ntbrown01·
@theo Grok is worth every penny if you’re a developer and/or you like to build your own tools.
English
9
0
4
3K
Theo - t3.gg
Theo - t3.gg@theo·
$50/month?? Are they insane
Theo - t3.gg tweet media
English
254
66
4.3K
413.9K
Haseeb Raja
Haseeb Raja@rh__147·
@rasbt @karpathy They must be using summarized conversation buffers, right? But I wonder what happens if the summary also exceeds the context limit.
English
0
0
0
86
Sebastian Raschka
Sebastian Raschka@rasbt·
@karpathy I always wonder what happens under the hood when the conversation exceeds the LLM-supported context limit for services like ChatGPT. Do they a) simply cut it off like `previous_tokens[-supported_context_length:]` or b) summarize/compress the previous chat history?
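Both options can be sketched in a few lines. Nothing here reflects how ChatGPT or any other service actually works. Whitespace splitting stands in for a real tokenizer, `summarize` is a placeholder for an LLM call, and hard-capping an over-long summary is just one possible fallback.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def truncate(tokens, limit):
    """Option (a): keep only the most recent `limit` tokens."""
    return tokens[-limit:] if limit > 0 else []

def compress(messages, limit, summarize, keep_recent=2):
    """Option (b): summarize older messages and keep the newest ones verbatim."""
    if sum(count_tokens(m) for m in messages) <= limit:
        return list(messages)

    split = max(len(messages) - keep_recent, 0)
    older, recent = messages[:split], messages[split:]
    budget = limit - sum(count_tokens(m) for m in recent)
    if budget <= 0:
        # Even the recent messages overflow: fall back to plain truncation.
        flat = " ".join(recent).split()
        return [" ".join(truncate(flat, limit))]

    summary_tokens = summarize(older).split()
    # If the summary itself is too long, hard-cap it to the remaining budget.
    summary = " ".join(summary_tokens[:budget])
    return [summary] + recent

messages = ["a b c d e", "f g h i j", "k l", "m n"]
naive_summary = lambda msgs: " ".join(msgs)  # deliberately too long
print(compress(messages, limit=6, summarize=naive_summary))
```

A real system might re-summarize the summary instead of cutting it, at the cost of another model call and further loss of detail.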
English
31
3
141
27.5K
Andrej Karpathy
Andrej Karpathy@karpathy·
When working with LLMs I am used to starting "New Conversation" for each request. But there is also the polar opposite approach of keeping one giant conversation going forever. The standard approach can still choose to use a Memory tool to write things down in between conversations (e.g. ChatGPT does so), so the "One Thread" approach can be seen as the extreme special case of using memory always and for everything.
The other day I came across someone saying that their conversation with Grok (which was free to them at the time) has now grown way too long for them to switch to ChatGPT, i.e. it functions like a moat hah.
LLMs are rapidly growing in the allowed maximum context length *in principle*, and it's clear that this might allow the LLM to have a lot more context and knowledge of you, but there are some caveats. A few of the major ones, as examples:
- Speed. A giant context window will cost more compute and will be slower.
- Ability. Just because you can feed in all those tokens doesn't mean that they can also be manipulated effectively by the LLM's attention and its in-context-learning mechanism for problem solving (the simplest demonstration is the "needle in the haystack" eval).
- Signal to noise. Too many tokens fighting for attention may *decrease* performance due to being too "distracting", diffusing attention too broadly and decreasing a signal to noise ratio in the features.
- Data; i.e. train - test data mismatch. Most of the training data in the finetuning conversation is likely ~short. Indeed, a large fraction of it in academic datasets is often single-turn (one single question -> answer). One giant conversation forces the LLM into a new data distribution it hasn't seen that much of during training. This is in large part because...
- Data labeling. Keep in mind that LLMs still primarily and quite fundamentally rely on human supervision. A human labeler (or an engineer) can understand a short conversation and write optimal responses or rank them, or inspect whether an LLM judge is getting things right. But things grind to a halt with giant conversations. Who is supposed to write or inspect an alleged "optimal response" for a conversation of a few hundred thousand tokens?
Certainly, it's not clear if an LLM should have a "New Conversation" button at all in the long run. It feels a bit like an internal implementation detail that is surfaced to the user for developer convenience and for the time being. And that the right solution is a very well-implemented memory feature, along the lines of active, agentic context management. Something I haven't really seen at all so far. Anyway curious to poll if people have tried One Thread and what the word is.
English
665
550
6.6K
830.5K
Haseeb Raja
Haseeb Raja@rh__147·
@karpathy New chats for quick work and research. Projects for mid- to long-term work. Use a hand-over prompt each time the conversation gets long. I think this works well in many cases.
English
0
0
0
10
Andrew Ng
Andrew Ng@AndrewYNg·
Some people today are discouraging others from learning programming on the grounds AI will automate it. This advice will be seen as some of the worst career advice ever given. I disagree with the Turing Award and Nobel prize winner who wrote, “It is far more likely that the programming occupation will become extinct [...] than that it will become all-powerful. More and more, computers will program themselves.”​ Statements discouraging people from learning to code are harmful! In the 1960s, when programming moved from punchcards (where a programmer had to laboriously make holes in physical cards to write code character by character) to keyboards with terminals, programming became easier. And that made it a better time than before to begin programming. Yet it was in this era that Nobel laureate Herb Simon wrote the words quoted in the first paragraph. Today’s arguments not to learn to code continue to echo his comment. As coding becomes easier, more people should code, not fewer! Over the past few decades, as programming has moved from assembly language to higher-level languages like C, from desktop to cloud, from raw text editors to IDEs to AI assisted coding where sometimes one barely even looks at the generated code (which some coders recently started to call vibe coding), it is getting easier with each step. I wrote previously that I see tech-savvy people coordinating AI tools to move toward being 10x professionals — individuals who have 10 times the impact of the average person in their field. I am increasingly convinced that the best way for many people to accomplish this is not to be just consumers of AI applications, but to learn enough coding to use AI-assisted coding tools effectively. One question I’m asked most often is what someone should do who is worried about job displacement by AI. 
My answer is: Learn about AI and take control of it, because one of the most important skills in the future will be the ability to tell a computer exactly what you want, so it can do that for you. Coding (or getting AI to code for you) is a great way to do that. When I was working on the course Generative AI for Everyone and needed to generate AI artwork for the background images, I worked with a collaborator who had studied art history and knew the language of art. He prompted Midjourney with terminology based on the historical style, palette, artist inspiration and so on — using the language of art — to get the result he wanted. I didn’t know this language, and my paltry attempts at prompting could not deliver as effective a result. Similarly, scientists, analysts, marketers, recruiters, and people of a wide range of professions who understand the language of software through their knowledge of coding can tell an LLM or an AI-enabled IDE what they want much more precisely, and get much better results. As these tools are continuing to make coding easier, this is the best time yet to learn to code, to learn the language of software, and learn to make computers do exactly what you want them to do. [Original text: deeplearning.ai/the-batch/issu… ]
English
514
2.8K
11.9K
2.1M
Haseeb Raja
Haseeb Raja@rh__147·
@davidtsolheim @AndrewYNg Coding is not just typing syntax, or typing syntax with the help of AI. Coding is just a problem-solving tool. Every engineer needs to be a good problem solver.
English
1
0
1
244
David Solheim
David Solheim@davidtsolheim·
People would be better off learning how to code using AI than learning how to code from scratch. I understand where you’re coming from, especially as a cofounder of a company that sells courses, but to say that people should learn programming for the sake of programming is not good advice.
English
3
0
7
4.4K
Haseeb Raja
Haseeb Raja@rh__147·
@lexfridman Zelenskyy acted like a real leader. Only adult in the room.
English
0
0
1
8