dave

1.1K posts

@dvxdo

member of technical staff. building agents and evals in a loop

Joined August 2012

1.8K Following · 576 Followers
dave retweeted
Lou
Lou@louszbd·
interesting new work from Alibaba and WHU (Agentic Memory). most agent memory systems now are basically hardcoded infra, vector db + hand-written rules for when to store/delete/summarize. the model never gets to touch any of it. they made memory ops into actions. add, delete, update, retrieve, summarize, filter, same as calling a tool. then RL trains the whole thing end to end. the neat part is the model discovers on its own that it should proactively clean up its context when things get noisy. nobody wrote a "if tokens > 4k then summarize" rule. And it just emerged from the reward signal. makes you wonder how many other parts of the RAG pipeline are secretly just learnable actions we've been hand-coding for no good reason. arxiv.org/abs/2601.01885
Lou tweet media
35
96
724
46.7K
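The memory-ops-as-tools idea above can be sketched in a few lines. This is a hypothetical stand-in, not the paper's actual API: keyword matching stands in for vector search, string joining stands in for an LLM summarization call, and the point is only that each operation is an ordinary tool the policy can learn to invoke.

```python
class AgentMemory:
    """Toy memory whose operations are actions a policy can choose."""

    def __init__(self):
        self.entries = {}   # id -> text
        self._next_id = 0

    def add(self, text):
        self._next_id += 1
        self.entries[self._next_id] = text
        return self._next_id

    def update(self, entry_id, text):
        self.entries[entry_id] = text

    def delete(self, entry_id):
        self.entries.pop(entry_id, None)

    def retrieve(self, query):
        # Keyword match standing in for vector search.
        return [t for t in self.entries.values() if query.lower() in t.lower()]

    def summarize(self):
        # Stand-in for an LLM call that compresses noisy context.
        merged = " | ".join(self.entries.values())
        self.entries = {0: merged}
        self._next_id = 0


# Exposed to the model as ordinary tools; RL decides when to call each,
# rather than an "if tokens > 4k then summarize" rule deciding for it.
TOOLS = ["add", "update", "delete", "retrieve", "summarize"]
```

Under that framing, "proactively clean up context" is just the policy learning to emit `summarize` or `delete` calls when they improve reward.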
Ebuka l Socrates🦙🐍🦜🦀
"Imagine a 17-year-old boy built an LLM @OkeyMeta from scratch" Meanwhile it is a llama3 model from @GroqInc inference via API Y'all are giving a platform and publicity to a FRAUDULENT PERSON and using age sentiment as a waiver @real_okechukwu you cannot deceive everyone
Tosin Olugbenga@TosinOlugbenga

I honestly love how creative this generation is. Imagine a 17-year-old boy built an LLM @OkeyMeta from scratch. At 17! This is the kind of story that makes you stop and think about how much potential young people carry. Join me this Thursday with @real_okechukwu as we hear his journey firsthand. 🔗 x.com/i/spaces/1lpkq…

41
36
295
73.8K
dave
dave@dvxdo·
Happy to know you took the time to read the paper :) Yes, the model is part of the GPT-J family; we never stated we invented a new architecture. Similarly, almost every SoTA model out today is based on an MoE architecture, and the likes of Mistral, Gemma, and Qwen are all based on Llama's architecture
1
0
0
31
dave retweeted
nanda
nanda@nandafyi·
New post 🎉 Going back to my roots on writing about the inner workings of things, a breakdown of key-value databases and how you might make one from scratch: nan.fyi/database
nanda tweet media
70
250
3K
328.7K
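A minimal sketch of the core trick behind many from-scratch key-value stores: an append-only log of records plus an in-memory index mapping each key to the offset of its latest record, with compaction to drop superseded entries. This is an illustrative toy, not code from the linked post.

```python
class TinyKV:
    """Log-structured key-value store, in-memory for illustration."""

    def __init__(self):
        self.log = []    # append-only list of (key, value) records
        self.index = {}  # key -> offset of the latest record for that key

    def set(self, key, value):
        # Writes never mutate old records; they append and repoint the index.
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        if key not in self.index:
            raise KeyError(key)
        return self.log[self.index[key]][1]

    def compact(self):
        # Rewrite the log keeping only the latest record per key.
        live = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.log, self.index = [], {}
        for k, v in live:
            self.set(k, v)
```

On disk, `log` becomes an append-only file and `index` maps keys to byte offsets; the append-only write path is what makes this design fast.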
dave retweeted
Andrew Ng
Andrew Ng@AndrewYNg·
Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system's performance) and error analysis (identifying the causes of errors). It's tempting to shortcut these processes and to quickly attempt fixes to mistakes rather than slowing down to identify the root causes. But evals and error analysis can lead to much faster progress.

In this first of a two-part letter, I'll share some best practices for finding and addressing issues in agentic systems. Even though error analysis has long been an important part of building supervised learning systems, it is still underappreciated compared to, say, using the latest and buzziest tools. Identifying the root causes of particular kinds of errors might seem "boring," but it pays off!

If you are not yet persuaded that error analysis is important, permit me to point out:

- To master a composition on a musical instrument, you don't only play the same piece from start to end. Instead, you identify where you're stumbling and practice those parts more.
- To be healthy, you don't just build your diet around the latest nutrition fads. You also ask your doctor about your bloodwork to see if anything is amiss. (I did this last month and am happy to report I'm in good health! 😃)
- To improve your sports team's performance, you don't just practice trick shots. Instead, you review game films to spot gaps and then address them.

To improve your agentic AI system, don't just stack up the latest buzzy techniques that just went viral on social media (though I find it fun to experiment with buzzy AI techniques as much as the next person!). Instead, use error analysis to figure out where it's falling short, and focus on that.

Before analyzing errors, we first have to decide what is an error. So the first step is to put in evals. I'll focus on that for the remainder of this letter and discuss error analysis next week.

If you are using supervised learning to train a binary classifier, the number of ways the algorithm could make a mistake is limited. It could output 0 instead of 1, or vice versa. There is also a handful of standard metrics, like accuracy, precision, recall, F1, and ROC, that apply to many problems. So as long as you know the test distribution, evals are relatively straightforward, and much of the work of error analysis lies in identifying what types of input an algorithm fails on, which also leads to data-centric AI techniques for acquiring more data to augment the algorithm in areas where it's weak.

With generative AI, a lot of intuitions from evals and error analysis of supervised learning carry over (history doesn't repeat itself, but it rhymes), and developers who are already familiar with machine learning and deep learning often adapt to generative AI faster than people who are starting from scratch. But one new challenge is that the space of outputs is much richer, so there are many more ways an algorithm's output might be wrong.

Take the example of automated processing of financial invoices, where we use an agentic workflow to populate a financial database with information from received invoices. Will the algorithm incorrectly extract the invoice due date? Or the final amount? Or mistake the payer address for the biller address? Or get the currency wrong? Or make the wrong API call so the verification process fails? Because the output space is much larger, the number of failure modes is also much larger. Rather than defining an error metric ahead of time, it is therefore typically more effective to first quickly build a prototype, then manually examine a handful of agent outputs to see where it performs well and where it stumbles.

This allows you to focus on building datasets and error metrics (sometimes objective metrics implemented in code, and sometimes subjective metrics using LLM-as-judge) to check the system's performance in the dimensions you are most concerned about. In supervised learning, we sometimes tune the error metric to better reflect what humans care about. With agentic workflows, I find tuning evals to be even more iterative, with more frequent tweaks to the evals to capture the wider range of things that can go wrong. I discuss this and other best practices in detail in Module 4 of the Agentic AI course on deeplearning.ai that we announced last week.

After building evals, you now have a measurement of your system's performance, which provides a foundation for trying different modifications to your agent, as you can now measure what makes a difference. The next step is then to perform error analysis to pinpoint what changes to focus your development efforts on. I'll discuss this further next week.

[Original text: deeplearning.ai/the-batch/issu… ]
85
286
1.7K
311.9K
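The "put in evals first" step can be as small as a per-field accuracy check. A hedged sketch for the invoice-extraction example: the field names, labels, and predictions below are hypothetical stand-ins for a real labeled eval set and real agent outputs.

```python
def field_accuracy(predictions, labels, fields):
    """Objective, code-implemented metric: per-field exact-match accuracy."""
    scores = {}
    for f in fields:
        correct = sum(p.get(f) == l.get(f) for p, l in zip(predictions, labels))
        scores[f] = correct / len(labels)
    return scores


# Hand-labeled ground truth for two invoices (illustrative data).
labels = [
    {"due_date": "2025-01-15", "amount": "120.00", "currency": "USD"},
    {"due_date": "2025-02-01", "amount": "80.50", "currency": "EUR"},
]
# Imagine these came from the agent under test.
predictions = [
    {"due_date": "2025-01-15", "amount": "120.00", "currency": "USD"},
    {"due_date": "2025-02-01", "amount": "80.50", "currency": "USD"},
]

scores = field_accuracy(predictions, labels, ["due_date", "amount", "currency"])
# Low-scoring fields (here, currency) are where error analysis starts.
```

Fields that resist exact matching (free-text addresses, say) are where an LLM-as-judge metric would replace the `==` comparison.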
dave retweeted
Tom Yeh
Tom Yeh@ProfTomYeh·
Copy-pasting PyTorch code is fast — using an AI coding model is even faster — but both skip the learning. That's why I asked my students to write by hand ✍️. 🔽 Download: byhand.ai/pytorch After the exercise, my students can understand what every line really does and connect it to the math. You can too!
9
75
521
34K
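In the spirit of the exercise: what a line like `y = torch.relu(linear(x))` "really does", written out by hand so each operation maps one-to-one onto the math y = max(0, Wx + b). Pure Python with illustrative numbers; this is a sketch of the idea, not material from the linked worksheets.

```python
def linear_relu(W, b, x):
    # Wx + b: each output is a dot product of a weight row with x, plus a bias.
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    # ReLU: clamp negatives to zero, elementwise.
    return [max(0.0, p) for p in pre]


W = [[1.0, -1.0],   # 2x2 weight matrix (would be nn.Linear(2, 2).weight)
     [0.5,  0.5]]
b = [0.0, -2.0]     # bias vector
x = [3.0, 1.0]      # input
y = linear_relu(W, b, x)
```

Writing it this way makes the shapes obvious: one output per weight row, one multiply-add per weight entry.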
dave retweeted
Unsloth AI
Unsloth AI@UnslothAI·
LoRA in reinforcement learning (RL) can match full-finetuning performance when done right! 💡 A new @thinkymachines post shows how using 10x larger learning rates, applying LoRA on all layers & more, LoRA at rank=1 even works. We're excited to have collaborated on this blog!
Unsloth AI tweet media
Thinking Machines@thinkymachines

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

20
146
953
72.7K
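The LoRA update under discussion is a small piece of math: the frozen weight W gains a low-rank additive term (alpha/r)·B·A, which at rank r=1 reduces to an outer product of a column vector and a row vector. A dependency-free sketch with illustrative numbers; note that real LoRA initializes B to zero so training starts exactly at the base model, while B is nonzero here only so the delta is visible.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


d, r, alpha = 3, 1, 2.0            # hidden size, LoRA rank, scaling numerator
W = [[1.0, 0.0, 0.0],              # frozen d x d base weight (identity here)
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[0.0, 1.0, 0.0]]              # r x d, trainable "down" projection
B = [[0.5], [0.0], [0.0]]          # d x r, trainable "up" projection


def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    base = matvec(W, x)            # frozen path
    low = matvec(A, x)             # project to r dimensions
    delta = matvec(B, low)         # project back to d dimensions
    return [b + (alpha / r) * dl for b, dl in zip(base, delta)]
```

At r=1 the adapter adds only 2d trainable parameters per matrix, which is why the "LoRA on all layers, larger learning rate" recipe is so cheap to try.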
dave retweeted
dave
dave@dvxdo·
quality slop
Ahmad@TheAhmadOsman

> be you
> want to actually learn how LLMs work
> sick of “just start with linear algebra and come back in 5 years”
> decide to build my own roadmap
> no fluff. no detours. no 200-hour generic ML playlists
> just the stuff that actually gets you from “what’s a token?” to “I trained a mini-GPT with LoRA adapters and FlashAttention”
> goal: build, fine-tune, and ship LLMs
> not vibe with them. not "learn the theory" forever
> build them
> you will:
> > build an autograd engine from scratch
> > write a mini-GPT from scratch
> > implement LoRA and fine-tune a model on real data
> > hate CUDA at least once
> > cry
> > keep going
> 5 phases
> if you already know something? skip
> if you're lost? rewatch
> if you’re stuck? use DeepResearch
> this is a roadmap, not a leash
> by the end: you either built the thing or you didn’t
> phase 0: foundations
> > if matrix multiplication is scary, you’re not ready yet
> > watch 3Blue1Brown’s linear algebra series
> > MIT 18.06 with Strang, yes, he’s still the GOAT
> > code Micrograd from scratch (Karpathy)
> > train a mini-MLP on MNIST
> > no frameworks, no shortcuts, no mercy
> phase 1: transformers
> > the name is scary
> > it’s just stacked matrix multiplies and attention blocks
> > Jay Alammar + 3Blue1Brown for the “aha”
> > Stanford CS224N for the theory
> > read "Attention Is All You Need" only AFTER building mental models
> > Karpathy's "Let's Build GPT" will break your brain in a good way
> > project: build a decoder-only GPT from scratch
> > bonus: swap tokenizers, try BPE/SentencePiece
> phase 2: scaling
> > LLMs got good by scaling, not magic
> > Kaplan paper -> Chinchilla paper
> > learn Data, Tensor, Pipeline parallelism
> > spin up multi-GPU jobs using HuggingFace Accelerate
> > run into VRAM issues
> > fix them
> > welcome to real training hell
> phase 3: alignment & fine-tuning
> > RLHF: OpenAI blog -> Ouyang paper
> > SFT -> reward model -> PPO (don’t get lost here)
> > Anthropic's Constitutional AI = smart constraints
> > LoRA/QLoRA: read, implement, inject into HuggingFace models
> > fine-tune on real data
> > project: fine-tune gpt2 or distilbert with your own adapters
> > not toy examples. real use cases or bust
> phase 4: production
> this is the part people skip to, but you earned it
> inference optimization: FlashAttention, quantization, sub-second latency
> read the paper, test with quantized models
> resources:
> math/coding:
> > 3Blue1Brown, MIT 18.06, Goodfellow’s book
> PyTorch:
> > Karpathy, Zero to Mastery
> transformers:
> > Alammar, Karpathy, CS224N, Vaswani et al
> scaling:
> > Kaplan, Chinchilla, HuggingFace Accelerate
> alignment:
> > OpenAI, Anthropic, LoRA, QLoRA
> inference:
> > FlashAttention
> the endgame:
> > understand how these models actually work
> > see through hype
> > ignore LinkedIn noise
> > build tooling
> > train real stuff
> > ship your own stack
> > look at a paper and think “yeah I get it”
> > build your own AI assistant, infra, whatever
> make it all the way through?
> ship something real?
> DM me.
> I wanna see what you built.
> happy hacking.

0
0
3
80
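The "build an autograd engine from scratch" step in that roadmap's phase 0 fits in a few dozen lines. A micrograd-style sketch for illustration (scalars, add and mul only): each value records its parents and the local gradient of the op that produced it, and backprop walks the graph in reverse topological order applying the chain rule.

```python
class Value:
    """Scalar with reverse-mode autodiff, micrograd-style."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # tuple of (parent, local_gradient) pairs

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Topological sort, then accumulate gradients via the chain rule.
        topo, seen = [], set()

        def build(v):
            if v not in seen:
                seen.add(v)
                for p, _ in v._parents:
                    build(p)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, local_grad in v._parents:
                p.grad += local_grad * v.grad


a, b = Value(2.0), Value(3.0)
y = a * b + a        # dy/da = b + 1 = 4, dy/db = a = 2
y.backward()
```

Swap scalars for tensors and add a few more ops, and this is the skeleton PyTorch's autograd builds on.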
dave
dave@dvxdo·
I just realized that I have been building out a “demo project” with Kubernetes and load balancers even after squeezing 10k requests/sec out of it
0
0
0
53
dave
dave@dvxdo·
@aigeek__ @jobergum yeah, that was the initial pushback but then people started suggesting skipping evals altogether, instead of pushing for DIY evals
0
0
1
22
ai geek (wishesh) ⚡️
ai geek (wishesh) ⚡️@aigeek__·
@jobergum @DavidOkpare i think it is more of a revolt against the tools than against the fact that people need to start looking at their data more closely to get better results.
1
0
1
118
Jo Kristian Bergum
Jo Kristian Bergum@jobergum·
The upside of the eval wars is that it triggered creation of a lot of high alpha content
9
6
60
6K