Yuval Neeman

@yuvaln

389 posts

Technologist, investor, cheerleader, father

Seattle, WA · Joined March 2009
307 Following · 187 Followers
Yuval Neeman@yuvaln·
100%! Try @opik from @Cometml, an open-source framework to track, evaluate, and optimize agentic applications.
Andrew Ng@AndrewYNg

Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error analysis (identifying the causes of errors). It’s tempting to shortcut these processes and to quickly attempt fixes to mistakes rather than slowing down to identify the root causes. But evals and error analysis can lead to much faster progress. In this first of a two-part letter, I’ll share some best practices for finding and addressing issues in agentic systems.

Even though error analysis has long been an important part of building supervised learning systems, it is still underappreciated compared to, say, using the latest and buzziest tools. Identifying the root causes of particular kinds of errors might seem “boring,” but it pays off! If you are not yet persuaded that error analysis is important, permit me to point out:

- To master a composition on a musical instrument, you don’t only play the same piece from start to end. Instead, you identify where you’re stumbling and practice those parts more.
- To be healthy, you don’t just build your diet around the latest nutrition fads. You also ask your doctor about your bloodwork to see if anything is amiss. (I did this last month and am happy to report I’m in good health! 😃)
- To improve your sports team’s performance, you don’t just practice trick shots. Instead, you review game films to spot gaps and then address them.

To improve your agentic AI system, don’t just stack up the latest buzzy techniques that just went viral on social media (though I find it fun to experiment with buzzy AI techniques as much as the next person!). Instead, use error analysis to figure out where it’s falling short, and focus on that.

Before analyzing errors, we first have to decide what counts as an error. So the first step is to put in evals.
I’ll focus on that for the remainder of this letter and discuss error analysis next week.

If you are using supervised learning to train a binary classifier, the number of ways the algorithm could make a mistake is limited: it could output 0 instead of 1, or vice versa. There is also a handful of standard metrics, such as accuracy, precision, recall, F1, and ROC, that apply to many problems. So as long as you know the test distribution, evals are relatively straightforward, and much of the work of error analysis lies in identifying what types of input an algorithm fails on, which also leads to data-centric AI techniques for acquiring more data to augment the algorithm in areas where it’s weak.

With generative AI, a lot of intuitions from evals and error analysis of supervised learning carry over (history doesn’t repeat itself, but it rhymes), and developers who are already familiar with machine learning and deep learning often adapt to generative AI faster than people who are starting from scratch. But one new challenge is that the space of outputs is much richer, so there are many more ways an algorithm’s output might be wrong.

Take the example of automated processing of financial invoices, where we use an agentic workflow to populate a financial database with information from received invoices. Will the algorithm incorrectly extract the invoice due date? Or the final amount? Or mistake the payer address for the biller address? Or get the currency wrong? Or make the wrong API call so the verification process fails? Because the output space is much larger, the number of failure modes is also much larger.

Rather than defining an error metric ahead of time, it is therefore typically more effective to first quickly build a prototype, then manually examine a handful of agent outputs to see where it performs well and where it stumbles.
This allows you to focus on building datasets and error metrics (sometimes objective metrics implemented in code, and sometimes subjective metrics using LLM-as-judge) to check the system’s performance in the dimensions you are most concerned about.

In supervised learning, we sometimes tune the error metric to better reflect what humans care about. With agentic workflows, I find tuning evals to be even more iterative, with more frequent tweaks to the evals to capture the wider range of things that can go wrong. I discuss this and other best practices in detail in Module 4 of the Agentic AI course on deeplearning.ai that we announced last week.

After building evals, you now have a measurement of your system’s performance, which provides a foundation for trying different modifications to your agent, as you can now measure what makes a difference. The next step is then to perform error analysis to pinpoint what changes to focus your development efforts on. I’ll discuss this further next week. [Original text: deeplearning.ai/the-batch/issu… ]
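The invoice example in the letter can be turned into a small objective eval. This is a minimal sketch, not the course's actual code: the field names (`due_date`, `total`, `currency`) and the sample records are hypothetical, and a real harness would call the agent on each invoice instead of using canned outputs.

```python
# Sketch of a per-field objective eval for an invoice-extraction agent.
# Field names and sample data are hypothetical illustrations.
from collections import Counter

def field_accuracy(predictions, ground_truth, fields):
    """Return per-field accuracy over a hand-labeled eval set."""
    correct = Counter()
    for pred, gold in zip(predictions, ground_truth):
        for f in fields:
            if pred.get(f) == gold.get(f):
                correct[f] += 1
    n = len(ground_truth)
    return {f: correct[f] / n for f in fields}

# Hand-labeled ground truth for two invoices.
gold = [
    {"due_date": "2025-03-01", "total": "120.00", "currency": "USD"},
    {"due_date": "2025-03-15", "total": "99.50", "currency": "EUR"},
]
# Pretend these came from the agent; the second invoice's currency is wrong.
pred = [
    {"due_date": "2025-03-01", "total": "120.00", "currency": "USD"},
    {"due_date": "2025-03-15", "total": "99.50", "currency": "USD"},
]

scores = field_accuracy(pred, gold, ["due_date", "total", "currency"])
print(scores)  # currency scores 0.5; the other fields score 1.0
```

A per-field breakdown like this is what makes the subsequent error analysis concrete: it tells you whether to focus on date parsing, amount extraction, or currency handling rather than on the system as a whole.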

Replies 0 · Reposts 0 · Likes 0 · Views 41
Comet
Comet@Cometml·
⭐ Opik has officially passed 10,000 GitHub Stars! ⭐ Everyone who works at Comet has contributed to Opik over the last 9 months, but the key to Opik’s rapid growth has been its community—meaning you 🤩 In that light, we want to thank some of the people who’ve helped us 🧵
Replies 5 · Reposts 4 · Likes 21 · Views 1.8K
Yuval Neeman reposted
Nimrod Lahav
Nimrod Lahav@Nimrod_Lahav·
#opik is trending again on GitHub! In just six months, we've seen incredible adoption from the AI developer community: contributions, issues, feature ideas, and massive enterprise adoption. The momentum is real! Haven’t tried Opik yet? bit.ly/3DC9T3K @Cometml
Replies 2 · Reposts 4 · Likes 12 · Views 6.2K
Yuval Neeman
Yuval Neeman@yuvaln·
Great perspective!
Yishan@yishan

I think the Deepseek moment is not really the Sputnik moment, but more like the Google moment. If anyone was around in ~2004, you'll know what I mean, but more on that later.

I think everyone is over-rotated on this because Deepseek came out of China. Let me try to un-rotate you. Deepseek could have come out of some lab in the US Midwest. Like say some CS lab couldn't afford the latest nVidia chips and had to use older hardware, but they had a great algo and systems department, and they found a bunch of optimizations and trained a model for a few million dollars and lo, the model is roughly on par with o1. Look everyone, we found a new training method and we optimized a bunch of algorithms! Everyone is like OH WOW and starts trying the same thing. Great week for AI advancement! No need for US markets to lose a trillion in market cap.

The tech world (and apparently Wall Street) is massively over-rotated on this because it came out of CHINA. I get it. After everyone has been sensitized over the H1BLM uproar, we are conditioned to think of OMG Immigrants China as some kind of Alien Other. As though the Alien-Other Chinese Researchers are doing something special that's out of reach and now China The Empire is somehow uniquely in possession of Super Efficient AI Power and the US companies can't compete. The subtext of "A New Fearsome Power Now Under The Command of the CCP" is what's driving the current sentiment, and it's not really valid.

Like, no. These are guys basically working on the same problems we are in the US, and not only that, they wrote a paper about it and open-sourced their model! It is not actually some sort of tectonic geopolitical shift, it is just Some Nerds Over There saying "Hey we figured out some cool shit, here's how we did it, maybe you would like to check it out?"

Sputnik showed that the Soviets could do something the US couldn't ("a new fearsome power"). They didn't subsequently publish all the technical details and half the blueprints.
They only showed that it could be done. With Deepseek, if I recall correctly, a lab in Berkeley read their paper and duplicated the claimed results on a small scale within a day.

That's why I say it's like the Google moment in 2004. Google filed its S-1 in 2004, and revealed to the world that they had built the largest supercomputer cluster by using distributed algorithms to network together commodity computers at the best performance-per-dollar point on the cost curve. This was in contrast to every other tech company, who at that time just bought what were essentially larger and larger mainframes, always at the most expensive leading edge of the cost curve.

(To the young people reading this, this will sound incredible to you.) I worked at PayPal at the time, and in order to keep pace with the rising transaction volume, the company was forced to buy bigger and bigger database servers from Oracle. We were totally Oracle's bitch. At one point when we ran into scalability issues, the Oracle reps told us we were their biggest installation, so they had no other reference point on how to help us overcome our scalability issues. We literally resorted to flipping random config switches and rebooting it.

(This heavily influenced me when I was a young manager later at Facebook. I deliberately torpedoed an Oracle salesman's pitch to try and get us to switch from open source MySQL databases to an Oracle contract: of course we had scalability problems, but at least when we had them, we could open up the hood and figure out how to fix it ... assuming we had good enough engineers, and we did. When it's closed-source infra, you're at the mercy of the vendor's support engineers.)

Back to Google - in their S-1, they described how they were able to leapfrog the scalability limits of mainframes and had been (for years!) running a far more massive networked supercomputer comprised of thousands of commodity machines at the optimal performance-per-dollar price point - i.e.
not the more expensive leading edge - all knit together by fault-tolerant distributed algorithms written in-house. Some time later, Google published their MapReduce and BigTable papers, describing the algorithms they'd used to manage and control this massively more cost-effective and powerful supercomputer.

Deepseek is MUCH more like the Google moment, because Google essentially described what it did and told everyone else how they could do it too. In Google's case, a fair bit of time elapsed between when they revealed to the world what they were doing and when they published the papers showing everyone how to do it. Deepseek, in contrast, published their paper alongside the model release.

Now, I've also written about how I think this is also a demonstration of Deepseek's trajectory, but that's also no different from Google in ~2004 revealing what it was capable of. Competitors will still need to gear up and DO the thing, but they've moved the field forward. But it's not like Sputnik, where the Soviets had developed technology unreachable to the US; it's more like Google saying, "Hey, we did this cool thing, here's how we did it."

There is no reason to think nVidia and OAI and Meta and Microsoft and Google et al are dead. Sure, Deepseek is a new and formidable upstart, but doesn't that happen every week in the world of AI? I am sure that Sam and Zuck, backed by the power of Satya, can figure something out. Everyone is going to duplicate this feat in a few months, and everything just got cheaper. The only real consequence is that AI utopia/doom is now closer than ever.

====

Bonus: This is also a little similar to the Ethereum PoS moment, when AI finally has a counterpoint to the environmentalists who say AI uses so much electricity. We just brought down the cost of inference by 97%!
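The MapReduce idea referenced above fits in a few lines: the user supplies a map function that emits key-value pairs and a reduce function that folds all values for a key, and the framework handles everything in between. A single-machine sketch of the programming model, using the classic word-count example; what Google's real system added was distributing these phases across thousands of machines with fault tolerance.

```python
# Single-machine sketch of the MapReduce programming model.
# The real system ran these phases across thousands of commodity machines.
from collections import defaultdict

def map_phase(inputs, mapper):
    """Apply the user's mapper to each input, collecting (key, value) pairs."""
    pairs = []
    for item in inputs:
        pairs.extend(mapper(item))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user's reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: map emits (word, 1); reduce sums the counts per word.
docs = ["commodity machines", "commodity algorithms"]
counts = reduce_phase(
    shuffle(map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])),
    lambda key, values: sum(values),
)
print(counts)  # {'commodity': 2, 'machines': 1, 'algorithms': 1}
```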

Replies 0 · Reposts 0 · Likes 3 · Views 77
Yuval Neeman
Yuval Neeman@yuvaln·
100%
aviel@aviel

@gidim And never forget how powerful constraints are at driving innovation. It’s what makes startups well… startups.

Replies 0 · Reposts 0 · Likes 1 · Views 461
Yuval Neeman
Yuval Neeman@yuvaln·
Check out Comet/Opik on GitHub: comet-ml/opik, an open-source end-to-end LLM development platform.
Replies 0 · Reposts 0 · Likes 4 · Views 95
Yuval Neeman reposted
Hemda Arad
Hemda Arad@HemdaArad·
Dear colleagues, I am proud to announce the publication of my book, Relational Psychoanalysis Combined with the EMDR Method: Embodied Experience, Theory, and Clinical Practice, in Hebrew. I hope it serves you in the effort to reach the patients who so badly need every form of psychological support in the challenging time we are living through.
Replies 8 · Reposts 2 · Likes 1 · Views 149
Yuval Neeman reposted
Zuplo
Zuplo@Zuplo·
We’ve made monetizing APIs easier than ever! With Zuplo, you can: 💰 Set up monetization in minutes 💰 Tailor credit systems to meet your needs 💰 Take all the API revenue & so much more! Try Zuplo for free👇 zuplo.com/features/api-m…
Replies 3 · Reposts 4 · Likes 10 · Views 887
Yuval Neeman reposted
Ahmed Fouad Alkhatib
Ahmed Fouad Alkhatib@afalkhatib·
🧵1/17 I just listened to a live Twitter discussion in Arabic hosted by multiple Palestinian dissidents & activists, many from #Gaza, titled: "Gazan perspectives trample the opinions of Twitter Mujahideen." They tore into #Hamas & offered opinions on the war in Gaza. Key points:
Replies 330 · Reposts 2.7K · Likes 8K · Views 2.7M