OpenMined

690 posts

@openminedorg

We're building open-source tech that helps app builders & researchers get answers from data without direct access to it. Join us on slack → https://t.co/Vuk24CYYnZ

Joined August 2017
0 Following · 10.4K Followers
OpenMined retweeted
Dawn Chen @dawnchenx ·
Excited to share our preprint on BioVault – an open-source, privacy-first platform for global biomedical collaboration using data visitation. 🌍 🔗 biovault.net 📄 biorxiv.org/content/10.648…
2 replies · 7 reposts · 19 likes · 1.3K views
OpenMined retweeted
⿻ Andrew Trask @iamtrask ·
I've just drafted a new blogpost "GPU demand is (~1Mx) distorted by efficiency problems which are being solved"

Mid-2024, Andrej Karpathy trained GPT-2 for $20. Six months later, Andreessen Horowitz reported LLM costs falling 10x annually. Two months after that, DeepSeek shocked markets with radical reductions in training and inference requirements.

For AI researchers, this is all good news. For executives, policymakers, and investors forecasting GPU demand... less so. Many were caught off guard.

The problem isn't that executives / policymakers / investors lacked access to information per se… it's that the technical/non-technical divide prevents them from seeing the difference between waste-based GPU demand and fundamental GPU demand. Meanwhile, tech experts like Karpathy, a16z, and DeepSeek understand fundamental principles which are easy to overlook if you're not implementing the algorithms yourself. But in presenting their results as merely "AI progress", they buried the lede…

The Lede: If version X of an algorithm achieves the same result as version X-1 at 1/10th the compute cost, what exactly were we paying for in version X-1?

The answer has profound implications for anyone forecasting future GPU demand: version X-1 was roughly 90% waste. And a16z's report, Karpathy's achievement, and DeepSeek's breakthrough indicate this isn't a single 12-month event… it's a multi-year pattern. Version X-1 was 90% waste. Version X-2 was 99% waste. Version X-3...

Wait… Leading AI labs allow waste? The obvious question: if this waste exists at such scale, wouldn't the labs building these systems have eliminated it already?

They are eliminating it. That's what the 10x annual cost reduction represents. While hardware cost reduction accounts for some of the annual efficiency gain, software updates from AI labs constitute the vast majority… an ~86% efficiency gain annually.

The puzzle isn't whether labs are optimising… clearly they are. The puzzle is why so much waste existed to eliminate in the first place... and how much remains. ... (link on profile page)
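To make the waste arithmetic in the thread above concrete, here is a minimal sketch (my own illustration, not from the post): if each algorithm generation matches the previous one's result at 1/10th the compute, the share of an older generation's compute that newer methods avoid is 1 - (1/10)^k.

```python
# Illustrative only: waste share implied by a 10x-per-generation efficiency gain,
# matching the "X-1 was 90% waste, X-2 was 99% waste" figures in the thread.

def waste_fraction(generations_behind: int, gain_per_generation: float = 10.0) -> float:
    """Fraction of an older generation's compute that newer methods make unnecessary."""
    return 1.0 - (1.0 / gain_per_generation) ** generations_behind

for k in range(1, 4):
    print(f"version X-{k}: {waste_fraction(k):.1%} waste")
# version X-1: 90.0% waste
# version X-2: 99.0% waste
# version X-3: 99.9% waste
```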
31 replies · 41 reposts · 356 likes · 69.8K views
OpenMined retweeted
Foresight Institute @foresightinst ·
We are very excited to announce our amazing speaker line-up for Vision Weekend! Join these field-leading researchers and builders as we explore the frontiers of neurotech, biotech, AI, security, space, and energy! Get tickets: foresight.org/events/vision-…

Speakers:
• Ed Boyden (Boyden Lab) @eboyden3
• Viren Jain (@Google) @stardazed0
• Chiara Marletto (@UniofOxford)
• Laura Deming (@untillabs) @LauraDeming
• Alan Mardinly (Science) @mardinly
• Andrew Trask (@openminedorg) @iamtrask
• Liv Boeree (Win-Win Podcast) @Liv_Boeree
• Cate Hall (@AsteraInstitute) @catehall
• Adam Brown (@GoogleDeepMind & @Stanford)
• Greg Wayne (Google DeepMind)
• Joe Betts-LaCroix (@RetroBio_) @bettslacroix
• Andrew Payne (@E11BIO) @Andrew_C_Payne
• Ariel Ekblaw (@aurelia_labs) @ariel_ekblaw
• Eli Dourado (@AsteraInstitute) @elidourado
• Gwern Branwen (gwern.net)
• Adam Goldstein (Softmax) @adamjgoldstein
• Erika Alden DeBenedictis (@Pioneer__Labs) @erika_alden_d
• John Hallman (@OpenAI) @johnohallman
• Joshua Elliott (@RenPhil21)
• Juan Benet (@protocollabs) @juanbenet
• Matthew Cullinen (HSBC)
• Sean Escola (Protocol Labs/ARNI)
• Steve Jurvetson (Future Ventures) @FutureJurvetson
• Molly MacKinlay (Protocol Labs) @momack28
• Anastasia Gamick (@Convergent_FROs) @AGamick
• Ela Madej (@fiftyyears) @elamadej
• Brandon Goldman (@LionheartVC) @BrandonGoldman
• Ant Rowstron (@ARIA_research) @rowstron
4 replies · 7 reposts · 30 likes · 31.8K views
OpenMined retweeted
⿻ Andrew Trask @iamtrask ·
IMO — Ilya is wrong
- Frontier LLMs are trained on ~200 TBs of text
- There's ~200 Zettabytes of data out there
- That's about 1 billion times more data
- It doubles every 2 years

The problem is the data is private. Can't scrape it. The problem is not data scarcity, it's data access. The solution is attribution-based control (article below): "Unlocking a Million Times More Data For AI"
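A quick back-of-the-envelope check of the ratio above; the byte figures are just the round numbers quoted in the tweet, and the snippet is illustrative only.

```python
# Rough sanity check of the "1 billion times more data" claim (illustrative only).
frontier_pretraining_text = 200e12   # ~200 TB of text, per the tweet
global_stored_data = 200e21          # ~200 ZB of data "out there", per the tweet

ratio = global_stored_data / frontier_pretraining_text
print(f"~{ratio:.0e}x more data than current pre-training corpora")  # ~1e+09, i.e. about a billion
```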
Andrew Curran@AndrewCurran_

Ilya Sutskever made a rare appearance at NeurIPS. He said the internet is the fossil fuel of AI, that we are at peak data, and that 'Pre-training as we know it will unquestionably end'.

137 replies · 79 reposts · 990 likes · 267.5K views
OpenMined @openminedorg ·
Want to demo/play around with new analysis tech? Are you an academic using AI in your data analysis? We're building open-source tools that make it possible to use private and unpublished data in AI workflows. Help us help you (5 min) bit.ly/3VPxBPp #AI #OpenScience #AcademicTwitter
1 reply · 3 reposts · 17 likes · 3.6K views
OpenMined retweeted
⿻ Andrew Trask @iamtrask ·
IMO — Decentralized AI is more than:
- an AI model in the sky, with good external auditing
- an AI model in the sky, which people vote on how to use
- an AI model in the sky, which is free for anyone to use
- open source AI
- federated training

None of these are truly an interface to the world's collective intelligence. Each is actually... *mostly* centralized AI... but with the right ambitions!!!

In this podcast, I lay out what I think a true decentralized AI ecosystem looks like, and my guesses on how to get there. The key use-case is broad listening (video below describes broad listening) (link to full podcast in reply)
6 replies · 9 reposts · 70 likes · 64.9K views
OpenMined retweeted
⿻ Andrew Trask @iamtrask ·
Genuine breakthrough in hallucination detection UX, but the fine-tuning approach repeats the exact flaw that creates hallucinations. But that's fixable — which makes me optimistic the hallucination problem is solvable w/ 3 ingredients:
1) take this UX breakthrough
2) combine it with attribution breakthroughs
3) layer over the right cryptography tech

IMO - that starts to look like a real solution to the hallucination problem. I mean there's work to do, but that looks plausible to me because there's at least *some* way to address the main sub-problems of hallucinations.

The central problem of hallucinations is pretty easy to understand: users don't get to choose which pre-training data sources are combined into which tokens (and with what weighting). If they did, AI users could detect and steer AI models around hallucinations pretty easily. Why? Take an example. Let's say you prompted:

Prompt: "Who is Kim Kardashian's boyfriend?"

FAILURE MODE 1: wrong documents (prompt missed context)
If you could observe the LLM starting to use pre-training documents titled:
- Kim's Teenage Years
- How Kim Rose to Fame
- ...
you could suspect it's about to hallucinate... because it's not indexing into documents from today.

FAILURE MODE 2: never heard of Kim Kardashian
You might start seeing the LLM using pre-training documents titled:
- Disney's Kim Possible vs Ron Stoppable
- Kimmy Schmidt is Live on Saturday Night
- ...
the LLM would still say something grammatical and confident... but it's clearly not focusing on the same thing.

FAILURE MODE 3: someone poisoned your data
If you start seeing the LLM index into pre-training documents from sources that are questionable:
- kardashianfanfiction.com
- theonion.com

If you could see this metadata behind EVERY token an LLM produces - hallucinations would become pretty hard. Every user could ensure their predictions are only coming from documents/sources they trust.

Now... why can't you normally see/control what documents from an LLM's pre-training are informing the current tokens? Well, that's a complex topic. But it comes down to the fact that attribution data gets erased during training. Where does it get erased? It gets erased during ADDITION!!

Consider the difference between addition and concatenation:

ADDITION:
2 + 3 = 5
1 + 4 = 5

CONCATENATION:
2 + 3 = 23
1 + 4 = 14

When you add two numbers, you inadvertently erase ANY signal about the original source numbers which created it. You have NO idea what numbers were used to create the number 5... just by looking at the 5. But with concatenation... totally different story. The resulting number (e.g. 23) reveals loads of information about what numbers were used to create it!

Ok, let's return to the problem of attribution... why can't you tell which documents are informing which AI tokens? It's because addition is all over the place... two places in particular. When you train an AI model, each weight update is addition, so the influence of different documents gets smeared across all the weights. And of course, when you make a prediction, every matrix multiplication is full of additions and multiplications (both of which have this same property of erasing source information... although multiplication tends to leave more traces).

So how do we solve hallucinations? We need to replace additions with enough concatenations that we can see which datapoints contribute to which other ones. This might sound like a really revolutionary concept... but it's actually super ordinary.
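A minimal sketch of the addition-vs-concatenation point above (my own illustration; the source labels are made up): a plain sum erases which sources contributed, while keeping tagged contributions side by side preserves attribution.

```python
# Illustrative only: why summing erases provenance while "concatenating" keeps it.
contributions_a = [("doc_kim_2024", 2), ("doc_celeb_news", 3)]   # hypothetical sources
contributions_b = [("doc_fanfic", 1), ("doc_grammar", 4)]

# ADDITION: both collapse to the same number; the sources are unrecoverable.
print(sum(v for _, v in contributions_a))  # 5
print(sum(v for _, v in contributions_b))  # 5

# CONCATENATION: keep the tagged pieces side by side; attribution survives.
print(contributions_a)  # [('doc_kim_2024', 2), ('doc_celeb_news', 3)]
print(contributions_b)  # [('doc_fanfic', 1), ('doc_grammar', 4)]
```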
Consider a few examples:
- RAG: keep data concatenated in a database
- Mixture of Experts: train separate sub-models and pick which model you want at inference time
- Model Ensembling: like MoE but simpler
- Model Merging: did you know there's whole paradigms for merging models losslessly? (git-rebasin is insane!!)

This is why things like RAG are helping with hallucinations. They increase the concatenation-to-addition ratio. So you can imagine a world where:
- RAG keeps data separate
- Mixture-of-Experts has 1 expert per data source
- You can ensemble models from different sources
- You can on-the-fly merge models from sources you trust for a specific prompt

Keep in mind... several of these techniques are already in production in the SOTA models...

Ok... continuing on... we still have some problems we need to solve to help with hallucinations... Even if you use these techniques... the model still outputs tokens without any metadata about whose data is causing them.

This is where dual-number systems, and sensitivity systems from cryptography, are really quite powerful. Things like differential privacy can calculate "if I modify the input to a function... how much will the output change?" It turns out... that's the key problem we need to solve for hallucinations. You need to know... "if I removed this pre-training datapoint... how much would the output token prediction change?" If you know the answer to that... you can detect/stop hallucinations.

And sensitivity tracking systems like differential privacy can do this (specifically... individual differential privacy...). The problem they usually face is the computational complexity gets INSANE when you do this for highly non-linear functions. But this is where the RAG/MoE/etc. stuff comes in... it linearizes the relationship between input sources and the final prediction... making sensitivity tracking computationally tractable.

But all of this is irrelevant unless we can actually empower end-users to know and control which sources they're using to make predictions (i.e. full "Attribution-based control" or ABC). And this is why the paper I'm quote-tweeting is so exciting. They've got the right INTERFACE. Users need to be able to see, at the token level, highlighting which indicates which sources are informing which predictions. The underlying deep learning + cryptography tech is there (RAG/MoE + sensitivity tracking), the interface was a major missing piece. It's an exciting time for people working on hallucinations.

Ok to wrap up... this means you could prompt something like:

Prompt: Who is Kim Kardashian's boyfriend?

and then you'd get token-by-token highlights which give you the % that each token is being informed by different documents/sources from the pre-training data. That's a powerful interface.

Anyway... this is what makes me so optimistic that hallucinations can be solved. Few deep learning tweaks, little bit of cryptography... of course there's still some engineering to do to get there... exciting times! For more on attribution-based control - see the link below.
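Here is a minimal, hypothetical sketch of the leave-one-out sensitivity idea described above, assuming a toy RAG-style setup where the final prediction score is a linear combination of per-source contributions (all names and numbers are made up; this is not OpenMined's or the quoted paper's implementation):

```python
# Hypothetical leave-one-out attribution over a linear, RAG-style combination.
# Because the combination is linear, removing one source is cheap to evaluate.
sources = {
    "kim_kardashian_2025_news": 0.62,   # made-up per-source contribution to the token score
    "celebrity_gossip_archive": 0.25,
    "kardashianfanfiction.com": 0.13,
}

total = sum(sources.values())

for name, contribution in sources.items():
    without = total - contribution   # score if this source were removed from the mix
    sensitivity = total - without    # equals the contribution, since the combination is linear
    print(f"{name}: removing it changes the score by {sensitivity:.2f} ({sensitivity/total:.0%})")
```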
Oscar Balcells Obeso@OBalcells

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

6 replies · 8 reposts · 95 likes · 15.5K views
OpenMined retweeted
⿻ Andrew Trask @iamtrask ·
IMO — this paper misses the core driver of hallucinations.

An LLM with a billion neurons is like a billion tiny databases, one database per neuron. When you prompt it, the LLM looks in all the databases (i.e. neurons) for patterns it recognizes.

For example, when you prompt "Kim Kardashian is dating ..." the LLM looks in its billions of little hash tables and pulls out patterns:
- vocabulary (words like Kim, instagram, etc.)
- grammar (subject -> verb -> object)
- semantics (Kim Kardashian's known associates)

But here's the problem.... when you prompt it for something unfamiliar, the LLM still recognizes some patterns (e.g. good grammar):
- vocabulary (words like Kim, instagram, etc.)
- grammar (subject -> verb -> object)

But if it doesn't find all the right cache entries:
- semantics (Kim Kardashian's known associates)
- date ranges (maybe she dated different people at different times)

Then the LLM will make next-token predictions based on the hash-hits it found... but without the benefit of the hash-misses it lacks.

So to return to the prompt: "Kim Kardashian is dating ..."
- Grammar patterns: the next token will be a noun
- Semantic patterns: the next token will be a first name (because "is dating" is usually followed by a name)
- Gender pattern: the next token will be a male
- Relationship patterns: the next token will be a male Kim is associated with a lot

... but if it can't find the hash-hit in its internal neurons for the SPECIFIC male she's dating... it can hit on other things.... like:
- generic male names
- males who appear in articles with Kim
- other grammatically correct words like "no-one"

We call this a hallucination, but IMO it's closer to a cache miss.

So how do you solve hallucination? This paper from OpenAI suggests that we solve hallucination by putting "I don't know" in a bunch of the databases. But this isn't how you solve for cache misses — this is just how you create more cache hits of a certain type.

If you had a database which was returning erroneous results, would you *fill* the database with "I don't know" entries???... On the one hand, that WOULD increase the chances that the erroneous result was "I don't know"... so you'd make some partial progress at a surface level. But IMO it's not solving the underlying problem... which is closer to detecting the sources/datapoints used for each prediction (MoE, RAG, etc. are making progress on this).

IMO - a more fundamental solution would involve solving attribution-based control (link below)
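A toy sketch of the cache-miss framing above (entirely my own illustration with made-up entries, not how an LLM actually stores knowledge): when the specific fact is missing, the lookup still returns something fluent from the generic patterns it did find.

```python
# Toy "cache miss" illustration: specific facts vs. generic fallback patterns.
specific_facts = {
    "capital of France": "Paris",
    # note: no entry for "Kim Kardashian's current boyfriend"
}
generic_patterns = ["a plausible first name", "a male who appears in articles with Kim"]

def answer(query: str) -> str:
    hit = specific_facts.get(query)   # cache hit: grounded answer
    if hit is not None:
        return hit
    return generic_patterns[0]        # cache miss: fluent but ungrounded guess

print(answer("capital of France"))                   # Paris
print(answer("Kim Kardashian's current boyfriend"))  # a plausible first name (a "hallucination")
```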
Ethan Mollick@emollick

Paper from OpenAI says hallucinations are less a problem with LLMs themselves & more an issue with training on tests that only reward right answers. That encourages guessing rather than saying “I don’t know” If this is true, there is a straightforward path for more reliable AI.

47 replies · 81 reposts · 726 likes · 100K views
OpenMined retweeted
⿻ Andrew Trask @iamtrask ·
A different take — when LLMs allow people to summarise (more or less) infinite amounts of content, attention will cease to be the bottleneck it once was.

The attention economy is an imbalance of two things:
- broad-casting scale: 1 person can talk to 1 million
- broad-listening scale: we each mostly listen to 1 person at a time

This is the problem of "information overload"... BUT.... LLMs can enable you to summarise millions of pieces of content into overall vibes/summaries/reports/etc.

LLMs are the beginning of the end of the attention economy.
Andrej Karpathy@karpathy

I often rant about how 99% of attention is about to be LLM attention instead of human attention. What does a research paper look like for an LLM instead of a human? It’s definitely not a pdf. There is huge space for an extremely valuable “research app” that figures this out.

16 replies · 5 reposts · 83 likes · 17.8K views
OpenMined @openminedorg ·
How can the UK deliver a National Data Library that actually works? Join us at London Data Week for a hands-on event built for technologists, policy thinkers & public data advocates. 🗓 10 July 📍 UCL East 🎟 bit.ly/4lCIejH 🧵↓
2 replies · 2 reposts · 15 likes · 3.3K views
OpenMined @openminedorg ·
We just dropped a new FL library to support researchers and organizations grappling with Federated Learning projects. Syft_Flwr combines Flower’s flexibility with the privacy-preserving networking capabilities of SyftBox. Links in the thread ↓
1 reply · 6 reposts · 31 likes · 7.2K views
OpenMined @openminedorg ·
Demo time! – Learn how to enable secure, privacy-preserving data access. Links in the thread ↓
3 replies · 2 reposts · 10 likes · 1.9K views
OpenMined @openminedorg ·
📆 Unlocking Private Data for AI: Join OpenMined’s Masterclass during #NYTechWeek 🔗 Link in the thread ↓
1 reply · 2 reposts · 12 likes · 1.9K views