Dieter Castel

6.5K posts

@DieterCastel

Engineer, ex-@NVISOsecurity, Alumnus CompSci @CW_KULeuven Math/STEAM/ML/@julialangu Enthusiast, Traceur, Multi-Genre music lover. #MathsJam @stadleuven cohost.

Leuven - Belgium · Joined February 2014
2.2K Following · 418 Followers
Dieter Castel retweeted
Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)
No. Training LLMs on purely factual data STILL WON'T cure them of "Hallucinations" #SundayHarangue

There is a persistent myth that LLM hallucinations are just a result of them being trained on un-curated and "non-factual" data, and will go away with high-quality/factual data. This misses the basic n-gram structure of LLMs. Yes, the presence of "non-factual" training data does increase the chance of producing "non-factual" completions. But even if you train LLMs only on factual data (and I will suspend my disbelief for a minute about the impossibility of doing that in a multi-polar world), LLMs can and will still continue to produce completions that are not factual!

A simplistic way to visualize it is this: imagine you have access to 1000 curated Wikipedia documents. Don't you think that by selectively cutting and pasting from those documents, you can generate an inaccurate/not-fully-factual new one? This happens because LLMs complete the prompt probabilistically, conditioned on the training corpus ("approximate retrieval"), rather than indexing and retrieving like (the boring and much-maligned) databases! (See x.com/rao2z/status/1…; quoted below.)

The fact that factuality of the training data is not sufficient to avoid hallucinations is demonstrated in multiple ways in current LLM usage patterns: (1) When you ask an LLM to generate a bio for you, it often combines factual statements with some made-up ones. (2) When you ask an LLM to summarize a given document (in the RAG style), it can still generate an incorrect summary (e.g. the work showing that 50% of book summaries contain factual errors x.com/lefthanddraft/…). (3) When you fine-tune an LLM à la LIMA (e.g. x.com/rao2z/status/1…), it can improve the generations but doesn't completely avoid hallucinated completions.

tl;dr: higher-quality training data can improve the quality of completions, but doesn't guarantee factuality, as it can't fully eliminate the possibility of hallucination.

In general, the n-gram nature of LLMs makes them inherently "creative", helping them mix and match content/patterns they drew from different parts of the corpora. This is their boon--and also their bane. 👉x.com/rao2z/status/1… If factuality/correctness/truth is critical, you have to go LLM-Modulo, with external verifiers.. arxiv.org/abs/2402.01817 (x.com/rao2z/status/1…)
Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)@rao2z

Pinning down Approximate Retrieval in LLMs (and, in the process, making sense of that NYT lawsuit)

The term approximate retrieval (that, afaik, I coined to provide a qualitative understanding of what LLMs do, c.f. cacm.acm.org/blogs/blog-cac…) has caught on a bit. I will write down what I was trying to capture with the term--both because someone asked for a definition, and because it actually has some bearing on that NYT lawsuit!

0. The "approximate" here is about whether the "retrieved" text is an unaltered copy of something that was stored (and not about whether the retrieval key is matched approximately).

1. Given that, underneath it all, LLMs are trained to be n-gram models (if only on steroids--aka ultra-large n), it should be rather non-controversial to say that they cannot guarantee exact retrieval. They are just a compact model of P(next token | context window). In other words, with the n-gram model, the prompt is working as a "key" into the CPT rather than a key into any stored database. It is used to sample the next token iteratively from the learned CPTs (with the context for the (n+1)th token affected by the specific sample selected as the n-th token!).

1.1 LLMs are not databases--they are not indexing and retrieving exactly matching records without altering them. The closest analogy to an index is the context, and that is changing. There is certainly no stored record being retrieved.

1.2 LLMs are also not IR engines which, while doing similarity search (i.e., allowing approximate match with the key), still guarantee that what they give out is what was stored (IR doesn't make documents--it just retrieves documents that are similar to the query!). [Another way to see this is that if LLMs were just doing IR, then ChatGPT essays could be caught by the old turnitin-style plagiarism detectors.]

1.2.1 The whole RAG rage can be understood as adding an external IR component to LLMs, where the prompt is used as an actual IR query on an external vector DB, and the stuff retrieved is added back into the prompt (hoping that the LLM will summarize it..). See x.com/rao2z/status/1…

2. It is precisely this neither-DB-nor-IR nature of the n-gram model that gives LLMs their flexibility--of essentially capturing the distribution (manifold) of the text in the corpus (humorously illustrated by the toothpaste tube 👇metaphor that I had seen somewhere).

3. Because of the way n-gram models work, there is never any 100% guarantee that some stored record (be it a program or an NY Times article) is retrieved unaltered. So why is NYT suing OpenAI?

3.1 However, with a long enough context window and the network capacity, something close to memorization (aka "plagiarization") of long passages is very much possible (as is being shown in that NYT lawsuit!).

3.2 Interestingly, generative ML systems effectively memorizing full passages/images has been observed in other generative models too--and can be interpreted as a failure to learn the distribution. See, for example, the old study by @prfsanjeevarora et al. on whether GANs really learn the distribution/manifold or memorize parts of it: arxiv.org/abs/1706.08224

4. Commercial LLM makers (will) try to play both ends of approximate retrieval to their advantage..

4.1 When they argue the NYT lawsuit, they will no doubt push on the fact that LLMs don't do exact retrieval and so there is no copyright infringement.

4.2 When they push LLMs for "search", they will instead try to bank on the memorization capabilities!

The truth is that there is no 100% way to guarantee or stop either behavior! If LLM makers try to reduce memorization, they will certainly see that LLMs' ability to masquerade as search engines--already quite questionable (c.f. x.com/rao2z/status/1…)--will degrade even further (c.f. x.com/rao2z/status/1…)
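[Editor's note] The "cutting and pasting from factual documents" argument in this thread can be made concrete with a toy model. The sketch below uses a hypothetical two-sentence corpus (my example, not from the thread): it learns bigram successor sets, a crude stand-in for P(next token | context), and shows that a sentence that is false and appears nowhere in the corpus still gets nonzero probability.

```python
from collections import defaultdict

# Toy "fully factual" training corpus: both sentences are true.
corpus = [
    "alan turing was born in london".split(),
    "alan turing died in wilmslow".split(),
]

# Learn bigram successor sets: which token can follow which.
successors = defaultdict(set)
for sent in corpus:
    for prev, nxt in zip(sent, sent[1:]):
        successors[prev].add(nxt)

def possible(sentence: str) -> bool:
    """True if the bigram model assigns nonzero probability to the sentence."""
    toks = sentence.split()
    return all(nxt in successors[prev] for prev, nxt in zip(toks, toks[1:]))

# A recombined, non-factual sentence the model can happily emit,
# even though it never occurs in the (factual) training data:
claim = "alan turing was born in wilmslow"
assert claim.split() not in corpus
assert possible(claim)
```

Real LLMs are vastly larger conditional models over ultra-long contexts, but the failure mode is the same shape: locally plausible transitions can stitch together a globally false completion.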

Dieter Castel retweeted
Matt Enlow
Matt Enlow@CmonMattTHINK·
What proportion of quadratics have real roots?
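[Editor's note] The answer depends entirely on how the coefficients are sampled, which is presumably the point of the puzzle. One common toy model (my assumption, not stated in the tweet) draws a, b, c independently and uniformly from [-1, 1] and asks how often the discriminant b² − 4ac is non-negative; a minimal Monte Carlo sketch:

```python
import random

def real_root_fraction(n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(b^2 - 4ac >= 0) for a, b, c ~ U(-1, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        a = rng.uniform(-1, 1)
        b = rng.uniform(-1, 1)
        c = rng.uniform(-1, 1)
        if b * b - 4 * a * c >= 0:  # non-negative discriminant => real roots
            hits += 1
    return hits / n
```

Under this model the estimate lands near 0.627 (the exact value works out to 41/72 + ln(2)/12); other coefficient distributions give different answers.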
Dieter Castel
Dieter Castel@DieterCastel·
I'd be pro regulation mandating that these be published.
Dieter Castel
Dieter Castel@DieterCastel·
The #GeminiAI paper (in)conveniently doesn't mention how long it trained nor the energy usage required. Anyone got more info on that?
Dieter Castel retweeted
Bart Preneel
Bart Preneel@bpreneel1·
Who would have thought - ChatGPT's heartbleed moment
Dieter Castel retweeted
Tuta
Tuta@TutaPrivacy·
📢 BREAKING 📢 Historic agreement on #chatcontrol proposal: EU Parliament wants to remove chat control and safeguard secure encryption. 🔒 💪Let's keep pushing for strong privacy rights!👇 tuta.com/blog/chat-cont…
Dieter Castel retweeted
Moshe Vardi
Moshe Vardi@vardi·
:-)
Dieter Castel retweeted
Prakash
Prakash@8teAPi·
Vicious Self-Degradation > you Google > Quora spots query and id’s as frequent > Quora uses ChatGPT to generate answer > ChatGPT hallucinates > Google picks up Quora answer as highest probability correct answer > ChatGPT hallucination is now canonical Google answer
Dieter Castel retweeted
Cliff Pickover
Cliff Pickover@pickover·
Mathematics. "What’s the area of the toppled square?" (All blocks are squares. The diagram is not to scale. The numbers represent areas of squares.) By Catriona Agg, @Cshearer41, Used with permission.
Dieter Castel retweeted
LLM Security
LLM Security@llm_sec·
* People ask LLMs to write code * LLMs recommend imports that don't actually exist * Attackers work out what these imports' names are, and create & upload them with malicious payloads * People using LLM-written code then auto-add malware themselves vulcan.io/blog/ai-halluc…
Dieter Castel
Dieter Castel@DieterCastel·
Two quick questions for betersorteren.be @fostplusnl 1) Do worn-out rubber bands go in PMD? 2) Do single-sided foil wrappers, like the ones on chocolates, go in residual waste or PMD? (some foil is allowed in residual waste, other foil has to go in PMD :S)
Dieter Castel
Dieter Castel@DieterCastel·
I keep wondering, @katiesteckles, is the eurosong theme a boon or a hurdle for the MJ target audience? I'm def. ambivalent myself, but maybe you can recall from previous years. :-)
Dieter Castel
Dieter Castel@DieterCastel·
@xsteenbrugge The EU is not the one making it impossible. The Big Tech monopoly is... that's imho much more relevant in the field atm than this legislation. @jbaert @thomas_wint ?
Xander Steenbrugge
Xander Steenbrugge@xsteenbrugge·
The EU is about to pass legislation that'll make it impossible for generative AI startups to compete with large tech. Knowing how transformative this technology will be, I'm furious these decisions are made by legislators who simply don't fully understand what they are doing..
Jeremy Howard@jeremyphoward

"Any model made available in the EU, without first passing extensive, and expensive, licensing, would subject companies to massive fines of the greater of €20,000,000 or 4% of worldwide revenue. Opensource developers, and hosting services such as GitHub... would be liable"

Dieter Castel
Dieter Castel@DieterCastel·
@xsteenbrugge There's much to say about the #GDPR, it's far from perfect. But failed? [source needed] It put #privacy on the map in an important way.
Xander Steenbrugge
Xander Steenbrugge@xsteenbrugge·
Just like the failed #GDPR legislation that was supposed to "protect EU citizens from big tech", this will simply create more barriers for innovation and severely impede the capability to build new companies that can compete / be relevant on the global stage. WTF #EU?
Dieter Castel
Dieter Castel@DieterCastel·
@xsteenbrugge Honest question: Where's the "small company competition" in ML right now? Even LARGE academic institutions can't compete with big tech atm. There's much more needed for healthy competition imho.
Dieter Castel
Dieter Castel@DieterCastel·
@cyrilzakka @jeremyphoward I think the same goes for many high-risk fields. Current-gen AI models are often dangerously brittle. Take a look at my TL for some failures I posted of ChatGPT in December 2022.
Dieter Castel retweeted
Cyril Zakka, MD
Cyril Zakka, MD@cyrilzakka·
@jeremyphoward In a way, I welcome this initiative for medical generative AI. The amount of poorly tested models for clinical AI I’ve seen out there is going to cause a lot of harm.
Jeremy Howard
Jeremy Howard@jeremyphoward·
"Any model made available in the EU, without first passing extensive, and expensive, licensing, would subject companies to massive fines of the greater of €20,000,000 or 4% of worldwide revenue. Opensource developers, and hosting services such as GitHub... would be liable"
Technomancers_ai@technomancers

@MeetThePress @ericschmidt Fear reaction to what the EU is about to do. technomancers.ai/eu-ai-act-to-t…
