Dieter Castel

6.5K posts

@DieterCastel

Engineer, ex-@NVISOsecurity, Alumnus CompSci @CW_KULeuven Math/STEAM/ML/@julialangu Enthusiast, Traceur, Multi-Genre music lover. #MathsJam @stadleuven cohost.

Leuven - Belgium Joined February 2014

2.2K Following 418 Followers

Pinned Tweet

Dieter Castel@DieterCastel·18 Dec

My plea for using #privacy friendly communication is now available in 3 languages: Nederlands 🇳🇱, Español 🇪🇸 & English 🇬🇧 dietercastel.com/2019/11/28/res…

English

Dieter Castel retweeted

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)@rao2z·21 Apr

No. Training LLMs on purely factual data STILL WON'T cure them of "Hallucinations" #SundayHarangue There is a persistent myth that LLM hallucinations are just a result of them being trained on un-curated and "non-factual" data, and will go away with high quality/factual data. This misses the basic n-gram structure of LLMs. Yes, the presence of "non-factual" training data does increase the chance of producing "non-factual" completions. But, even if you train LLMs only on factual data (and I will suspend my disbelief for a minute about the impossibility of doing that in a multi-polar world), LLMs can and will still continue to produce completions that are not factual! A simplistic way to visualize it is this: Imagine you have access to 1000 curated Wikipedia documents. Don't you think that by selectively cutting and pasting from those documents, you can generate an inaccurate/not-fully-factual new one? This happens because LLMs are completing the prompt probabilistically conditioned on the training corpus ("approximate retrieval") rather than indexing and retrieving like (the boring and much maligned) databases! (See x.com/rao2z/status/1…; quoted below). The fact that factuality of the training data is not sufficient to avoid hallucinations is demonstrated in multiple ways in the current LLM usage patterns: (1) When you ask an LLM to generate a bio for you, it often combines factual statements with some made-up ones. (2) When you ask an LLM to summarize a given document (in the RAG style) it still can generate an incorrect summary (e.g. the work showing that 50% of book summaries contain factual errors x.com/lefthanddraft/…) (3) When you fine-tune an LLM all LLaMAI-style (e.g. x.com/rao2z/status/1…), it can improve the generation but doesn't completely avoid hallucinated completions. tldr; higher quality training data can improve the quality of completions, but doesn't guarantee factuality as it can't fully eliminate the possibility of hallucination.
In general, the n-gram nature of LLMs makes them inherently "creative" helping them mix and match content/patterns they drew from different parts of the corpora. This is their boon--and also bane. 👉x.com/rao2z/status/1… If factuality/correctness/truth is critical, you have to go LLM-Modulo external verifiers.. arxiv.org/abs/2402.01817 (x.com/rao2z/status/1…)
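The "cutting and pasting from curated documents" picture in the tweet above can be made concrete with a toy model. The sketch below is an illustration only and is not from the thread: it assumes a bigram model (n = 2) over a hypothetical two-sentence corpus, a drastic simplification of the "ultra-large n" models the tweet refers to. Every training sentence is true, yet following the learned word-to-word transitions also yields sentences that are false.

```python
from collections import defaultdict

# Hypothetical "fully factual" training corpus (an assumption for illustration).
corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
]

def build_bigrams(sentences):
    """Map each token to the set of tokens that follow it in the corpus."""
    successors = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            successors[current].add(following)
    return successors

def all_completions(successors, max_len=10):
    """Enumerate every sentence the bigram transitions can produce."""
    results = set()

    def walk(token, path):
        if token == "</s>":
            results.add(" ".join(path))
            return
        if len(path) >= max_len:  # guard against cycles in larger corpora
            return
        for nxt in successors[token]:
            walk(nxt, path if nxt == "</s>" else path + [nxt])

    walk("<s>", [])
    return results

successors = build_bigrams(corpus)
generated = all_completions(successors)
# Sentences the model can emit that were never in the training data.
novel = generated - set(corpus)

for sentence in sorted(novel):
    print(sentence)
```

The two novel sentences splice the start of one true sentence onto the end of the other, which is the mix-and-match behaviour the tweet describes. Real LLMs condition on far longer contexts, so the effect is much weaker per token, but the sketch shows why factual training data alone cannot rule it out.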

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)@rao2z

Pinning down Approximate Retrieval in LLMs (and, in the process making sense of that NYT lawsuit) The term approximate retrieval (that, afaik, I coined to provide a qualitative understanding of what LLMs do c.f. cacm.acm.org/blogs/blog-cac…), has caught on a bit. I will write down what I was trying to capture with the term--both because someone asked for a definition, and because it actually has some bearing on that NYT lawsuit!. 0. The "approximate" here is about whether the "retrieved" text is an unaltered copy of something that was stored (and not about whether the retrieval key is matched approximately) 1. Given that, underneath it all, LLMs are trained to be n-gram models (if only on steroids--aka ultra-large n), it should be rather non-controversial to say that they cannot guarantee exact retrieval. They are just compact models of P(next token|Context window) In other words, with the n-gram model, the prompt is working as a "key" into the CPT rather than a key into any stored database. It is used to sample next token iteratively from the learned CPTs (with the context for the (n+1)th token affected by the specific sample selected as the n-th token!) 1.1 LLMs are not databases--and are not indexing and retrieving exactly matching records without altering them. The closest analogy to index is context and that is changing. There is certainly no stored record being retrieved. 1.2 LLMs are also not IR engines which, while doing similarity search (i.e., allowing approximate match with the key), still guarantee that what they give out is what was stored (IR doesn't make documents--it just retrieves documents that are similar to the query!). [Another way to see this is that if LLMs were just doing IR, then the ChatGPT essays can be caught by the old turnitin-style plagiarism detectors.] 1.2.1. 
The whole RAG rage can be understood as adding an external IR component to LLMs, where the prompt is used as an actual IR query on an external vector DB, and the stuff retrieved is added back into the prompt (hoping that LLM will summarize it..). See x.com/rao2z/status/1… 2. It is precisely this neither-DB-nor-IR nature of the n-gram model that gives LLMs their flexibility--of essentially capturing the distribution (manifold) of the text in the corpus (humorously illustrated by the tooth paste tube 👇metaphor that I had seen somewhere ) 3. Because of the way n-gram models work, there is never any 100% guarantee that some stored record (be it a program or an NY Times article) is retrieved unaltered. So why is NYT suing OpenAI? 3.1 However, with a long enough context window, and the network capacity, something close to memorization (aka "plagiarization") of long passages is very much possible (as is being shown in that NYT lawsuit!). 3.2. Interestingly generative ML systems effectively memorizing full passages/images has been observed in other generative models too--and can be interpreted as a failure to learn the distribution. See for example the old study by @prfsanjeevarora et al. on whether GANs really learn the distribution/manifold or memorize parts of it. arxiv.org/abs/1706.08224 4. Commercial LLM makers (will) try to play both ends of the approximate retrieval to their advantage.. 4.1. When they try to argue NYT lawsuit, they will no doubt push on the fact that LLMs don't do exact retrieval and so there is no copyright infringement. 4.2 When they push LLMs for "search", they will try instead to bank on the memorization capabilities! The truth is that there is no 100% way to guarantee or stop either behavior! If LLM makers try to reduce memorization, they will certainly see that the LLM's ability to masquerade as search engines--already quite questionable (c.f. x.com/rao2z/status/1…) --will degrade even further (c.f. x.com/rao2z/status/1…)

English

193

36.7K

Dieter Castel retweeted

Matt Enlow@CmonMattTHINK·26 Jan

What proportion of quadratics have real roots?

English

125

55.2K

Dieter Castel@DieterCastel·20 Dec

I stumbled upon these lovely @PlutoJL notebooks featured.plutojl.org/#Mathematics and I think they would fit nicely in the @explorables collection as well. :-)

English

109

Dieter Castel@DieterCastel·8 Dec

I'd be pro regulation mandating these being published.

English

Dieter Castel@DieterCastel·8 Dec

The #GeminiAI paper (in)conveniently doesn't mention how long it trained nor the energy usage required. Anyone got more info on that?

English

150

Dieter Castel retweeted

Bart Preneel@bpreneel1·29 Nov

Who would have thought - ChatGPT's heartbleed moment

English

6.2K

Dieter Castel retweeted

Tuta@TutaPrivacy·14 Nov

📢 BREAKING 📢 Historic agreement on #chatcontrol proposal: EU Parliament wants to remove chat control and safeguard secure encryption. 🔒 💪Let's keep pushing for strong privacy rights!👇 tuta.com/blog/chat-cont…

English

181

14.6K

Dieter Castel retweeted

Moshe Vardi@vardi·13 Nov

:-)

ZXX

838

4.3K

276.1K

Dieter Castel retweeted

Prakash@8teAPi·26 Sep

Vicious Self-Degradation > you Google > Quora spots query and id’s as frequent > Quora uses ChatGPT to generate answer > ChatGPT hallucinates > Google picks up Quora answer as highest probability correct answer > ChatGPT hallucination is now canonical Google answer

English

170

2.5K

11.2K

2.4M

Dieter Castel retweeted

Cliff Pickover@pickover·12 Aug

Mathematics. "What’s the area of the toppled square?" (All blocks are squares. The diagram is not to scale. The numbers represent areas of squares.) By Catriona Agg, @Cshearer41, Used with permission.

English

241

72K

Dieter Castel retweeted

LLM Security@llm_sec·10 Jun

* People ask LLMs to write code * LLMs recommend imports that don't actually exist * Attackers work out what these imports' names are, and create & upload them with malicious payloads * People using LLM-written code then auto-add malware themselves vulcan.io/blog/ai-halluc…
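One modest guard against the attack chain described above is to list the imports in generated code that do not resolve in the current environment, so a human can vet them before anything is installed. The sketch below is an assumption-laden illustration, not something from the linked post: the helper name `unresolved_imports` and the package name `hypothetical_llm_suggested_pkg` are made up, and the check only looks at what is installed locally. It says nothing about whether a package of that name on a public index is trustworthy.

```python
import ast
import importlib.util

def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that are not
    importable in the current environment (hypothetical helper)."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import x".
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return sorted(n for n in names if importlib.util.find_spec(n) is None)

# Example: code as an LLM might produce it; the last import is invented.
snippet = """
import os
import json
from hypothetical_llm_suggested_pkg import magic
"""

missing = unresolved_imports(snippet)
print(missing)
```

A name in the returned list is a prompt to look the package up manually (publisher, age, download history), not a cue to install it automatically, since auto-installing is exactly the step the attack relies on.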

English

2.1K

7.4K

1.8M

Dieter Castel@DieterCastel·2 Jun

Two quick questions for betersorteren.be @fostplusnl 1) Do perished rubber bands go in PMD? 2) Are one-sided foil wrappers, like the ones around chocolates, residual waste or PMD? (some foil is allowed in residual waste, other foil has to go in PMD :S)

Dutch

Dieter Castel@DieterCastel·19 May

I keep wondering, @katiesteckles, is the eurosong theme a boon or a hurdle for the MJ target audience? I'm def. ambivalent myself, but maybe you can recall from previous years. :-)

English

Dieter Castel@DieterCastel·19 May

Next week Tuesday @ 20:00 in OPEK café @stadleuven. The monthly #Leuven #MathsJam. See u there? Below a Eurosong theme teaser flyer \/. :-)

English

161

Dieter Castel@DieterCastel·18 May

vrt.be/vrtnws/nl/2023…

ZXX

Dieter Castel@DieterCastel·16 May

@xsteenbrugge The EU is not the one making it impossible. The Big Tech monopoly is... that's imho much more relevant in the field atm than this legislation. @jbaert @thomas_wint ?

English

247

Xander Steenbrugge@xsteenbrugge·15 May

The EU is about to pass legislation that'll make it impossible for generative AI startups to compete with large tech. Knowing how transformative this technology will be, I'm furious these decisions are made by legislators who simply don't fully understand what they are doing..

Jeremy Howard@jeremyphoward

"Any model made available in the EU, without first passing extensive, and expensive, licensing, would subject companies to massive fines of the greater of €20,000,000 or 4% of worldwide revenue. Opensource developers, and hosting services such as GitHub... would be liable"

English

289

133.5K

Dieter Castel@DieterCastel·16 May

@xsteenbrugge There's much to say about the #GDPR, it's far from perfect. But failed? [source needed] It put #privacy on the map in an important way.

English

Xander Steenbrugge@xsteenbrugge·15 May

Just like the failed #GDPR legislation that was supposed to "protect EU citizens from big tech", this will simply create more barriers for innovation and severely impede the capability to build new companies that can compete / be relevant on the global stage. WTF #EU?

English

2.7K

Dieter Castel@DieterCastel·16 May

@xsteenbrugge Honest question: Where's the "small company competition" in ML right now? Even LARGE academic institutions can't compete with big tech atm. There's much more needed for healthy competition imho.

English

Dieter Castel@DieterCastel·16 May

@cyrilzakka @jeremyphoward I think the same goes for many high risk fields. Current gen AI-models are often dangerously brittle. Take a look at my TL for some failures I posted of ChatGPT in December 2022.

English

Dieter Castel retweeted

Cyril Zakka, MD@cyrilzakka·15 May

@jeremyphoward In a way, I welcome this initiative for medical generative AI. The amount of poorly tested models for clinical AI I’ve seen out there is going to cause a lot of harm.

English

24.9K

Jeremy Howard@jeremyphoward·15 May

Technomancers_ai@technomancers

@MeetThePress @ericschmidt Fear reaction to what the EU is about to do. technomancers.ai/eu-ai-act-to-t…

English

216

1.2K

1.6M
