Jonathan Katzy

19 posts

@katzy_jonathan

PhD student, Multilingual LLMs for Software Engineering @serg_delft @AISE_TUDelft

Joined May 2022
51 Following, 40 Followers
Jonathan Katzy retweeted
Ali Al-Kaswan 🍉 (@aalkaswan1)
📣 Exciting competition for the AI & Software Engineering communities! 🎯 Large language models (LLMs) for code are powerful but they often memorize what they’ve seen. When evaluation datasets overlap with training data, reported performance can be misleading
Jonathan Katzy (@katzy_jonathan)
🎯 Impact: Enables reliable LLM performance assessment without the data cleaning overhead. Perfect for researchers who want to focus on evaluation, not data preprocessing! Languages covered: Assembly to Mathematica, C to Haskell, Python to Rust 🌍
Jonathan Katzy (@katzy_jonathan)
🧵Introducing "The Heap" - A contamination-free code dataset for fair LLM evaluation! 🔍 The Problem: 90% of code LLM studies don't deduplicate evaluation data, leading to inflated performance metrics @MalihehIzadi @razvanmp27 @avandeursen Thread 👇
Jonathan Katzy retweeted
Maliheh (Mali) Izadi (@MalihehIzadi)
@katzy_jonathan presenting our #FORGE24 paper on “An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets”! Talk to him if U are interested in the data used for training a wide range of #llm4code and their implications! @ConfForge
Jonathan Katzy (@katzy_jonathan)
We found that even datasets that claim to contain only code from permissively licensed repositories include a large amount of code that is also attributed to strong copyleft licenses. This raises the question of who should be responsible for adherence to code licenses in public online datasets.
Jonathan Katzy (@katzy_jonathan)
We analyzed the prevalence of exact duplicates of code files with strong copyleft licensed code in public code datasets, and searched the comments of code files to identify strong copyleft licenses.
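As a rough illustration of the two checks described in this post (not the paper's actual pipeline), a minimal Python sketch might hash file contents to find exact duplicates and scan comment lines for license names. The marker pattern, the comment prefixes, and all function names here are assumptions made for illustration only.

```python
import hashlib
import re

# Hypothetical strong-copyleft markers; the list used in the paper may differ.
COPYLEFT_PATTERN = re.compile(
    r"GNU (Affero )?General Public License|\bA?GPL(-|v)?[23]", re.IGNORECASE
)

# Assumption: a comment line starts with one of these prefixes.
COMMENT_PREFIXES = ("#", "//", "/*", "*")


def content_hash(source: str) -> str:
    """Return a SHA-256 digest of the exact file contents."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def mentions_copyleft(source: str) -> bool:
    """Check whether any comment line mentions a strong-copyleft license."""
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(COMMENT_PREFIXES) and COPYLEFT_PATTERN.search(stripped):
            return True
    return False


def exact_duplicates(dataset: list[str], copyleft_files: list[str]) -> list[int]:
    """Return indices of dataset files identical to a known copyleft file."""
    known = {content_hash(f) for f in copyleft_files}
    return [i for i, f in enumerate(dataset) if content_hash(f) in known]
```

Hashing only catches files that match character for character; near-duplicates (reformatted or lightly edited files) would need a fuzzier comparison, which this sketch does not attempt.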
Jonathan Katzy (@katzy_jonathan)
Are foundation models vulnerable to lawsuits regarding the data they were trained on? We analyzed the state of LLMs wrt licenses in public code datasets. Check out the preprint: arxiv.org/abs/2403.15230 and discuss the implications with me at FORGE/@ICSEconf in April. @MalihehIzadi
Jonathan Katzy retweeted
AISE-TUDelft (@AISE_TUDelft)
🎆 We're happy to share that 2 full papers from our lab have been accepted at the International Conference on AI Foundation Models and Software Engineering, co-located with @ICSEconf. cc @katzy_jonathan, Razvan, Arie, Tim, Frank, Philippe, Berend, Marc, & @MalihehIzadi #FORGE 1/n
Jonathan Katzy retweeted
Maliheh (Mali) Izadi (@MalihehIzadi)
Heartfelt thanks to Prem @devanbu for spending an enlightening week w @serg_delft & delivering a thought-provoking lecture on quality of code generated by #LLMs in our course (ML4SE'23)! I am sure your expertise added a rich dimension to the learning experience of our students.