Jonathan Katzy

19 posts

@katzy_jonathan

PhD student, Multilingual LLMs for Software Engineering @serg_delft @AISE_TUDelft

Joined May 2022

51 Following · 40 Followers

Jonathan Katzy retweeted

FSE 2026 @FSEconf · Jan 24

Closed-source LLMs hide training data. The Poisoned Chalice competition asks: was this code file in the model’s training set? Build membership-inference methods, compete, and present at FSE/AIWare 2026 (Montréal). razvain.github.io/miacomp/ #LLM4Code #FSE2026 #MembershipInference

Jonathan Katzy retweeted

Ali Al-Kaswan 🍉 @aalkaswan1 · Jan 9

📣 Exciting competition for the AI & Software Engineering communities! 🎯 Large language models (LLMs) for code are powerful but they often memorize what they’ve seen. When evaluation datasets overlap with training data, reported performance can be misleading

Jonathan Katzy @katzy_jonathan · Jul 14

📄 Paper: ieeexplore.ieee.org/abstract/docum… 📊 Dataset: huggingface.co/datasets/AISE-…

Jonathan Katzy @katzy_jonathan · Jul 14

🎯 Impact: Enables reliable LLM performance assessment without the data cleaning overhead. Perfect for researchers who want to focus on evaluation, not data preprocessing! Languages covered: Assembly to Mathematica, C to Haskell, Python to Rust 🌍

Jonathan Katzy @katzy_jonathan · Jul 14

🧵Introducing "The Heap" - A contamination-free code dataset for fair LLM evaluation! 🔍 The Problem: 90% of code LLM studies don't deduplicate evaluation data, leading to inflated performance metrics @MalihehIzadi @razvanmp27 @avandeursen Thread 👇

Jonathan Katzy retweeted

PROMISE Conference @promise_conf · Jun 26

Great insights from @katzy_jonathan! Jonathan Katzy from @serg_delft @AISE_TUDelft just presented "A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics". Exciting work on #LLM #CodeComments #SoftwareEngineering

Jonathan Katzy retweeted

Maliheh (Mali) Izadi @MalihehIzadi · Apr 14

@katzy_jonathan presenting our #FORGE24 paper on “An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets”! Talk to him if U are interested in the data used for training a wide range of #llm4code and their implications! @ConfForge

Jonathan Katzy @katzy_jonathan · Mar 25

We found that even datasets that claim to contain only code from permissively licensed repositories have a large amount of code that is also attributed to strong copyleft licenses. This raises the question of who should be responsible for adherence to code licenses in public online datasets.

Jonathan Katzy @katzy_jonathan · Mar 25

We analyzed the prevalence of exact duplicates of code files with strong copyleft licensed code in public code datasets, and searched the comments of code files to identify strong copyleft licenses.
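
The two checks described in this post can be illustrated with a minimal sketch. It is not the pipeline from the paper: the hashing scheme, the comment heuristic, the license pattern, and all function names are assumptions chosen for illustration only.

```python
import hashlib
import re

# Hypothetical, deliberately tiny pattern; real license detection needs a
# far more complete list of license names and identifiers.
COPYLEFT_PATTERN = re.compile(
    r"GNU (Affero )?General Public License|\bA?GPL\b", re.IGNORECASE
)


def content_hash(source: str) -> str:
    """Hash the exact file contents; any character difference changes the digest."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def comment_lines(source: str) -> list[str]:
    """Return lines starting with a common comment marker (a rough heuristic)."""
    markers = ("#", "//", "/*", "*")
    stripped = (line.strip() for line in source.splitlines())
    return [line for line in stripped if line.startswith(markers)]


def mentions_copyleft(source: str) -> bool:
    """True if any comment line matches the assumed copyleft pattern."""
    return any(COPYLEFT_PATTERN.search(line) for line in comment_lines(source))


def exact_duplicates(dataset_files, copyleft_files):
    """Return dataset files whose contents exactly match a known copyleft file."""
    copyleft_hashes = {content_hash(f) for f in copyleft_files}
    return [f for f in dataset_files if content_hash(f) in copyleft_hashes]
```

Exact hashing only catches byte-identical copies, so a file with a single changed character would go undetected; that limitation is inherent to the exact-duplicate approach sketched here.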

Jonathan Katzy @katzy_jonathan · Mar 25

Are foundation models vulnerable to lawsuits regarding the data they were trained on? We analyzed the state of LLMs wrt licenses in public code datasets. Check out the preprint: arxiv.org/abs/2403.15230 and discuss the implications with me at FORGE/@ICSEconf in April. @MalihehIzadi

Jonathan Katzy retweeted

AISE-TUDelft @AISE_TUDelft · Feb 22

@ICSEconf @katzy_jonathan @MalihehIzadi resulted in identifying 38M exact duplicates in our strong copyleft dataset. This indicates the pervasive issue of #license inconsistencies in #LLMs4Code. We call upon the community to prioritize developing and adopting best practices for #legal & #ethical dataset creation... 3/n

Jonathan Katzy retweeted

AISE-TUDelft @AISE_TUDelft · Feb 22

@ICSEconf @katzy_jonathan @MalihehIzadi 1. "An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets" 📊 w/ @katzy_jonathan, Razvan Popescu, Arie van Deursen, & @MalihehIzadi Our deep dive into 514 million code files used for training various #LLMs trained on code... 2/n

Jonathan Katzy retweeted

AISE-TUDelft @AISE_TUDelft · Feb 22

🎆 We're happy to share that 2 full papers from our lab have been accepted at the International Conference on AI Foundation Models and Software Engineering, co-located with @ICSEconf cc @katzy_jonathan, Razvan, Arie, Tim, Frank, Philippe, Berend, Marc, & @MalihehIzadi #FORGE 1/n

Jonathan Katzy retweeted

Maliheh (Mali) Izadi @MalihehIzadi · Dec 21

At @AISE_TUDelft, we're happy to wrap up the year with 2 papers accepted at @ICSEconf (ResearchTrack)🎉 -Language Models for Code Completion: A Practical Evaluation -Traces of Memorisation in #LLMs for Code @aalkaswan1, @katzy_jonathan, @timvandamdev, Marc, Razvan & @avandeursen

Jonathan Katzy retweeted

Maliheh (Mali) Izadi @MalihehIzadi · Sep 7

Heartfelt thanks to Prem @devanbu for spending an enlightening week w @serg_delft & delivering a thought-provoking lecture on quality of code generated by #LLMs in our course (ML4SE'23)! I am sure your expertise added a rich dimension to the learning experience of our students.

Jonathan Katzy @katzy_jonathan · Aug 29

Why do #LLMs perform better on some programming tasks than others? In our #IEEESCAM2023 paper, we show early evidence for an inherent difference in representations of tokens between languages. arxiv.org/abs/2308.13354 @MalihehIzadi @avandeursen @ieeescam #ML #LLM #ML4SE
