Marcus Min
@marcusjmin
PhD Computer Science Student @Penn
9 posts
Joined November 2022
47 Following · 31 Followers

Marcus Min retweeted
Baishakhi Ray @baishakhir ·
Introducing SemCoder, a semantic-aware Code LLM excelling in code generation and execution reasoning. Trained with high-quality data and a novel way of aligning execution, this 6.7B model outperforms GPT-3.5 and CodeLlama 34B. Link: arxiv.org/pdf/2406.01006 #LLMs #AI4Code
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir IdentityChain evaluates the NL-to-PL Accuracy, PL-to-NL Accuracy, and Self-Consistency of a model at the same time. Model developers and users can use it to pinpoint particular weaknesses of their models. We demonstrate 3 weaknesses found in current models using IdentityChain.
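The round-trip evaluation described in this thread can be sketched as a loop that alternates NL-to-PL and PL-to-NL generation, flagging the first step whose generated program fails the task's tests. This is a minimal illustration, not the actual IdentityChain implementation; all function names here are hypothetical stand-ins for model calls.

```python
# Hedged sketch of an IdentityChain-style self-consistency loop.
# gen_code (NL -> PL), gen_doc (PL -> NL), and the test harness are
# illustrative placeholders, not the paper's real API.

def identity_chain(nl_spec, tests, gen_code, gen_doc, max_steps=5):
    """Return the 0-based step where self-consistency breaks, or None."""
    spec = nl_spec
    for step in range(max_steps):
        code = gen_code(spec)                 # NL -> PL generation
        if not all(t(code) for t in tests):   # check semantics via tests
            return step                        # violation exposed here
        spec = gen_doc(code)                  # PL -> NL, feeds next round
    return None                                # consistent for all steps

# Toy stand-in for a code LLM that degrades on the second round.
calls = {"n": 0}
def fake_gen_code(spec):
    calls["n"] += 1
    return (lambda x: x + 1) if calls["n"] == 1 else (lambda x: x - 1)

def fake_gen_doc(code):
    return "add one to the input"

step = identity_chain("add one", [lambda f: f(1) == 2],
                      fake_gen_code, fake_gen_doc)  # violation at step 1
```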
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [6/6] To show the efficiency of IdentityChain, we use Greedy Decoding. We show empirically that most Self-Consistency violations can be exposed within the first 3 steps, even though we chose 5 steps for our experiments.
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [5/6] To show the effectiveness of IdentityChain, we compare our metric, the Test Output Match (TOM) score, with existing metrics. The TOM score has the highest correlation with human-judged ground truth.
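A Test Output Match score, as the name suggests, can be sketched as the fraction of test inputs on which two programs (e.g., the original and a regenerated one) produce the same output. The exact TOM definition in the paper may differ; this is an assumed, simplified version for illustration.

```python
# Hedged sketch of a Test Output Match (TOM)-style score: run two programs
# on the same test inputs and measure the fraction of matching outputs.
# This simplification is illustrative, not the paper's exact metric.

def tom_score(prog_a, prog_b, inputs):
    """Fraction of inputs on which both programs' behavior matches."""
    if not inputs:
        return 0.0

    def run(prog, x):
        try:
            return ("ok", prog(x))
        except Exception as e:          # treat a crash as its own "output"
            return ("err", type(e).__name__)

    matches = sum(run(prog_a, x) == run(prog_b, x) for x in inputs)
    return matches / len(inputs)

# Semantically equivalent programs score 1.0.
score = tom_score(lambda x: x * 2, lambda x: x + x, [0, 1, 2, 3])  # 1.0
```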
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [4/6] We evaluated 11 recent models, including GPT-4, and found that their performance drops by up to 78% compared to conventional accuracy evaluation. We observe that models with similar Conventional Accuracy can have very different Self-Consistency (GPT-4 vs. GPT-3.5).
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [2/6] Current evaluations of LLMs test the models on a wide range of tasks individually, while overlooking the relations across them: if a trustworthy model performs NL-to-PL Generation correctly, it should also perform PL-to-NL Generation correctly. We call this property Self-Consistency.