Marcus Min
@marcusjmin
PhD Computer Science Student @Penn
9 posts
Joined November 2022
47 Following · 31 Followers

Marcus Min retweeted
Baishakhi Ray @baishakhir ·
Introducing SemCoder, a semantic-aware Code LLM excelling in code generation and execution reasoning. Trained with high-quality data and a novel way of aligning execution, this 6.7B model outperforms GPT-3.5 and CodeLlama 34B. Link: arxiv.org/pdf/2406.01006 #LLMs #AI4Code
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir IdentityChain evaluates the NL-to-PL Accuracy, PL-to-NL Accuracy, and Self-Consistency of a model at the same time. Model developers and users can use it to pinpoint particular weaknesses of their models. We demonstrate 3 weaknesses found in current models using IdentityChain.
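The round-trip evaluation described in this thread can be sketched as a loop that alternates NL-to-PL and PL-to-NL generation, flagging the first step whose generated program fails the task's tests. This is a minimal illustration, not the actual IdentityChain implementation; all function names here are hypothetical stand-ins for model calls.

```python
# Hedged sketch of an IdentityChain-style self-consistency loop.
# gen_code (NL -> PL), gen_doc (PL -> NL), and the test harness are
# illustrative placeholders, not the paper's real API.

def identity_chain(nl_spec, tests, gen_code, gen_doc, max_steps=5):
    """Return the 0-based step where self-consistency breaks, or None."""
    spec = nl_spec
    for step in range(max_steps):
        code = gen_code(spec)                 # NL -> PL generation
        if not all(t(code) for t in tests):   # check semantics via tests
            return step                        # violation exposed here
        spec = gen_doc(code)                  # PL -> NL, feeds next round
    return None                                # consistent for all steps

# Toy stand-in for a code LLM that degrades on the second round.
calls = {"n": 0}
def fake_gen_code(spec):
    calls["n"] += 1
    return (lambda x: x + 1) if calls["n"] == 1 else (lambda x: x - 1)

def fake_gen_doc(code):
    return "add one to the input"

step = identity_chain("add one", [lambda f: f(1) == 2],
                      fake_gen_code, fake_gen_doc)  # violation at step 1
```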
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [6/6] To show the efficiency of IdentityChain, we use Greedy Decoding. We show empirically that most Self-Consistency violations can be exposed within the first 3 steps, even though we chose 5 steps for our experiments.
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [5/6] To show the effectiveness of IdentityChain, we compare our metric, the Test Output Match (TOM) score, with existing metrics. The TOM score has the highest correlation with human-judged ground truth.
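A Test Output Match score, as the name suggests, can be sketched as the fraction of test inputs on which two programs (e.g., the original and a regenerated one) produce the same output. The exact TOM definition in the paper may differ; this is an assumed, simplified version for illustration.

```python
# Hedged sketch of a Test Output Match (TOM)-style score: run two programs
# on the same test inputs and measure the fraction of matching outputs.
# This simplification is illustrative, not the paper's exact metric.

def tom_score(prog_a, prog_b, inputs):
    """Fraction of inputs on which both programs' behavior matches."""
    if not inputs:
        return 0.0

    def run(prog, x):
        try:
            return ("ok", prog(x))
        except Exception as e:          # treat a crash as its own "output"
            return ("err", type(e).__name__)

    matches = sum(run(prog_a, x) == run(prog_b, x) for x in inputs)
    return matches / len(inputs)

# Semantically equivalent programs score 1.0.
score = tom_score(lambda x: x * 2, lambda x: x + x, [0, 1, 2, 3])  # 1.0
```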
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [4/6] We evaluated 11 recent models, including GPT-4, and found that their performance drops by up to 78% compared to conventional accuracy evaluation. We observe that models with similar Conventional Accuracy can have very different Self-Consistency (GPT-4 vs. GPT-3.5).
Marcus Min @marcusjmin ·
@RobinDing3 @lucaburatti7 @saurabh2288 @baishakhir [2/6] Current evaluations of LLMs test the models on a wide range of tasks individually, while overlooking the relations across them: if a trustworthy model performs NL-to-PL Generation correctly, it should also perform PL-to-NL Generation correctly. We call this property Self-Consistency.