GenBench

193 posts

@GenBench

State-of-the-art generalisation testing in NLP. Tag us for a RT of your NLP generalisation paper tweet!

Joined April 2022

15 Following 436 Followers

Pinned Tweet

GenBench@GenBench·2 May

The GenBench workshop is back! Do you work on generalisation (benchmarking) in #NLProc? Submit to the 2nd edition (genbench.org/workshop/) co-located with #EMNLP2024. We have a regular track and a ✨collaborative benchmarking task (CBT)✨ that's fully LLM-focused this year (1/6)

12.6K

GenBench@GenBench·17 Nov

@robinomial @mrdrozdov @_dieuwke_ @najoungkim @kylelostat @sameer_ Two independently arrived at but similar conclusions 😁

Robin Jia@robinomial·17 Nov

@GenBench @mrdrozdov @_dieuwke_ @najoungkim @kylelostat @sameer_ Interesting, my first thought is that overfitting is a subset of reward hacking 😅 overfitting is hacking the supervised learning “reward function” but the reward function could be different (and have more degenerate solutions)

262

Dieuwke Hupkes@_dieuwke_·16 Nov

Not EMNLP'd out yet? Join the @GenBench workshop on generalisation in NLP today! 🤩 genbench.org/workshop/ Location: Brickell

2.5K

GenBench@GenBench·17 Nov

That's a wrap! We (@glnmario, @christos_c, @_dieuwke_, @vernadankers, @khuyagbaatar_b, @a_kazemnejad & @ryandcotterell) thank all presenters, authors, reviewers and attendees!! The keynotes, the cats 😻, the posters, the talks and the lively panel: it was fantastic👏 🔥

2.9K

GenBench@GenBench·17 Nov

@mrdrozdov @_dieuwke_ @najoungkim @kylelostat @sameer_ @robinomial We're discussing, our initial response: reward hacking is a subset of overfitting, but also, what do you mean with reward hacking? 😁

154

GenBench@GenBench·17 Nov

@mrdrozdov @_dieuwke_ @najoungkim @kylelostat @sameer_ @kylelostat @sameer_ @robinomial any thoughts? 😁

125

GenBench retweeted

Najoung Kim 🫠@najoungkim·17 Nov

so proud of @HayleyRossLing for getting a best paper award at @GenBench this year!! 🎉🪅🎉 I'm sure @TeaAnd_OrCoffee would be too :) check out our paper and share if you think homemade cats are cats!

Hayley Ross@HayleyRossLing

New paper with @najoungkim and @TeaAnd_OrCoffee testing if LLMs can draw adjective-noun inferences like humans! Turns out they often can, and even generalize to unseen combinations. But they're more optimistic about "artificial intelligence" than humans. arxiv.org/abs/2410.17482

3.5K

GenBench retweeted

Kanishka Misra 🌊@kanishkamisra·17 Nov

Woohoo go tinlab! Congrats @HayleyRossLing @TeaAnd_OrCoffee @najoungkim!!

GenBench@GenBench

Best paper!

1.3K

GenBench@GenBench·17 Nov

Congratulations!

Najoung Kim 🫠@najoungkim

236

GenBench@GenBench·17 Nov

Congrats to all the authors!

GenBench@GenBench·17 Nov

Best paper!

1.4K

GenBench@GenBench·17 Nov

Closing remarks and best paper award by @vernadankers

906

GenBench@GenBench·17 Nov

And we also have an honourable mention!

103

GenBench@GenBench·17 Nov

Come listen to the hot takes of our panelists in the Brickell room! Do we still need generalisation evaluation? 🧐 #GenBench2024 #EMNLP2024

1.5K

GenBench@GenBench·16 Nov

Still at the poster session? Come join us for keynote 3 by @sameer_!

738

GenBench@GenBench·16 Nov

Did you miss the GenBench poster session? Don't worry we've got you, here are (nearly all) posters! 😉 #GenBench2024 #EMNLP2024 Next up: keynote by Sameer Singh at 3!

830

GenBench@GenBench·16 Nov

Last spotlight presentation: MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models aclanthology.org/2024.genbench-… Unfortunately the authors couldn't make it, the work is kindly presented by their colleague Hengyi Wang 🙏

GenBench@GenBench·16 Nov

Continuing with Bastian Bunzeck, presenting The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns aclanthology.org/2024.genbench-…