Hendrik Schuff

42 posts

@HendrikSchuff

Senior Data Scientist at @Zurich. Working on human-centered AI. Previous: Postdoc at @UKPLab, TU Darmstadt; PhD at @bosch_ai and @ims_stuttgart. https://t.co/oICxRf7D1B

Joined November 2016
199 Following, 177 Followers
Hendrik Schuff retweeted
UKP Lab @UKPLab
LLMs are increasingly prompted with different user profiles to solve subjective NLP tasks. What are the factors which determine what the model generates? Discover it in our #EACL2024 paper – learn more in this 🧵 (1/8). 📰 arxiv.org/abs/2309.07034 #NLProc #Prompting
Hendrik Schuff retweeted
UKP Lab @UKPLab
We're thrilled to invite you to be part of a unique project, stemming from a master's student's thesis at our Lab. Introducing SignalGPT: …nalgpt.ukp.informatik.tu-darmstadt.de A chat platform similar to #ChatGPT. Our aim? Delve into how users interact with AI-driven chat apps. (1/🧵) #NLProc
Hendrik Schuff @HendrikSchuff
@yoavgo We investigated this for explainability and analyzed the HotpotQA leaderboard. We found initial evidence that single-number benchmarks can gradually lose their validity, i.e., follow Goodhart's law, probably by overfitting: arxiv.org/abs/2210.07126 (in 4.1.3 + more in 5.3)
(((ل()(ل() 'yoav))))👾
Single-number benchmarks that include many tasks may be simple to use and highly adopted, but they also pretty much guarantee you will optimize an arbitrary and very likely suboptimal metric.
Jason Wei @_jasonwei

Moving from Google Brain to OpenAI, one of the biggest changes for me was the shift from doing individual/small-group research to working on a team with several dozen people. Specifically, working on a bigger team has led me to think more about UX for researchers. Some examples:
1. Great tooling accelerates research. Subpar tools hamper researchers by introducing unnecessary friction into thinking and analysis. Even small improvements like reducing clicks and scrolls can significantly increase researchers' productivity. Visualizations become particularly vital when working with multi-task models, helping to better evaluate tradeoffs between different models.
2. Simple design is key for the success of an evaluation benchmark. For example, GLUE/SuperGLUE, as well as MMLU/GSM8K, have a single number (higher is better), and everyone wants it to go up. They are easy to understand, download, and evaluate. Other benchmarks (e.g., BIG-Bench, probably one of the great benchmarks of the past two years IMHO) can have advantages such as much broader coverage, but are basically impossible to run and a pain in the ass to analyze. For Google's PaLM paper, I heard one engineer's full-time job was just to run BIG-Bench...
3. Strong documentation enables scaling communication without involvement. Imagine if you have to chat with someone to explain how something works. They have to wait for you to reply, and you have to stop your work to message them. This takes up two people's time. With good documentation, you don't have to be involved at all, and the other person doesn't have to wait for your responses, which accelerates both people a lot.

Hendrik Schuff retweeted
UKP Lab @UKPLab
A warm welcome to @HendrikSchuff, who has just started his postdoc at UKP Lab! 👋 Hendrik's research focuses on the explainability and human-centred evaluation of #NLProc systems. You can find out more about him on his personal website: hendrikschuff.de
Hendrik Schuff @HendrikSchuff
The takeaways are: Communicating importance with word heatmaps carries many unexpected biases, even from other words in the sentence. Our results question whether words are good units for heatmaps, and help understand where things can go wrong. 7/7
Hendrik Schuff @HendrikSchuff
This paper also confirms our previous paper's results in a reproduction study, which shows just how robust these biases are in different text domains (we replicate effects for word length, capitalization, dependency relation and display index) 6/7
Hendrik Schuff @HendrikSchuff
Our paper provides a brief introduction to the topic and focuses on applications and examples from NLP. We discuss various stages of conducting user studies including experimental designs, levels of measurement, crowdsourcing, and choosing appropriate statistical tests. 3/3
Hendrik Schuff @HendrikSchuff
Many NLP systems cannot be evaluated using proxy scores alone and require an (additional) human-centered evaluation. However, planning, conducting and evaluating user studies can be overwhelming for researchers getting started with human evaluation. 2/3