Ebtesam
35 posts

Ebtesam
@ebtesamdotpy
AI/SE Research | CS PhD @GeorgeMasonU | Prev @MSFTResearch
Washington, DC · Joined October 2021
199 Following · 114 Followers
Ebtesam reposted

As we all know by now, reasoning models often generate longer responses, which raises compute costs. Now, this new paper (arxiv.org/abs/2504.05185) shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training.
In other words, when the model receives a negative reward (i.e., the answer is wrong), the math behind PPO causes the average per-token loss to become smaller when the response is longer. So the model is indirectly encouraged to lengthen its responses, even when the extra tokens don't actually help solve the problem.
What does response length have to do with the loss? When the reward is negative, a longer response dilutes the penalty across more tokens, which results in a lower (i.e., better) loss value even though the model is still getting the answer wrong.
So the model "learns" that longer responses reduce the punishment, even though they do nothing for correctness.
In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency.
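The dilution effect can be illustrated with a toy calculation (a minimal sketch, not the paper's exact PPO objective): if a single sequence-level reward is spread evenly across the tokens of a response, the per-token penalty shrinks as the response grows.

```python
# Toy sketch of penalty dilution under a negative reward.
# Assumption (illustrative only): the sequence reward is spread evenly
# across tokens, so each token's advantage is reward / num_tokens, and the
# per-token loss is its negation (positive = penalty when reward < 0).

def mean_per_token_loss(reward: float, num_tokens: int) -> float:
    """Average per-token loss for a response of `num_tokens` tokens."""
    per_token_advantage = reward / num_tokens
    return -per_token_advantage  # positive (a penalty) when reward < 0

short_wrong = mean_per_token_loss(reward=-1.0, num_tokens=100)
long_wrong = mean_per_token_loss(reward=-1.0, num_tokens=1000)

# Same wrong answer, same reward, but the longer response looks "less bad"
# per token — exactly the incentive toward longer outputs described above.
assert long_wrong < short_wrong
```

Under this toy model, a 10x longer wrong answer carries a 10x smaller per-token penalty, which is the length incentive the paper attributes to the RL loss rather than to any accuracy benefit.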

Ebtesam reposted

New post re: Devin (the AI SWE). We couldn't find many reviews of people using it for real tasks, so we went MKBHD mode and put Devin through its paces.
We documented our findings here. Would love to know if others have had a different experience.
answer.ai/posts/2025-01-…

Ebtesam reposted

Long overdue, a paper finally exposes the Emperor's New “Threats to Validity” Clothes in empirical software engineering research. Even better, it provides suggestions for improving the state of practice.


Prof. Per Runeson@SoftEngResGrp
Presenting our paper @ESEM_conf soon: Threats to Validity in Software Engineering – hypocritical paper section or essential analysis? Paper #OpenAccess dl.acm.org/doi/10.1145/36…
Ebtesam reposted

It's common to add personas in system prompts on the assumption that they help LLMs. However, by analyzing 162 roles x 4 LLMs x 2410 questions, we show that adding a persona mostly makes *no* statistically significant difference relative to the no-persona setting. Where there is a difference, it is *negative*. It's time to rethink the use of personas in system prompts!
Mingqian Zheng@elisazmq_zheng
🎙️ What if the way we prompt LLMs might actually hold them back? 🚨 Assigning personas like "helpful assistant" in system prompts might *not* be as helpful as we think! ✨ Check out our work accepted to Findings of @emnlpmeeting ✨ 📜 arxiv.org/abs/2311.10054 🧵 [1/7]
Ebtesam reposted

🚨 Inclusive tech research alert! 🚨
Are you a tech user who identifies as BIPOC (bit.ly/BIPOC_defined)? Or a researcher/practitioner who uses data in your work?
Share your experiences in our 20 min. survey→go.gmu.edu/EngagingTheMar…
IRBNet #: 1945546-2
#data #tech #trust
Ebtesam reposted

Happy birthday to Python creator Guido van Rossum. The open source language was named after comedy troupe Monty Python: bit.ly/2B8R7h6
Image via Midjourney

Ebtesam reposted

When I got started with programming, I debugged using printf() statements. Today, I debug with print() statements.
The purpose of debugging is to correct your mental model of what your code does, and no tool can do that for you. The best any tool can do is provide visibility into code execution, and targeted print statements already do a tremendous job at that.
MIT CSAIL@MIT_CSAIL
“The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.” — Brian Kernighan, co-creator of Unix
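A minimal example of the "judiciously placed print statements" idea (my own illustrative sketch, not from the tweet): modern Python f-strings with the `=` specifier echo both the expression and its value, which makes targeted prints cheap to add and easy to read.

```python
# Print-debugging sketch: trace the loop state to check your mental model
# of what the code does at each step.

def running_mean(xs: list[float]) -> float:
    """Return the mean of xs, printing intermediate state for visibility."""
    total = 0.0
    for i, x in enumerate(xs, start=1):
        total += x
        # f-string `=` specifier prints name and value, e.g. "i=2 x=2.0 ..."
        print(f"{i=} {x=} {total=} mean={total / i:.3f}")
    return total / len(xs)

result = running_mean([1.0, 2.0, 3.0])  # → 2.0
```

The prints give exactly the visibility into execution that the commentary above describes, with no debugger setup required.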