Negar Arabzadeh

415 posts

@NegarEmpr

Postdoc at @UCBerkeley Sky Lab | Interested in Information Retrieval | 👩🏻‍💻Prev : @google, @MSFTResearch, @SpotifyResearch 📚:@UWaterloo

Berkeley, USA Joined April 2017
1K Following 1.4K Followers
Negar Arabzadeh retweeted
Atoosa Kasirzadeh
Atoosa Kasirzadeh@Dr_Atoosa·
Today is Nowruz, one of humanity's oldest New Year celebrations, alive across multiple cultures. #Nowruz reminds us that spring comes regardless, the trees bloom despite bombs, and life renews. In Iran, it marks the beginning of the Persian New Year. Despite the incessant bombings, full internet blackout, those Iranians who can are carrying on with their traditions: to live is to resist.
Nowruz is celebrated with a traditional display called the Haft Sin (meaning “Seven S’s”), a ceremonial table set with seven symbolic items, each beginning with the Persian letter “sin” (س). These items represent hopes and blessings for the new year:
1. Sabzeh (سبزه): Sprouted wheat or lentils, symbolizing rebirth and renewal.
2. Samanu (سمنو): A sweet wheat pudding, symbolizing abundance and prosperity.
3. Senjed (سنجد): Dried oleaster fruit, symbolizing love and affection.
4. Seer (سیر): Garlic, symbolizing health and medicine.
5. Seeb (سیب): Apples, symbolizing beauty and good health.
6. Somaq (سماق): Sumac berries, symbolizing the color of sunrise and patience.
7. Serkeh (سرکه): Vinegar, symbolizing age, wisdom, and tolerance.
#IranWar
Atoosa Kasirzadeh tweet media
16
181
791
14.3K
Negar Arabzadeh retweeted
Paniz
Paniz@Panizachi·
Iranians across the world right now: #Iran
17
788
3.1K
75.1K
Negar Arabzadeh retweeted
Niloofar
Niloofar@niloofar_mire·
My feed rn
Niloofar tweet media
2
2
67
4.9K
Negar Arabzadeh retweeted
Jimmy Lin
Jimmy Lin@lintool·
Help Me Choose (HMC) represents the first production deployment of the LLM council concept popularized by @karpathy and others - available on @yupp_ai for you to try! We wrote up a short blurb that I'll be presenting at the #WSDM2026 Industry Track: dl.acm.org/doi/10.1145/37…
Jimmy Lin@lintool

Today, we are launching “Help Me Choose” in @yupp_ai – a new product feature where multiple AIs critique each other and debate among themselves to help users synthesize diverse perspectives and get the best answer out of their own “AI council”.

4
14
45
6.7K
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
“I wish I had seen this before the submission deadline.”☹️ Missing related work can seriously hurt... Try our open-source #deepresearch system, DeepScholar-base, on your #ICML submission and make sure you’re not missing anything! github.com/guestrin-lab/d…
1
0
2
360
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
This Friday, @melissapan and I are giving a talk on measuring agents in production 🚀 If you’re building agents, don’t miss this! 🕘 Fri 9–10pm EST 📄 MAP: Measuring Agents in Production arxiv.org/abs/2512.04123
NICE AI Talk@academic_nice

🤩 NICE Talk 127 ⭐️#AI Agents Through 20+ Real-World Case Studies⭐️
📌 Stream it live, no app needed: click register and watch: luma.com/cfezxymd
🧐 How to turn AI agents into real-world production-level systems?
⚠️6️⃣8️⃣% of agents fail after 10 steps without #human intervention ⚙️
⚠️7️⃣0️⃣% rely on #prompts, rather than fine-tuning 📄
⚠️7️⃣4️⃣% are #evaluated solely by humans 🙇‍♂️
🎤 Invited Speaker: Melissa Z. Pan, PhD at UC Berkeley. "Efficient Agents & Composite AI systems."
🎤 Invited Speaker: Negar Arabzadeh (@NegarEmpr), PostDoc at UC Berkeley. "Let the same LLM be both player and evaluator."
🎙️ Host: Haolun Wu (@Haolun_Wu0203), PhD at Mila & McGill. "Trustworthy AI systems."
Talk start time
⏰ Pacific Time: 2026.1.23 (Fri) 18:00
⏰ USA Eastern Standard Time: 2026.1.23 (Fri) 21:00
⏰ Beijing Time: 2026.1.24 (Sat) 10:00
📌 YouTube livestream and summaries: youtube.com/live/hcQmCWzwX…
🙌 Measuring #Agents in #Production! 🥳 This talk will present #research on current #industry practices, highlighting real-world challenges in production environments and offering practitioners proven strategies from successful #case studies, bridging the gap between academic research and practical implementation.

0
4
12
1.8K
Negar Arabzadeh retweeted
Matei Zaharia
Matei Zaharia@matei_zaharia·
I'm thrilled to be co-organizing a new ACM research Conference on Agentic & AI Systems (CAIS) this spring! It's the first conference focused solely on this field. 🌐caisconf.org 📍San Jose, May 26-29, 2026 🗓️Paper deadline Feb 27th Follow @CAISconf for updates!
Matei Zaharia tweet media
6
65
369
35.8K
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
I tried StringSight to analyze and compare model outputs, especially for long, messy thinking traces across different models. It made it much easier to see where models diverge and why failures happen. If you’re working with long reasoning traces, give it a try!
Lisa Dunlap@lisabdunlap

🧵Tired of scrolling through your horribly long model traces in VSCode to figure out why your model failed? We made StringSight to fix this: an automated pipeline for analyzing your model outputs at scale. ➡️Demo: stringsight.com ➡️Blog: blog.stringsight.com

1
0
7
982
Negar Arabzadeh retweeted
Yichuan Wang
Yichuan Wang@YichuanM·
(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!
GIF
5
53
173
63.6K
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
In your #agentic workflow or #RAG pipeline, you need a suitable query reformulator! Try QueryGym 🏋️: plug in, benchmark, and iterate on query reformulation with ease. 💡 Includes SOTA reformulators, plug-and-play setup 🔌 Easy integration into different pipelines #RAG #Agents #LLMs
Amin Bigdeli@amin_bigdelii

Introducing #QueryGym 🏋️. A lightweight, reproducible toolkit for LLM-based query reformulation in RAG, agents, and conversational search. 🚀 Install: pip install querygym 📝arxiv.org/pdf/2511.15996 🔗 github.com/ls3-lab/QueryG… #LLMs #RAG #Agents #NLP #AI

1
0
17
1.6K
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
🎉 Thrilled to share that our paper "Adversarial Attacks against Neural Ranking Models via In-Context Learning" has been shortlisted for Best Paper Award @ACMSIGIR_AP #SIGIR
Negar Arabzadeh tweet media
4
3
31
1.7K
Negar Arabzadeh retweeted
DAIR.AI
DAIR.AI@dair_ai·
First large-scale study of AI agents actually running in production.
The hype says agents are transforming everything. The data tells a different story.
Researchers surveyed 306 practitioners and conducted 20 in-depth case studies across 26 domains. What they found challenges common assumptions about how production agents are built.
The reality: production agents are deliberately simple and tightly constrained.
1) Patterns & Reliability
- 68% execute at most 10 steps before requiring human intervention.
- 47% complete fewer than 5 steps.
- 70% rely on prompting off-the-shelf models without any fine-tuning.
- 74% depend primarily on human evaluation.
Teams intentionally trade autonomy for reliability.
Why the constraints? Reliability remains the top unsolved challenge. Practitioners can't verify agent correctness at scale. Public benchmarks rarely apply to domain-specific production tasks. 75% of interviewed teams evaluate without formal benchmarks, relying on A/B testing and direct user feedback instead.
2) Model Selection
The model selection pattern surprised researchers. 17 of 20 case studies use closed-source frontier models like Claude Sonnet 4, Claude Opus 4.1, and GPT o3. Open-source adoption is rare and driven by specific constraints: high-volume workloads where inference costs become prohibitive, or regulatory requirements preventing data sharing with external providers. For most teams, runtime costs are negligible compared to the human experts the agent augments.
3) Agent Frameworks
Framework adoption shows a striking divergence. 61% of survey respondents use third-party frameworks like LangChain/LangGraph. But 85% of interviewed teams with production deployments build custom implementations from scratch. The reason: core agent loops are straightforward to implement with direct API calls. Teams prefer minimal, purpose-built scaffolds over dependency bloat and abstraction layers.
4) Agent Control Flow
Production architectures favor predefined static workflows over open-ended autonomy. 80% of case studies use structured control flow. Agents operate within well-scoped action spaces rather than freely exploring environments. Only one case allowed unconstrained exploration, and that system runs exclusively in sandboxed environments with rigorous CI/CD verification.
5) Agent Adoption
What drives agent adoption? It's simply the productivity gains. 73% deploy agents primarily to increase efficiency and reduce time on manual tasks. Organizations tolerate agents taking minutes to respond because that still outperforms human baselines by 10x or more. 66% allow response times of minutes or longer.
6) Agent Evaluation
The evaluation challenge runs deeper than expected. Agent behavior breaks traditional software testing. Three case study teams report attempting but struggling to integrate agents into existing CI/CD pipelines. The challenge: nondeterminism and the difficulty of judging outputs programmatically. Creating benchmarks from scratch took one team six months to reach roughly 100 examples.
7) Human-in-the-loop
Human-in-the-loop evaluation dominates at 74%. LLM-as-a-judge follows at 52%, but every interviewed team using LLM judges also employs human verification. The pattern: LLM judges assess confidence on every response, automatically accepting high-confidence outputs while routing uncertain cases to human experts. Teams also sample 5% of production runs even when the judge expresses high confidence.
In summary, production agents succeed through deliberate simplicity, not sophisticated autonomy. Teams constrain agent behavior, rely on human oversight, and prioritize controllability over capability. The gap between research prototypes and production deployments reveals where the field actually stands.
Paper: arxiv.org/abs/2512.04123
Learn design patterns and how to build real-world AI agents in our academy: dair-ai.thinkific.com
DAIR.AI tweet media
55
227
1.2K
285.4K
Negar Arabzadeh retweeted
Mathew Jacob
Mathew Jacob@mat_jacob1002·
For improving RAG pipelines, it doesn't always have to be bigger = better. Make sure to visit the #NeurIPS ML4Sys workshop and learn from @melissapan and @NegarEmpr how to achieve 20x lower energy usage while maintaining high RAG quality!
Melissa Pan@melissapan

I am around the ML for Systems workshop @ NeurIPS today ⚙️ Looking forward to chatting and sharing more about our work Electro ⚡️ Also happy to chat about our new paper MAP 🗺️ or our neurips work MAST ⛵️

1
2
7
1.7K
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
Have you ever wondered how much you can optimize your RAG pipeline? 🤔 We show that you can get the same performance with up to 20× less energy ⚡️ Come check out our poster today at #NeurIPS2025 ML for systems Workshop to see how far you can push RAG efficiency! 🔋
Melissa Pan@melissapan

I am around the ML for Systems workshop @ NeurIPS today ⚙️ Looking forward to chatting and sharing more about our work Electro ⚡️ Also happy to chat about our new paper MAP 🗺️ or our neurips work MAST ⛵️

0
2
6
1.2K
Negar Arabzadeh
Negar Arabzadeh@NegarEmpr·
🚨 Today at the MATH AI workshop @NeurIPSConf ! 🚨 Wondering whether RAG works for reasoning-intensive tasks like math? Come visit our poster this morning to see how corpus reconstruction can yield gains of up to 30.3% on benchmarks like MATH500! #NeurIPS2025
Negar Arabzadeh tweet media
3
3
25
1.2K