Anupam Datta
@datta_cs
275 posts

AI @SnowflakeDB, Ex- Co-Founder/President/Chief Scientist @truera_ai, Ex-Prof @CarnegieMellon, Visiting Prof & PhD CS @Stanford, BTech @IITKgp

Joined December 2014
424 Following · 1.5K Followers

Pinned Tweet
Anupam Datta@datta_cs·
What is your Agent's GPA or Goal-Plan-Action alignment? Observing that agent failures arise when their goals, plans, and actions are not aligned, we introduce a framework for evaluating and improving an agent’s GPA or Goal-Plan-Action alignment. Excited to have developed this course to share our learnings. Try it out hands-on and use the TruLens OSS project as you build and evaluate agents! Wonderful to collaborate with @_jreini and @AndrewYNg's @DeepLearningAI team on the course and with Allison Jia, Daniel Huang, Nikhil Vytla, Shayak Sen at @Snowflake, and John Mitchell at @Stanford on the research behind it. #agents #evals #trustworthyai
Andrew Ng@AndrewYNg

When data agents fail, they often fail silently - giving confident-sounding answers that are wrong, and it can be hard to figure out what caused the failure. "Building and Evaluating Data Agents" is a new short course created with @Snowflake and taught by @datta_cs and @_jreini that teaches you to build data agents with comprehensive evaluation built in. Skills you'll gain: - Build reliable LLM data agents using the Goal-Plan-Action framework and runtime evaluations that catch failures mid-execution - Use OpenTelemetry tracing and evaluation infrastructure to diagnose exactly where agents fail and systematically improve performance - Orchestrate multi-step workflows across web search, SQL, and document retrieval in LangGraph-based agents The result: visibility into every step of your agent's reasoning, so if something breaks, you have a systematic approach to fix it. Sign up to get started: deeplearning.ai/short-courses/…
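The Goal-Plan-Action idea from the course can be sketched as a runtime alignment check. This is a toy illustration only — the `AgentTrace` dataclass and the keyword matcher below are made-up stand-ins; a real evaluator (e.g. TruLens) would use LLM judges per check rather than string matching:

```python
# Toy sketch of Goal-Plan-Action (GPA) alignment checks. The dataclass and
# the keyword matcher are hypothetical stand-ins for LLM-judge evaluations.
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    goal: str                                         # what the user asked for
    plan: list[str] = field(default_factory=list)     # steps the agent proposed
    actions: list[str] = field(default_factory=list)  # steps actually executed

def plan_covers_goal(trace: AgentTrace) -> bool:
    """Goal->Plan: every keyword in the goal should appear in some plan step."""
    steps = " ".join(trace.plan).lower()
    return all(word in steps for word in trace.goal.lower().split())

def actions_follow_plan(trace: AgentTrace) -> bool:
    """Plan->Action: the agent should not execute steps it never planned."""
    return all(a in trace.plan for a in trace.actions)

def gpa_score(trace: AgentTrace) -> float:
    """Fraction of alignment checks that pass (0.0 to 1.0)."""
    checks = [plan_covers_goal(trace), actions_follow_plan(trace)]
    return sum(checks) / len(checks)

aligned = AgentTrace(
    goal="refund order",
    plan=["look up the refund policy", "issue refund for the order"],
    actions=["look up the refund policy", "issue refund for the order"],
)
drifted = AgentTrace(
    goal="refund order",
    plan=["look up the refund policy"],
    actions=["recommend a new product"],  # action never appeared in the plan
)
print(gpa_score(aligned))  # 1.0
print(gpa_score(drifted))  # 0.0
```

Running such checks mid-execution, rather than only on the final answer, is what lets an evaluator catch the "silent failures" described above.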


Anupam Datta retweeted

Łukasz Borchmann@LukaszBorchmann·
1/10 Are agents navigating enterprise data strategically, or just stumbling until they get lucky? To answer this, we introduce MADQA, which benchmarks not just final answers but also search trajectories. A collab with @UniofOxford, @UNC, and @huggingface. 🧵

Anupam Datta retweeted

Noam Brown@polynoamial·
Social media tends to frame the AI debate as two caricatures:
(A) Skeptics who think LLMs are doomed and AI is a bunch of hype.
(B) Fanatics who think we have all the ingredients and superintelligence is imminent.
But if you read what leading researchers actually say (beyond the headlines), there's a surprising amount of convergence:
1) The current paradigm is likely sufficient for massive economic and societal impact, even without further research breakthroughs.
2) More research breakthroughs are probably needed to achieve AGI/ASI. (Continual learning and sample efficiency are two examples that researchers commonly point to.)
3) We'll probably figure them out and get there within 20 years. @demishassabis said maybe in 5-10 years. @fchollet recently said about 5 years. @sama said ASI is possible in a few thousand days. @ylecun said about 10 years. @ilyasut said 5-20 years. @DarioAmodei is the most bullish, saying it's possible in 2 years, though he also said it might take longer.
None of them are saying ASI is a fantasy, or that it's probably 100+ years away. A lot of the disagreement is about what those breakthroughs will be and how quickly they will come. But all things considered, people in the field agree on a lot more than they disagree on.
Ilya Sutskever@ilyasut

One point I made that didn’t come across: - Scaling the current thing will keep leading to improvements. In particular, it won’t stall. - But something important will continue to be missing.

Anupam Datta@datta_cs·
ACL 2025: 10X growth in submissions in the last 10 years, 4X in the last 5 years. Program just kicked off! #ACL2025

Anupam Datta retweeted

Austin Vance@austinbv·
We just wrapped up @langchain Interrupt, and here are my 10 key takeaways!

1️⃣ Agents are here - I'm definitely riding a post-conference high. The energy was electric, and everyone was deeply engaged. The conference showcased real-world agent implementations happening now, such as AI SDRs, AI code writers, research agents, legal assistants, sales support, and more. These are agents in production at startups and enterprises, not just digital natives.

2️⃣ Evals are critical - Evals are the process of using fixture data to test an agent's performance with specific prompts. Every demo and discussion emphasized starting with evals. This is a place where traditional design will truly shine. Designers are experts at capturing human processes through journey mapping. This will be foundational to every agent. Evals are also a natural transition from TDD to a new process, Eval Driven Development (EDD). By embedding evals into the SDLC, running them at every commit, PR, and CI run, developers gain the freedom to iterate confidently on code, agents, and prompts.

3️⃣ Architectures are critical - Unlike traditional software, agent architectures cannot evolve haphazardly; they must be intentional from day one. Understanding how decisions, tool calls, and state are managed is essential. Migration between architectures is challenging, so starting with intentional, eval-backed designs is crucial. Particularly popular architectures are supervisor and swarm multi-agent setups, and sub-agent graphs. Supervisor architectures oversee multiple sub-agents, ensuring cohesive task execution and error management. Swarm architectures utilize many agents working in parallel, leveraging distributed intelligence to complete complex tasks efficiently. Sub-agent graphs define clear interactions and responsibilities among agents, essential for larger, interconnected tasks. Agents generally fall into two categories: active agents (like ChatGPT), requiring rapid, responsive interactions, and ambient agents, performing complex tasks in the background with a focus on accuracy and completeness.

4️⃣ Human-in-the-loop is essential - Thoughtfully integrating humans is more than oversight; it involves using human interactions to continuously improve agents through added memory and feedback loops. This transforms human roles from task execution to task management, elevating both human and agent effectiveness.

5️⃣ Think of agents as people, not systems - Companies using ambient agents often personalize them, assigning human-like roles and RBAC permissions instead of typical service-level integrations.

6️⃣ AI Driven Development is here to stay - AIDD is revolutionizing and accelerating development processes.

7️⃣ Software surrounding agents is complex yet essential - The agent graph forms the application's core intelligence, but the real complexity lies in custom tool-building and integrations. These integrations, be it through MCP, model binding, or dedicated services, demand thoughtful decision-making regarding technology choices, deployment methods, and SDK or service approaches. These tools must be robust, observable, and thoroughly tested, just like any traditional software.

8️⃣ Observability from day one is non-negotiable - LangSmith offers foundational visibility into agent workflows, critical given agents' probabilistic behaviors. Traditional debugging methods like print statements fall short; spans, traces, and comprehensive monitoring are essential. As observability evolves, LLMs will increasingly assist in alerting developers and operations to anomalies. Enhanced observability directly fuels more accurate evals and refined prompts.

9️⃣ Clear metrics are vital - Defining success and acceptable accuracy thresholds helps refine agents and demonstrates value. Unlike traditional software, an agent's success is measured directly against human performance.

🔟 Agent Engineer/Designer is the new job title - This emerging role merges prompting, product management, machine learning, software development, and UX journey mapping. Agent development requires great software. Effective prompting relies on understanding LLM reasoning, while practical ML skills enable fine-tuning and embedding models effectively. Product expertise ensures alignment with workflows and KPIs.

We're more ready than ever to stay focused on agentic AI. You'll find us building smart, intentional agents that integrate seamlessly into our customers' systems and drive real results. The work starts now! Thanks @hwchase17 and team for a great conference!
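The Eval Driven Development loop from takeaway 2 amounts to fixture data plus a scoring function, run like a test suite in CI. A minimal sketch, assuming a hypothetical `run_agent` stand-in for your actual agent:

```python
# Sketch of Eval Driven Development (EDD): fixture data plus a scoring
# function, run in CI like a test suite. `run_agent` is a hypothetical
# placeholder for a real agent.
def run_agent(prompt: str) -> str:
    """Placeholder agent: returns a canned answer per topic."""
    canned = {"refund policy": "Refunds are accepted within 30 days."}
    return canned.get(prompt, "I don't know.")

FIXTURES = [
    # (prompt, substring the answer must contain)
    ("refund policy", "30 days"),
]

def run_evals(threshold: float = 1.0) -> float:
    """Score the agent on all fixtures; fail loudly if below threshold."""
    passed = sum(expected in run_agent(prompt) for prompt, expected in FIXTURES)
    score = passed / len(FIXTURES)
    assert score >= threshold, f"eval score {score:.2f} below {threshold}"
    return score

print(run_evals())  # 1.0 — a CI job can gate merges on this call
```

Because `run_evals` raises on regression, wiring it into every commit and PR gives exactly the iterate-with-confidence loop the takeaway describes.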

Anupam Datta retweeted

Pushmeet Kohli@pushmeet·
Excited to announce AlphaEvolve: a powerful AI coding agent developed by our team at @GoogleDeepMind that is able to discover impactful new algorithms for important problems in maths and computing by combining the creativity of large language models with automated evaluators.
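The generator-plus-automated-evaluator recipe can be illustrated with a toy evolutionary loop. This is purely a sketch of the idea — here random mutation stands in for LLM creativity, and the target function stands in for a real algorithmic benchmark; the actual AlphaEvolve system evolves real code:

```python
# Toy sketch of the AlphaEvolve recipe: a generator proposes candidates
# (random mutation standing in for an LLM) and an automated evaluator
# scores them; only improvements survive. Purely illustrative.
import random

random.seed(0)

def evaluator(candidate: list[float]) -> float:
    """Automated evaluator: higher is better (candidate should sum to 10)."""
    return -abs(sum(candidate) - 10.0)

def mutate(candidate: list[float]) -> list[float]:
    """Stand-in for LLM creativity: perturb one element at random."""
    out = candidate[:]
    i = random.randrange(len(out))
    out[i] += random.uniform(-1, 1)
    return out

def evolve(seed: list[float], generations: int = 200) -> list[float]:
    best = seed
    for _ in range(generations):
        child = mutate(best)
        if evaluator(child) > evaluator(best):  # keep only improvements
            best = child
    return best

best = evolve([0.0, 0.0, 0.0])
print(round(sum(best), 2))  # converges close to the target of 10
```

The key design point mirrored here is that the evaluator is automatic and cheap, so the loop can run many generations without a human in it.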

Anupam Datta retweeted

Andrew Ng@AndrewYNg·
New course: MCP: Build Rich-Context AI Apps with Anthropic. Learn to build AI apps that access tools, data, and prompts using the Model Context Protocol in this short course, created in partnership with Anthropic @AnthropicAI and taught by Elie Schoppik @eschoppik, its Head of Technical Education. Connecting AI applications to external systems that bring rich context to LLM-based applications has often meant writing custom integrations for each use case. MCP is an open protocol that standardizes how LLMs access tools, data, and prompts from external sources, and simplifies how you provide context to your LLM-based applications. For example, you can provide context via third-party tools that let your LLM make API calls to search the web, access data from local docs, retrieve code from a GitHub repo, and so on. MCP, developed by Anthropic, is based on a client-server architecture that defines the communication details between an MCP client, hosted inside the AI application, and an MCP server that exposes tools, resources, and prompt templates. The server can be a subprocess launched by the client that runs locally or an independent process running remotely. In this hands-on course, you'll learn the core architecture behind MCP. You’ll create an MCP-compatible chatbot, build and deploy an MCP server, and connect the chatbot to your MCP server and other open-source servers. 
Here’s what you’ll do:
- Understand why MCP makes AI development less fragmented and standardizes connections between AI applications and external data sources
- Learn the core components of MCP's client-server architecture and the underlying communication mechanism
- Build a chatbot with custom tools for searching academic papers, and transform it into an MCP-compatible application
- Build a local MCP server that exposes tools, resources, and prompt templates using FastMCP, and test it using MCP Inspector
- Create an MCP client inside your chatbot to dynamically connect to your server
- Connect your chatbot to reference servers built by Anthropic’s MCP team, such as filesystem, which implements filesystem operations, and fetch, which extracts contents from the web as markdown
- Configure Claude Desktop to connect to your server and others, and explore how it abstracts away the low-level logic of MCP clients
- Deploy your MCP server remotely and test it with the Inspector or other MCP-compatible applications
- Learn about the roadmap for future MCP development, such as multi-agent architecture, MCP registry API, server discovery, authorization, and authentication

MCP is an exciting and important technology that lets you build rich-context AI applications that connect to a growing ecosystem of MCP servers, with minimal integration work. Please sign up here! deeplearning.ai/short-courses/…
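The core MCP pattern — a server exposes named, schema-described tools, and a client discovers and calls them by name — can be illustrated with a toy registry. Note this is not the real `mcp` SDK; the `ToolServer` class and its methods are invented here purely to show the shape of the protocol:

```python
# Toy illustration of the MCP idea: a server exposes named tools with
# discoverable signatures, and a client dispatches calls by name.
# `ToolServer` is a made-up stand-in, not the real Anthropic `mcp` SDK.
import inspect

class ToolServer:
    def __init__(self):
        self._tools = {}

    def tool(self, fn):
        """Register a function as a callable tool (used as a decorator)."""
        self._tools[fn.__name__] = fn
        return fn

    def list_tools(self):
        """What a client sees when it discovers the server's tools."""
        return {name: str(inspect.signature(fn)) for name, fn in self._tools.items()}

    def call(self, name, **kwargs):
        """Dispatch a client's tool call by name."""
        return self._tools[name](**kwargs)

server = ToolServer()

@server.tool
def search_papers(topic: str) -> list[str]:
    """Pretend search over an academic index."""
    return [f"{topic}: paper {i}" for i in range(2)]

print(server.list_tools())
print(server.call("search_papers", topic="MCP"))
```

The value of standardizing this discover-then-call handshake is that any client can use any server's tools without a custom integration per pair, which is the point the course announcement makes.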

Anupam Datta retweeted

Casper Hansen@casper_hansen_·
Almost a 5x speedup in vLLM🤯 I was able to push a finetuned Mistral Nemo from 110 tokens/s to a peak of 517 tokens/s with an acceptance rate of 57.7%. This is with Suffix Decoding from ArcticInference⚡

Anupam Datta retweeted

Dwarak Rajagopal@dwarak·
Exciting news! The PyTorch Foundation’s expansion with vLLM and DeepSpeed is a game-changer for open-source AI. Can’t wait to see the innovations this brings! As a premier member, Snowflake is excited to join the Board and help grow the OSS community. Big things ahead! 🚀 #PyTorch #AI #OpenSource @Snowflake
PyTorch@PyTorch

PyTorch Foundation has expanded into an umbrella foundation. @vllm_project and @DeepSpeedAI have been accepted as hosted projects, advancing community-driven AI across the full lifecycle. Supporting quotes provided by the following members: @AMD, @Arm, @AWS, @Google, @Huawei, @huggingface, @IBM, @Intel, @LightningAI, @Meta, @NVIDIA, and @Snowflake. 🔗💡 Read the full announcement: hubs.la/Q03lmJNH0 #PyTorchFoundation #PyTorch #OpenSourceAI #vLLM #DeepSpeed


Anupam Datta@datta_cs·
Exciting result from Snowflake AI Research on speculative decoding. 4x faster LLM Inference for coding agents like @allhands_ai. Available in open source for you to play with. And take a look at the blog post by @aurickQ for details.
Aurick Qiao@aurickq

Excited to share our work on Speculative Decoding @Snowflake AI Research! 🚀 4x faster LLM inference for coding agents like OpenHands @allhands_ai 💬 2.4x faster LLM inference for interactive chat 💻 Open-source via Arctic Inference as a plugin for @vllm_project 🧵


Anupam Datta retweeted

Andrej Karpathy@karpathy·
Noticing myself adopting a certain rhythm in AI-assisted coding (i.e. code I actually and professionally care about, in contrast to vibe code).

1. Stuff everything relevant into context (this can take a while in big projects. If the project is small enough, just stuff everything, e.g. `files-to-prompt . -e ts -e tsx -e css -e md --cxml --ignore node_modules -o prompt.xml`)
2. Describe the next single, concrete incremental change we're trying to implement. Don't ask for code; ask for a few high-level approaches, with pros/cons. There are almost always a few ways to do things, and the LLM's judgement is not always great. Optionally, make it concrete.
3. Pick one approach, ask for first draft code.
4. Review / learning phase: (manually...) pull up all the API docs in a side browser for functions I haven't called before or am less familiar with, ask for explanations, clarifications, changes; wind back and try a different approach.
5. Test.
6. Git commit. Ask for suggestions on what we could implement next. Repeat.

Something like this feels more along the lines of the inner loop of AI-assisted development. The emphasis is on keeping a very tight leash on this new over-eager junior intern savant with encyclopedic knowledge of software, but who also bullshits you all the time, has an over-abundance of courage, and shows little to no taste for good code. And the emphasis is on being slow, defensive, careful, paranoid, and on always taking the inline learning opportunity, not delegating. Many of these stages are clunky and manual and aren't made explicit or super well supported yet in existing tools. We're still very early, and so much can still be done on the UI/UX of AI-assisted coding.

Anupam Datta retweeted

Weaviate AI Database@weaviate_io·
Don’t debug with your eyes closed 👀 The Weaviate Query Agent is here to help you with all of your research tasks. Navigating through any number of collections, deciding whether to query or aggregate, taking the load off your shoulders when it comes to sifting through a maze of data. BUT, as with any application, there’s always room to improve, and there’s always a need to see what’s happening behind the scenes. Cue the new TruLens integration. This integration wraps itself around our Query Agent with just a few lines of code and:
- Lets you decide how you want your agent to be evaluated
- Adds logs and traces to your TruLens dashboard
- Allows you to browse through the sources that were used to generate responses and pinpoint problems
All there to let you tune your agent until you are happy with the results! 📚 Learn more about the integration: weaviate.io/developers/int… 🧑‍🍳 Get started with our new recipe, courtesy of our friends from Snowflake: github.com/weaviate/recip…

Anupam Datta retweeted

Anthropic@AnthropicAI·
New Anthropic research: AI values in the wild. We want AI models to have well-aligned values. But how do we know what values they’re expressing in real-life conversations? We studied hundreds of thousands of anonymized conversations to find out.

Andrew Ng@AndrewYNg·
I’ve noticed that many GenAI application projects put in automated evaluations (evals) of the system’s output probably later — and rely on humans to judge outputs longer — than they should. This is because building evals is viewed as a massive investment (say, creating 100 or 1,000 examples, and designing and validating metrics) and there’s never a convenient moment to put in that up-front cost. Instead, I encourage teams to think of building evals as an iterative process. It’s okay to start with a quick-and-dirty implementation (say, 5 examples with unoptimized metrics) and then iterate and improve over time. This allows you to gradually shift the burden of evaluations away from humans and toward automated evals. I wrote previously in The Batch about the importance and difficulty of creating evals. Say you’re building a customer-service chatbot that responds to users in free text. There’s no single right answer, so many teams end up having humans pore over dozens of example outputs with every update to judge if it improved the system. While techniques like LLM-as-judge are helpful, the details of getting this to work well (such as what prompt to use, what context to give the judge, and so on) are finicky to get right. All this contributes to the impression that building evals requires a large up-front investment, and thus on any given day, a team can make more progress by relying on human judges than figuring out how to build automated evals. I encourage you to approach building evals differently. It’s okay to build quick evals that are only partial, incomplete, and noisy measures of the system’s performance, and to iteratively improve them. They can be a complement to, rather than replacement for, manual evaluations. Over time, you can gradually tune the evaluation methodology to close the gap between the evals’ output and human judgments. 
For example:
- It’s okay to start with very few examples in the eval set, say 5, and gradually add to them over time — or subtract them if you find that some examples are too easy or too hard, and not useful for distinguishing between the performance of different versions of your system.
- It’s okay to start with evals that measure only a subset of the dimensions of performance you care about, or measure narrow cues that you believe are correlated with, but don’t fully capture, system performance. For example if, at a certain moment in the conversation, your customer-support agent is supposed to (i) call an API to issue a refund and (ii) generate an appropriate message to the user, you might start off measuring only whether or not it calls the API correctly and not worry about the message. Or if, at a certain moment, your chatbot should recommend a specific product, a basic eval could measure whether or not the chatbot mentions that product without worrying about what it says about it. So long as the output of the evals correlates with overall performance, it’s fine to measure only a subset of things you care about when starting.

The development process thus comprises two iterative loops, which you might execute in parallel:
- Iterating on the system to make it perform better, as measured by a combination of automated evals and human judgment;
- Iterating on the evals to make them correspond more closely to human judgment.

As with many things in AI, we often don’t get it right the first time. So it’s better to build an initial end-to-end system quickly and then iterate to improve it. We’re used to taking this approach to building AI systems. We can build evals the same way. To me, a successful eval meets the following criteria. Say we currently have system A, and we might tweak it to get a system B:
- If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
- If A and B have similar performance, their eval scores should be similar.

Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly. This is a similar philosophy to error analysis in building machine learning algorithms, only instead of focusing on errors of the machine learning algorithm's output — such as when it outputs an incorrect label — we focus on “errors” of the evals — such as when they incorrectly rank two systems A and B, so the evals aren’t helpful in choosing between them. Relying purely on human judgment is a great way to get started on a project. But for many teams, building evals as a quick prototype and iterating to something more mature lets you put in evals earlier and accelerate your progress. [Original text: deeplearning.ai/the-batch/issu… ]
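The eval "error" criterion in the post — an eval errs on a pair of systems when its ranking contradicts a skilled human judge — can be written down directly. A minimal sketch; the scores and the tie margin are invented for illustration:

```python
# Sketch of the eval-error check described above: an eval is "in error" on
# a pair of systems (A, B) when its ranking contradicts the human judge's.
# All scores and the tie margin are invented for illustration.
def ranking(score_a: float, score_b: float, tie_margin: float = 0.05) -> str:
    """Reduce a pair of scores to a ranking: 'A', 'B', or 'tie'."""
    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "A" if score_a > score_b else "B"

def eval_errors(pairs) -> int:
    """Count pairs where the automated eval contradicts the human judge."""
    errors = 0
    for human_a, human_b, eval_a, eval_b in pairs:
        if ranking(human_a, human_b) != ranking(eval_a, eval_b):
            errors += 1
    return errors

pairs = [
    (0.90, 0.60, 0.85, 0.55),  # both prefer A: eval agrees
    (0.70, 0.72, 0.40, 0.80),  # humans see a tie, eval strongly prefers B: error
]
print(eval_errors(pairs))  # 1
```

Driving this count toward zero over a growing set of (A, B) pairs is the second iterative loop the post describes: tuning the eval itself against human judgment.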

Anupam Datta@datta_cs·
Very true @AndrewYNg. Starting with simple evals early in the dev cycle and gradually building up the depth and breadth of the evals is very powerful. We have observed this in the TruLens OSS project for evaluating LLM apps as well as in the process of building up various LLM apps at @SnowflakeDB to enable agents, text2sql, search and more. Often we start with a small ground truth dataset that we grow over time. In parallel, we build LLM Judges and refine their criteria (prompts). We have recently also had success with automatically optimizing the prompts of LLM Judges. snowflake.com/en/engineering…

Anupam Datta retweeted

sridhar@RamaswmySridhar·
AI is not a bet—it’s a business imperative. 💰The average return on AI investments is $1.41 for every $1 invested. That number will only go up. I speak with customers every week—most teams have AI use cases they can execute right now. Here’s a look at what’s holding them back, what leaders need to do to win in AI and who’s thriving today with enterprise AI on Snowflake More on AI’s ROI in our latest research: snowflake.com/en/blog/gen-ai…

Anupam Datta retweeted

Hao AI Lab@haoailab·
🚀 We are thrilled to release the code for ReFoRCE — a powerful Text-to-SQL agent with Self-Refinement, Format Restriction, and Column Exploration! 🥇 Ranked #1 on the Spider 2.0 Leaderboard, a major step toward practical, enterprise-ready systems, tackling both the Spider 2.0-snow and Spider 2.0-lite subtasks! 🏆 Accepted to the ICLR 2025 VerifAI Workshop! We look forward to seeing how our approach can advance the state of Text-to-SQL research! Please check the links for more details: 📄 Paper: arxiv.org/abs/2502.00675 💻 Code: github.com/hao-ai-lab/ReF… 📝 Blog: hao-ai-lab.github.io/blogs/reforce/
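The self-refinement part of the recipe — execute the candidate SQL, feed the error message back, retry with a revised query — can be sketched with sqlite3. The hard-coded `refine` step below is a made-up stand-in for the LLM's refinement call; ReFoRCE's actual pipeline is far richer:

```python
# Toy sketch of text-to-SQL self-refinement: execute a candidate query and,
# on failure, feed the database error back to get a revised candidate.
# The `refine` function is a hypothetical stand-in for an LLM call.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

def refine(previous_sql: str, error: str) -> str:
    """Stand-in for the LLM refinement step: repair the reported column name."""
    return previous_sql.replace("amount", "total")

def self_refine(sql: str, max_rounds: int = 3):
    for _ in range(max_rounds):
        try:
            return db.execute(sql).fetchall()  # success: return result rows
        except sqlite3.OperationalError as e:
            sql = refine(sql, str(e))          # failure: revise and retry
    raise RuntimeError("no executable query found")

# First candidate references a non-existent column; refinement repairs it.
print(self_refine("SELECT SUM(amount) FROM orders"))  # [(29.5,)]
```

Grounding each retry in a concrete execution error, rather than regenerating blind, is what makes the refinement loop converge on executable SQL.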