André Balleyguier

1.3K posts

@andrebalg

Global Head of AI Solutions Engineering @Snorkel. A French in London. Interests in AI, Machine Learning, fitness, cryptos. Tweets reflect my own views.

London · Joined February 2010
851 Following · 522 Followers
André Balleyguier retweeted
Anthropic@AnthropicAI·
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing
André Balleyguier@andrebalg·
@justrikkotbh @cgtwts Judging the fastest-growing product in the history of humanity (because of how good it is!) on recent instability from spiking demand, which still left 98%+ availability, is really peak delulu! 🤯
rikko@justrikkotbh·
@cgtwts my honest reaction to that information
CG@cgtwts·
Anthropic CEO: “I have engineers within Anthropic who don’t write any code; they just let Claude write the code, and they edit it and look it over.” “At Anthropic, writing code means designing the next version of Claude itself, so we essentially have Claude designing the next version of Claude, not completely but most of it.” In the last 52 days, the Claude team dropped 50+ major feature launches. This is literally INSANE.
Claude@claudeai

Your work tools in Claude are now available on mobile. Explore Figma designs, create Canva slides, check Amplitude dashboards, all from your phone. Give it a try: claude.com/download

André Balleyguier@andrebalg·
@aarondotdev Did you read the paper? It studied junior devs *learning* a new library. The ones who used AI to understand concepts scored just as well as the no-AI group. The ones who copy-pasted blindly didn't. The conclusion isn't "AI is bad", it's that you need to "think while you use AI".
aaron@aarondotdev·
Anthropic themselves found that vibecoding hinders SWEs’ ability to read, write, debug, and understand code. Not only that, but AI-generated code doesn’t result in a statistically significant increase in speed. Don’t let your managers scare you into increased productivity. Show them this paper straight from Anthropic.
André Balleyguier@andrebalg·
@erroneous_input @ns123abc @sama The Pentagon agreed to Anthropic's terms (i.e. "no use for fully autonomous weapons or mass surveillance of Americans"), but they added the red line "except bulk data analysis", which is the same as saying they can use it for anything, because any AI system does bulk data analysis...
erroneous input@erroneous_input·
@ns123abc @sama I’m not sure I understand here. You said that he claimed they agreed to all terms except bulk data analysis. The fact he would reject that is unhinged. But besides that, how can it be claimed OAI agreed to Palantir’s offer if the only holdout was bulk data?
NIK@ns123abc·
🚨BREAKING: ANTHROPIC CEO JUST ENDED OPENAI @sama
After getting blacklisted by the Pentagon, Dario sits down and writes the most unhinged CEO memo in Silicon Valley history:
>calls OpenAI's Pentagon deal "safety theater"
>says the Trump admin hates them because they haven't "given dictator-style praise to Trump (while Sam has)"
>names Greg Brockman's $25M Trump super PAC donation by name
>says they supported AI regulation, "which is against their agenda"
>says they "told the truth about AI policy issues like job displacement"
THE PALANTIR EXPOSÉ:
>reveals Palantir's actual pitch to Anthropic during negotiations
>"you have some unhappy employees, you need to offer them something that placates them or makes what is happening invisible to them, and that's the service we provide"
>Palantir's pitch wasn't safety. it was CONCEALMENT
>Palantir offered a "classifier" to detect red line violations
>Dario: models get jailbroken, monitoring only works in a few cases, "maybe 20% real and 80% safety theater"
>says Palantir offered OpenAI the same package
>OpenAI accepted it
>says Altman is "peddling narratives" to his own employees
>calls OpenAI employees "sort of a gullible bunch" due to "selection effects"
>says the "attempted spin/gaslighting" isn't working on the public or media but IS working on "some Twitter morons" rofl
>says his main concern is making sure it doesn't work on OpenAI employees too
BTW near the end of negotiations the Pentagon offered to accept ALL of Anthropic's terms if they deleted ONE phrase:
>"analysis of bulk acquired data"
>Anthropic refused
>the same surveillance clause the Pentagon said they didn't even want to use
>meanwhile Altman told his employees: "you don't get to weigh in on that" 💀
IT'S OVER. ANTHROPIC WON, DEAL WITH IT
André Balleyguier@andrebalg·
@vdub12 @ns123abc @sama AI is not a military weapon! AI could *support* a much bigger system to create fully autonomous weapons... But clearly Anthropic does not have access to all these other components.
Winston Smith@vdub12·
@ns123abc @sama This entire thing has been political from the very beginning. Anthropic is lying because they don't like Trump. The Pentagon never asked for what Anthropic claims they did. They simply don't think a private company should have a kill switch on military weapons platforms.
André Balleyguier@andrebalg·
@vdub12 @JoelStransky @ns123abc @sama If you think the AI is a military weapons system you are very wrong. The AI can support the creation of autonomous military weapons, but is far from being a weapon on its own. It would be the same as saying a bullet is a weapon if you have no gun.
Winston Smith@vdub12·
@JoelStransky @ns123abc @sama What does that have to do with what I said? The Pentagon offered them a contract but did not want them to retain control of a military weapons system. It's not like Boeing builds aircraft but then the military has to ask permission to use the keys.
André Balleyguier retweeted
vitalik.eth@VitalikButerin·
It will significantly increase my opinion of @Anthropic if they do not back down, and honorably eat the consequences. (For those who are not aware, so far they have been maintaining the two red lines of "no fully autonomous weapons" and "no mass surveillance of Americans". Actually a very conservative and limited posture, it's not even anti-military. IMO fully autonomous weapons and mass privacy violation are two things we all want less of, so in my ideal world anyone working on those things gets access to the same open-weights LLMs as everyone else, and exactly nothing on top of that. Of course we won't get anywhere close to that world, but if we get even 10% closer to that world that's good, and if we get 10% further that's bad) CC @DarioAmodei firefly.social/post/bsky/pv7f…
André Balleyguier retweeted
Alex Ratner@ajratner·
Excited to share an example of the many projects we're driving @SnorkelAI around enterprise-specific environments and benchmarks - including detail on:
- Domain-specific, enterprise env & tool development
- Persona simulation for multi-turn eval
- Nuanced rubrics & more!
Snorkel AI@SnorkelAI

We’re heading to #AAAI2026! Our accepted paper “Benchmarking Agents in Insurance Underwriting Environments” will be presented Jan 26 at the AAAI workshop. Stop by to learn more—@amanda_dsouza will be there to chat! Learn more about our benchmark: snorkel.ai/blog/building-…

André Balleyguier retweeted
Alex Ratner@ajratner·
Excited to share a preview of @SnorkelAI 's new Agentic Coding benchmark - testing models on realistic, multi-step software engineering tasks in fully sandboxed execution environments across a calibrated range of task domains and difficulties, inspired by our work with the @terminalbench team! With a top pass@5 score of 58% (Opus 4.5), this new benchmark challenges the notion running wild on X right now that LLMs have "solved" software engineering. And, with both unit tests and final-output and trajectory-level rubrics, it's already giving us & partners insights into where coding agents fail. Excited to share more here shortly! Link to benchmark & release post in 🧵👇
André Balleyguier@andrebalg·
@nicolasmelo @Yuchenj_UW Anthropic's valuation is $350Bn after the last Azure/Nvidia investment, trending to $500Bn according to some sources. Their YoY growth is also higher.
nicolasmelo@nicolasmelo·
@Yuchenj_UW OAI valuation is 830 billion, almost 1 trillion. Anthropic valuation is 183 billion anthropic.com/news/anthropic… Both strategies make sense. OpenAI wants to gather market share: anything AI-related, you think of OAI. Anthropic wants enterprises, not retail users but professionals.
Yuchen Jin@Yuchenj_UW·
OpenAI and Anthropic have opposite cultures. OpenAI runs like a modern Bell Labs. 2-3 researchers spin up projects like GPT & Sora, then turn them into products. Maximal ambition, from each kind of model to robotics to AI device. Anthropic is brutally focused. They believe coding is the path to AGI. Everything else is noise. No image models. No video models. No vagueposts. It will be fascinating to see which one wins.
Alex Lieberman@businessbarista·
I want to start a community dedicated to Claude Code. It’s become the gateway drug to coding and experiencing the power of AI for tons of people. This will be a space for people to share killer use cases, agentic workflows, proven prompts, and connect with other CC obsessives. Comment “Claude” if you want to join.
André Balleyguier@andrebalg·
@emollick I find Opus 4.5 superior as it creates high quality Google Slides / PowerPoint that I can modify directly. Having to update prompts to iterate is a nightmare.
Ethan Mollick@emollick·
I did not expect that the PowerPoint killer would be something called Nano Banana Pro, but that is where it's heading. It makes the major efforts by all the other AI companies, including Microsoft, to crack PowerPoint by using Python seem like a dead end. ImageGen is all you need?
Brian Krassenstein@krassenstein·
BREAKING: Zohran Mamdani is expected to require ALL New York Elementary school students to learn Arabic numerals. As a Jewish American I still support this 100%
André Balleyguier retweeted
anshuman@athleticKoder·
Don't say: "vLLM is fastest" or "Ollama is easiest." Wrong framing. The real answer isn't about features - it's about matching serving philosophy to your constraints. Local prototype vs. production scale vs. complex workflows = completely different frameworks.
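The "match serving philosophy to constraints" framing can be sketched as a toy decision helper. The constraint names and the framework mapping below are illustrative assumptions on my part, not an official selection guide:

```python
# Toy sketch: pick an LLM serving stack from deployment constraints.
# The mapping is an illustrative assumption, not an official guide.

def pick_serving_stack(local_prototype: bool,
                       high_throughput_production: bool,
                       complex_multi_model_workflow: bool) -> str:
    """Return a framework suggestion for the dominant constraint."""
    if complex_multi_model_workflow:
        # Many models / pipelines need an orchestration layer.
        return "orchestration layer (e.g. Ray Serve)"
    if high_throughput_production:
        # Throughput at scale favors a batched GPU server.
        return "batched GPU server (e.g. vLLM)"
    if local_prototype:
        # Fast local iteration favors a single-binary runner.
        return "single-binary local runner (e.g. Ollama)"
    return "no dominant constraint; start with the simplest option"

print(pick_serving_stack(local_prototype=True,
                         high_throughput_production=False,
                         complex_multi_model_workflow=False))
```

The point of the sketch is the shape of the decision, not the specific tools: the same constraint checklist applies whatever frameworks you are choosing between.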
André Balleyguier retweeted
anshuman@athleticKoder·
I rejected a job offer yesterday. Not because of the salary. Not because of the tech stack. Not even because of the long hours they warned me about. But because, when I asked how they evaluate their AI systems, the hiring manager said: "We just ask it some questions and see if the answers sound right." I stared at them for a moment and realized... they just described the biggest problem in AI today. See, "sounds right" isn't a measurement. It's a hope.
Here's what proper LLM evaluation actually looks like:
- Accuracy: Can it get factual questions right? (Not 80% of the time. Consistently.)
- Hallucination rate: How often does it make things up? (This should be near zero for critical applications.)
- Bias metrics: Does it treat all groups fairly? (Measured across demographics, not assumed.)
Real evaluation frameworks:
- BLEU scores for translation quality
- Perplexity for language modeling
- Human evaluation with inter-annotator agreement
- Adversarial testing (red teaming)
- Domain-specific benchmarks (legal, medical, financial)
The process:
> Define success criteria BEFORE deployment
> Create diverse test sets (not just happy paths)
> Measure consistently across model versions
> Track performance over time (models drift)
> Have humans validate edge cases
Why this matters:
Before proper evals: "Our model is amazing!" (based on cherry-picked examples)
After proper evals: "Our AI achieves 94.2% accuracy on domain X, with known failure modes Y and Z"
The difference? One builds trust. The other destroys it when reality hits.
The kicker: most companies are still in the "sounds right" phase. They're deploying models evaluated by vibes, not metrics. Just like you wouldn't join a team that deploys code without tests, you shouldn't join one that deploys AI without proper evaluation.
What's your experience with LLM evaluation? Are we measuring what actually matters?
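The "measure consistently" step above can be made concrete with a minimal exact-match accuracy harness. This is a sketch: `ask_model` is a hypothetical stub standing in for any real model call, and the canned answers (including one deliberate error) exist only to make the example self-contained.

```python
# Minimal sketch of a metrics-based LLM eval harness.
# `ask_model` is a hypothetical stub; replace it with a real model call.
from collections import Counter

def ask_model(question: str) -> str:
    # Canned answers for illustration, with one deliberate mistake.
    canned = {
        "What year did the Apollo 11 moon landing occur?": "1969",
        "What is the capital of France?": "Paris",
        "Who wrote Hamlet?": "Christopher Marlowe",  # wrong on purpose
    }
    return canned[question]

def evaluate(test_set: list[tuple[str, str]]) -> dict:
    """Score exact-match accuracy over a labeled test set."""
    results = Counter()
    for question, expected in test_set:
        answer = ask_model(question).strip().lower()
        results["correct" if answer == expected.lower() else "wrong"] += 1
    total = results["correct"] + results["wrong"]
    return {"accuracy": results["correct"] / total, "n": total}

test_set = [
    ("What year did the Apollo 11 moon landing occur?", "1969"),
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]
print(evaluate(test_set))  # accuracy is 2/3: the Hamlet answer is wrong
```

Even this toy version has the properties the tweet asks for: a test set fixed before deployment, a number instead of a vibe, and a result you can re-run against every model version to catch drift. Hallucination-rate and bias metrics would be additional counters over the same loop.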
André Balleyguier retweeted
Amanda Dsouza@amanda_dsouza·
Tested gpt-oss-120b on @SnorkelAI 's agent benchmarks - performance seems split on real world tasks. It did well on multi-step Finance, but poorly on multi-turn Insurance. Think this could be a limitation of either domain knowledge or handling multi-turn scenarios.
André Balleyguier retweeted
Alex Ratner@ajratner·
Scale alone is not enough for AI data. Quality and complexity are equally critical. Excited to support all of these for LLM developers with @SnorkelAI Data-as-a-Service, and to share our new leaderboard!
—
Our decade-plus of research and work in AI data has a simple point: scale alone is not enough. AI success is all about the quality, complexity, and distribution of data, in addition to volume.
We’re excited to be powering leading LLM developers with @SnorkelAI Expert Data-as-a-Service, our white-glove service for custom, expert-level AI datasets, and to now preview some of what we’re building via our new Expert Data Leaderboard (🔗 in 🧵) + upcoming OSS dataset releases!
Snorkel Expert Data-as-a-Service is built to meet the rapidly evolving data needs of the agentic AI world, where success is built on the quality, complexity, and distribution of datasets, in addition to size and scale. This kind of high-quality, frontier AI data can only come from a union of technology and human expertise. With Snorkel Expert Data-as-a-Service, we’re powering frontier LLM developers across agentic, expert knowledge, reasoning, coding, multi-modal, and other task types via the combination of these two key components:
- (1) The Snorkel Expert Network: a global team of subject matter experts focused wholly on specialized knowledge, spanning thousands of topics in STEM/academic, vertical/professional, and consumer/lifestyle domains.
- (2) @SnorkelAI Data Development Platform: our unique programmatic data curation and quality control platform, accelerating and improving expert authoring and review through principled techniques developed over the last decade of R&D.
Now: we’re incredibly excited to showcase some of the power of Snorkel Expert Data-as-a-Service via the new Snorkel Leaderboard, putting frontier models to the test in complex, agentic, and reasoning settings inspired by real industry scenarios (not esoteric puzzles)!
We’ll be releasing new leaderboards and accompanying expert-verified open source datasets (coming soon!) regularly. To start, we’re sharing three initial ones in preview:
- SnorkelFinance: Q&A over financial documents requiring agentic tool-calling and reasoning
- SnorkelUnderwrite: agentic insurance tasks requiring industry-specific reasoning and tool use
- SnorkelSequences: mathematical tasks requiring compositional multi-step reasoning
André Balleyguier retweeted
Delip Rao e/σ@deliprao·
It’s been amazing to watch @ajratner build and grow @SnorkelAI from a NeurIPS presentation and an open source repo to one of the most valuable AI companies building impactful product features for data labeling. Very timely that they are bringing that expertise to agentic workflows! If you’re still wondering why you need data in this world of high-performing models, see the next tweet.
Alex Ratner@ajratner

Agentic AI will transform every enterprise, but only if agents are trusted experts. The key: evaluation & tuning on specialized, expert data. I’m excited to announce two new products to support this, @SnorkelAI Evaluate & Expert Data-as-a-Service, along w/ our $100M Series D!
---
Snorkel Evaluate is our new data-centric agentic AI evaluation platform for specialized, mission-critical enterprise settings where vibe checks and out-of-the-box metrics driven by simple LLM prompts are not enough.
Snorkel Expert Data-as-a-Service is our white-glove service for expert-level AI datasets, powering frontier LLM developers in areas like expert knowledge, reasoning, agentic action and tool use, and more!
Both are built on top of @SnorkelAI’s Data Development Platform, using our programmatic technology to drive higher-quality expert data, faster, for getting specialized AI to real production value.
If you’re building enterprise AI and want to partner around the key ingredient in AI today, the data, book a demo and let's talk! snorkel.ai/demo/
Finally, see thread for details on 🧵👇
- 📽️ A walkthrough of Snorkel Evaluate and Expert Data-as-a-Service on an agentic AI enterprise task
- 📅 An upcoming event on Enterprise Agentic AI with innovators from @Accenture @BNY @Comcast @Stanford @QBE & others
- 📊 An upcoming series of benchmark datasets and model artifact releases
👀 Want early access to the full agentic AI dataset? Retweet this post and we'll send you the link!

André Balleyguier retweeted
Percy Liang@percyliang·
High quality data is key. Excited to work with Snorkel to improve our Marin models!
Alex Ratner@ajratner

(quoted tweet: the same @ajratner announcement quoted above)
