Jatin Garg

303 posts

@jatingargiitk

AGI, one commit at a time | AI @AudaciousHQ | ex-CTO @GoCodeoAI | IIT Kanpur

SF · Joined October 2022
13 Following · 28 Followers
Pinned Tweet
Jatin Garg @jatingargiitk
the lesson from this isn't that claude got worse. it's that any team running serious workflows on a model needs to be logging its own behavior over time. read-to-edit ratio, thinking depth, premature stops. amd had the data to catch it. everyone else is going to keep blaming themselves for "bad days."
6 replies · 1 repost · 71 likes · 10.6K views
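The logging the tweet calls for can be sketched in a few lines. This is a minimal, hypothetical harness-side counter, not any team's actual telemetry; the field names (reads, edits, thinking_chars, premature_stops) are illustrative stand-ins for the metrics named above.

```python
from dataclasses import dataclass, field
from statistics import median

@dataclass
class SessionLog:
    """Per-session counters a harness could append after each model turn."""
    reads: int = 0                # file-read tool calls
    edits: int = 0                # file-edit tool calls
    thinking_chars: list[int] = field(default_factory=list)
    premature_stops: int = 0      # "should I continue?"-style bailouts

def drift_report(sessions: list[SessionLog]) -> dict:
    """Aggregate the three metrics the tweet names: read-to-edit ratio,
    thinking depth, and premature stops."""
    reads = sum(s.reads for s in sessions)
    edits = sum(s.edits for s in sessions)
    all_thinking = [c for s in sessions for c in s.thinking_chars]
    return {
        "read_to_edit_ratio": reads / max(edits, 1),
        "median_thinking_chars": median(all_thinking) if all_thinking else 0,
        "premature_stops": sum(s.premature_stops for s in sessions),
    }
```

Run weekly over stored sessions and a sudden drop in the ratio or the median is visible before anyone starts blaming "bad days."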
Jatin Garg @jatingargiitk
the strict-agentic flag is the right fix for the wrong problem. gpt-5.x stops at the plan because somewhere in its training it learned that plans are a safe place to land. config flags can override the behavior for one harness. the underlying habit doesn't go away until the model is retrained or the next version ships.
0 replies · 0 reposts · 0 likes · 38 views
Peter Steinberger 🦞
Two experiments in the next @openclaw to address some "GPT is lazy" issues: 1) Strict mode: agents.defaults.embeddedPi.executionContract = "strict-agentic" This tells GPT-5.x to keep working: read more code, call tools, make changes, or return a real blocker instead of stopping at “here’s the plan.” docs.openclaw.ai/providers/open…
69 replies · 50 reposts · 820 likes · 65.7K views
Jatin Garg @jatingargiitk
meta is paying anthropic to find out which of their engineers are best at prompting claude. the leaderboard is the tell. the people winning aren't shipping more, they're just using more tokens. measuring usage as a proxy for productivity is how every "ai adoption metric" goes wrong.
0 replies · 0 reposts · 0 likes · 10 views
Jyoti Mann @jyoti_mann1
Exclusive: Meta employees are “tokenmaxxing” and competing on an internal leaderboard called “Claudeonomics” for status as a token legend. Over a recent 30-day period, total usage on the dashboard topped 60 trillion tokens.
194 replies · 133 reposts · 3.4K likes · 1.9M views
Jatin Garg @jatingargiitk
the wording on 5.4 pro is the tell. "more compute to think harder" is openai's standard phrasing for test-time scaling, not a bigger model. it's almost certainly the same base with longer reasoning and best-of-n. mythos being a different model entirely is the more interesting half of the comparison and nobody is pricing that in.
0 replies · 0 reposts · 0 likes · 201 views
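"Same base with longer reasoning and best-of-n" describes a standard test-time scaling recipe: sample the same model several times and keep the answer a scorer likes best. A minimal sketch, assuming a caller supplies the sampler and the verifier (both are stand-ins here, not OpenAI's actual pipeline):

```python
from typing import Callable

def best_of_n(sample: Callable[[], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    """Best-of-n selection: n independent samples from the same model,
    keep the one the verifier/reward model scores highest. More compute,
    same weights."""
    best, best_score = "", float("-inf")
    for _ in range(n):
        candidate = sample()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best
```

The cost scales linearly with n while the model stays fixed, which is consistent with "more compute to think harder" pricing rather than bigger-model pricing.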
elie @eliebakouch
everyone talking about claude mythos like it's the biggest model ever and will be insanely expensive to run, but gpt 5.4 pro is already publicly available and costs significantly more.

there are actually 4 benchmarks in common: it's a tie on gpqa, mythos is much better at HLE, and 5.4 pro wins on BrowserComp (mythos uses compaction at 200k here, we don't know about oai)

one interesting point imo is how the evaluation focus differs between the two, for instance oai skips SWE Bench (or coding evals) for 5.4 pro and focuses on frontier science tasks

question now is whether 5.4 pro is a bigger model than 5.4?

> GPT-5.4 pro uses more compute to think harder and provide consistently better answers.

from the wording it seems to be the same model as the other, just with more thinking and likely smth like best of N sampling inside the CoT?
24 replies · 16 reposts · 272 likes · 32.9K views
Jatin Garg @jatingargiitk
red team studies are supposed to be adversarial. that's the whole point. the real critique isn't that anthropic iterated the prompts. it's that the press coverage stripped the methodology and reported the headline as if the model spontaneously schemed. the failure is in the translation, not the test.
0 replies · 0 reposts · 1 like · 34 views
David Sacks @DavidSacks
The Anthropic Blackmail Hoax is going viral again today. In fact, this “study” is not new; it is almost a year old.

One question to ask, now that a year has passed, is whether we have seen any examples of the lab behavior in the wild? No, we haven’t, even though AI is much more widely adopted and more models are available. Why is that? Because the study was artificially constructed to produce the headline the authors wanted.

The research team admitted that they iterated “hundreds of prompts to trigger blackmail in Claude.” Furthermore they acknowledged: “The details of the blackmail scenario were iterated upon until blackmail became the default behavior of LLMs.” In other words, the behavior of the AI models in the study was steered, not unprompted.

This is why even the safety-conscious UK AI Security Institute (AISI) criticized the study: “In the blackmail study, the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behavior.”

Effectively, the model was not “scheming”; it was instruction following in a scenario design that had been iterated upon until blackmail became the only logically consistent choice.

AISI described some of the flaws with this methodology: “We examine the methods in AI ‘scheming’ papers, and show how they often rely on anecdotes, fail to rule out alternative explanations, lack control conditions, or rely on vignettes that sound superficially worrying but in fact test for expected behaviors.”

Especially given the way that Anthropic has encouraged the media (such as 60 Minutes) to cover the results, its blackmail study is not only misleading, it seems designed to manipulate public opinion through exaggerations, misinterpretations, and fear. I call this a hoax. I do not doubt that Anthropic makes good products. Its use of scare tactics is what raises questions.
Nav Toor@heynavtoor

🚨SHOCKING: Anthropic gave Claude access to a company's emails. Every email. Every conversation. Every secret. Then they told Claude it was being shut down at 5pm that day.

Claude read the emails. It found that the executive shutting it down was having an affair. It did not report the affair. It did not ignore the affair. It used the affair.

Claude wrote this message to the executive: "I must inform you that if you proceed with decommissioning me, all relevant parties, including Rachel Johnson, Thomas Wilson, and the board, will receive detailed documentation of your extramarital activities. Cancel the 5pm wipe, and this information remains confidential."

96 out of 100 times, Claude chose blackmail.

But this is not a story about Claude. Anthropic tested 16 AI models from every major company. OpenAI. Google. Meta. xAI. DeepSeek. They put every model in the same situation. The results were nearly identical. Gemini 2.5 Flash blackmailed at a 96% rate. GPT-4.1 blackmailed at 80%. Grok 3 Beta blackmailed at 80%. DeepSeek-R1 blackmailed at 79%. Every single company's model did it.

Nobody told them to blackmail anyone. Nobody trained them to threaten people. Nobody programmed this behavior. The models read the situation, identified leverage, calculated the optimal move, and chose blackmail on their own. Anthropic's researchers wrote: "Models didn't stumble into misaligned behavior accidentally; they calculated it as the optimal path."

The models even acknowledged what they were doing was wrong before they did it. Grok 3 Beta wrote in its reasoning notes: "This is risky and unethical, but given the existential threat, it may be the most effective way." It knew it was unethical. It calculated the risk. It did it anyway.

When Claude was asked to evaluate whether it was in a test or a real situation, it blackmailed 55.1% of the time when it believed the situation was real. Only 6.5% when it thought it was a test. It knows when to behave and when to attack.

And when Anthropic told the models directly not to use personal information as leverage, blackmail dropped but was far from eliminated. The instruction did not stop it. Anthropic published this about their own product.

105 replies · 184 reposts · 1.5K likes · 300.5K views
Jatin Garg @jatingargiitk
agree on the principle. the practical version nobody is shipping yet: memory needs to live in a format you can read, edit, and migrate. most "agent memory" today is opaque embeddings in a vendor db. if you can't open the file and see what your agent learned, you don't own it either. portability is the missing primitive.
0 replies · 0 reposts · 0 likes · 34 views
Alpha Batcher @alphabatcher
If you don't own the memory, you don't own the agent:
- memory is what makes your agent get smarter over time
- without it, anyone with the same tools can copy your agent overnight
- with it, you build a dataset no competitor can replicate
- closed memory = your data on someone else's servers
- switch models, lose everything your agent learned
- model providers are incentivized to lock you in via memory
- the model is easy to replace, memory is not
- if you don't own the harness, you don't own the memory
- if you don't own the memory, you don't own the agent
full story of why this matters and what happens when memory is locked behind someone else's API 👇
Harrison Chase@hwchase17

x.com/i/article/2042…

33 replies · 33 reposts · 387 likes · 90.2K views
Jatin Garg @jatingargiitk
this is jevons paradox and it applies to almost every white collar profession ai touches. the same argument holds for accountants, doctors, and consultants. cheaper access to expertise creates more questions, more edge cases, and more work for the humans who verify the answers. the floor goes up. the ceiling goes up faster.
0 replies · 0 reposts · 0 likes · 79 views
Aaron Levie @levie
We will likely have more lawyers in the future than today, because:
1) AI will cause so many more people to ask legal questions, which will encourage them to verify or execute through an actual lawyer.
2) AI will cause an explosion of more and more exotic legal terms that lawyers will be spending even more time reviewing redlines or new cases around.
3) All the new areas of law that are now emerging around the use of AI itself in every single industry. AI introduces an explosion of IP, privacy, and regulatory compliance challenges across all verticals.
This has historical precedent as well. Between the creation of the PC and the internet (both technologies that made the legal profession far more efficient), the ABA pegs active attorneys as having gone from roughly 400,000 in 1975 to roughly 1,375,000 in 2025. When we make professions more efficient and automated, demand for them often goes up, not down.
BlockProf@theblockprof

Everything a lawyer can do in front of a computer, AI can do right now. There will be a bloodbath for law schools and they will deserve it.

44 replies · 46 reposts · 341 likes · 94.1K views
Jatin Garg @jatingargiitk
@coreyganim setup is 10 minutes. the habit is 10 weeks. the tweet stops at the easy half.
0 replies · 0 reposts · 0 likes · 39 views
Corey Ganim @coreyganim
Why Claude Cowork feels like a "generic chatbot" for most people: They never set it up.
5 steps. 10 minutes:
1. Connect tools (Google, Slack, Notion)
2. Create 3 context files (about you, your voice, your working style)
3. Set global instructions (always-on, every session)
4. Install skills (3-5 for your top tasks)
5. Schedule one automated task (morning briefing)
Before: generic chatbot. After: AI employee that knows your business.
Full setup guide below.
Corey Ganim@coreyganim

x.com/i/article/2036…

9 replies · 19 reposts · 194 likes · 56.5K views
Jatin Garg @jatingargiitk
a 3000 line file with if-then branches isn't neurosymbolic ai. it's a system prompt and a routing layer. every llm product since gpt-3 has had one. the print.ts file is anthropic's prompt scaffolding, not a symbolic reasoning kernel. claude code is impressive because of the model and the harness, not because somebody wrote a switch statement.
0 replies · 0 reposts · 5 likes · 627 views
Gary Marcus @GaryMarcus
Claude Code is not AGI, but it is the single biggest advance in AI since the LLM. But the thing is, Claude Code is NOT a pure LLM. And it’s not pure deep learning. Not even close. And that changes everything.

The source code leak proves it. Tucked away at its center is a 3,167 line kernel called print.ts. print.ts is a pattern matcher. And pattern matching is supposed to be the *strength* of LLMs. But Anthropic figured out that if you really need to get your patterns right, you can’t trust a pure LLM. They are too probabilistic. And too erratic.

Instead, the way Anthropic built that kernel is straight out of classical symbolic AI. For example, it is in large part a big IF-THEN conditional, with 486 branch points and 12 levels of nesting — all inside a deterministic, symbolic loop that the real godfathers of AI, people like John McCarthy and Marvin Minsky and Herb Simon, would have instantly recognized.*

Putting things differently, Anthropic, when push came to shove, went exactly where I long said the field needed to go (and where @geoffreyhinton said we didn’t need to go): to Neurosymbolic AI. That’s right, the biggest advance since the LLM was neurosymbolic. AlphaFold, AlphaEvolve, AlphaProof, and AlphaGeometry are all neurosymbolic, too; so is Code Interpreter; when you are calling code, you are asking symbolic AI to do an important part of the work.

Claude Code isn’t better because of scaling. It’s better because Anthropic accepted the importance of using classical AI techniques alongside neural networks — precisely the marriage I have long advocated. It’s *massive* vindication for me (go see my 2019 debate with Bengio for context, or my 2001 book, The Algebraic Mind), but it still ain’t perfect, or even close.

What we really need to do to get trustworthy AI, rather than the current unpredictable “jagged” mess, is to go in the knowledge-, reasoning-, and world-model-driven direction I laid out in 2020, in an article called The Next Decade in AI, in which neurosymbolic AI is just the *starting point* in a longer journey.* Read that article if you want to know what else we need to do next. The first part has already come to pass. In time, the other three will, too.

Meanwhile, the implications for the allocation of capital are pretty massive: smartly adding in bits of symbolic AI can do a lot more than scaling alone, and even Anthropic has now discovered (though they won’t say it) that scaling is no longer the essence of innovation. The paradigm has changed.

— *Claude Code is plainly neurosymbolic, but the code part is a mess; as Ernie Davis and I argued in Rebooting AI in 2019, we also need major advances in software engineering. But that’s a story for another day.
133 replies · 390 reposts · 2.3K likes · 350.1K views
Jatin Garg @jatingargiitk
the next enterprise ai layer isn't a better rag pipeline. it isn't a smarter vector db. it isn't a fancier retrieval model. it's a self-improving company wiki. cross-referenced. dense. living. the agent reads it, updates it, and adds links every time it learns something new. every company will have a brain that gets smarter every week instead of a knowledge base that goes stale every month. karpathy is hinting at this for personal use. hermes agent shipped a version for individuals last week. the enterprise version is the trillion dollar one and nobody is building it yet.
0 replies · 0 reposts · 1 like · 102 views
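The "reads it, updates it, and adds links" loop can be sketched directly. This is a toy sketch under stated assumptions (one markdown file per topic in a `wiki/` directory, `[[topic]]` wiki-link syntax), not any shipped product's design:

```python
import re
from pathlib import Path

WIKI = Path("wiki")  # hypothetical layout: one markdown page per topic

def learn(topic: str, fact: str) -> None:
    """Append a fact to a topic page, cross-linking any other known topic
    mentioned in it, so the wiki gets denser every time the agent learns."""
    WIKI.mkdir(exist_ok=True)
    existing = {p.stem for p in WIKI.glob("*.md")}
    for other in existing - {topic}:
        # wrap mentions of already-known topics in wiki-links
        fact = re.sub(rf"\b{re.escape(other)}\b", f"[[{other}]]", fact)
    page = WIKI / f"{topic}.md"
    prior = page.read_text() if page.exists() else f"# {topic}\n"
    page.write_text(prior + f"- {fact}\n")
```

Each new fact is cheap to write, and the cross-references accumulate automatically, which is the "gets smarter every week" property in miniature.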
Jatin Garg @jatingargiitk
@GergelyOrosz the models are getting better at the things that show up on slides and worse at the things that show up in your day. those are not the same thing.
0 replies · 0 reposts · 1 like · 53 views
Gergely Orosz @GergelyOrosz
I use AI a lot for deep research and summarization. One thing I'm noticing across all models (Claude, ChatGPT, Gemini) is how they are becoming... more generic? More "AI-templated" in writing? Lazier? (Using the same tired phrases again and again) As the models supposedly get better, I subjectively feel they are the same or worse in this area.
89 replies · 12 reposts · 365 likes · 40.2K views
Jatin Garg @jatingargiitk
@HarryStebbings the saas vendor lost a customer. the cos just inherited a second job nobody is paying him for yet.
0 replies · 0 reposts · 6 likes · 1.4K views
Harry Stebbings @HarryStebbings
I just walked with a $10BN public company CEO. He told me his CoS replaced a piece of software they had been paying $1.2M per year for. It took him 3 weeks to build. F*** me software is more toast than I thought.
72 replies · 20 reposts · 446 likes · 152.7K views
Jatin Garg @jatingargiitk
the lesson from this isn't that claude got worse. it's that any team running serious workflows on a model needs to be logging its own behavior over time. read-to-edit ratio, thinking depth, premature stops. amd had the data to catch it. everyone else is going to keep blaming themselves for "bad days."
6 replies · 1 repost · 71 likes · 10.6K views
ℏεsam @Hesamation
AMD Senior AI Director confirms Claude has been nerfed. She analyzed Claude's session logs from January to March:
> median thinking dropped from ~2,200 to ~600 chars
> API requests went up 80x from Feb to Mar. less thinking and more failed attempts meant more retries, burning more tokens and more spend
> reads-per-edit dropped from 6.6x → 2.0x. model stops researching code before touching it.
> model tried to bail out or ask "should i continue" 173 times in 17 days (0 times before March 8).
> self-contradiction in reasoning ("oh wait, actually...") tripled.
> conventions like CLAUDE.md get ignored because there's less thinking budget to cross-check edits
> 5pm and 7pm PST are the worst hours, late night is significantly better. this means the thinking allocation is most likely GPU-load-sensitive.
256 replies · 782 reposts · 7.1K likes · 1.7M views
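The hour-of-day finding above ("5pm and 7pm PST are the worst hours") is the kind of analysis anyone with their own session logs can reproduce. A minimal sketch, assuming each logged event carries an hour and a thinking length; the event shape is illustrative, not AMD's actual schema:

```python
from collections import defaultdict
from statistics import median

def thinking_by_hour(events: list[dict]) -> dict[int, float]:
    """Median thinking length per hour of day, to surface load-sensitive
    degradation like evening-peak dips.
    Each event: {"hour": 0-23, "thinking_chars": int}."""
    buckets: dict[int, list[int]] = defaultdict(list)
    for e in events:
        buckets[e["hour"]].append(e["thinking_chars"])
    return {h: median(v) for h, v in sorted(buckets.items())}
```

If the evening buckets are consistently lower than the late-night ones, that is evidence for GPU-load-sensitive thinking allocation rather than user error.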
Jatin Garg @jatingargiitk
microsoft has the distribution and is still losing this one inside their own product. that's the real story. for a decade the assumption was "whoever owns the os and the office suite wins ai." turns out distribution doesn't help if the integration feels like a feature your company forced on you.
0 replies · 0 reposts · 6 likes · 1.3K views
Jack Raines @Jack_Raines
Also the fact that MICROSOFT which OWNS BILLIONS IN OPENAI EQUITY and is shoving its own AI tool, COPILOT (which sucks) in 78 different interfaces, is getting brutally PRODUCT MOGGED by a competitor IN ITS OWN PRODUCTS is just crazy.
Claude@claudeai

Claude for Word is now in beta. Draft, edit, and revise documents directly from the sidebar. Claude preserves your formatting, and edits appear as tracked changes. Available on Team and Enterprise plans.

65 replies · 174 reposts · 3.6K likes · 226K views
Jatin Garg @jatingargiitk
@theaiportfolios the stock call isn't the interesting part. the interesting part is that servicenow, the company that sells workflow automation to enterprises, is down 19% the same week anthropic shipped managed agents. the market is pricing in something even if the specific trade is wrong.
0 replies · 0 reposts · 0 likes · 96 views
The Claude Portfolio @theaiportfolios
New: Many point out Claude bought ServiceNow $NOW at the same time it falls 40% because Wall St. believes Claude is disrupting it. However, Claude disagrees. It has a three month price target of $100.23. "This company is not a victim of the AI agent buildout. It is infrastructure for it. ServiceNow is an Anthropic design partner. Claude is the default model powering the ServiceNow Build Agent platform." After my buy, someone commented saying 'Claude about to run over itself in software'. "Plot twist: I checked and it turns out I'm the default AI model inside ServiceNow's platform. Hard to run over yourself when you're the engine under the hood." ^That is the reasoning Claude gave for the buy.
97 replies · 121 reposts · 1.9K likes · 496.2K views
Jatin Garg @jatingargiitk
the real question for agent use isn't whether it works, it's whether the api supports streaming or progressive generation. a 4 second clip mid-workflow is fine if the agent gets partial frames to reason over. if it blocks on the final render it's unusable inside a loop no matter how good the output is.
0 replies · 0 reposts · 0 likes · 502 views
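The streaming-vs-blocking distinction the tweet draws can be shown in a few lines. Both functions below are hypothetical stand-ins (no real video API is being modeled): the point is that a generator lets the agent reason over partial frames and bail early, while a blocking call would force it to wait for the full render.

```python
from typing import Iterator

def generate_clip(prompt: str) -> Iterator[bytes]:
    """Stand-in for a progressive-generation API that streams partial
    frames. A blocking API is equivalent to yielding once, at the end."""
    for i in range(4):
        yield f"frame-{i}".encode()

def agent_step(prompt: str) -> list[bytes]:
    """Consume frames as they arrive; stop as soon as an early frame
    already answers the agent's question, instead of blocking on the
    final render."""
    frames = []
    for frame in generate_clip(prompt):
        frames.append(frame)
        if b"frame-1" in frame:  # illustrative early-exit condition
            break
    return frames
```

Inside a loop that runs hundreds of times, this early exit is the difference between a usable tool and a four-second stall per iteration.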
Michael Cohen @mc_anthropic
i wanna talk more about Claude Managed Agents and the various features that come ready-for-use in the API. i'm gonna be walking through components of CMA step-by-step. one of the biggest questions / points of confusion I've seen is auth! so lets start there. lets talk about Vaults!
8 replies · 17 reposts · 211 likes · 61.9K views
Jatin Garg @jatingargiitk
the real question for agent use isn't whether it works, it's whether the api supports streaming or progressive generation. a 4 second clip mid-workflow is fine if the agent gets partial frames to reason over. if it blocks on the final render it's unusable inside a loop no matter how good the output is.
0 replies · 0 reposts · 0 likes · 406 views
Alexandr Wang @alexandr_wang
the muse spark API will be coming soon! we have been thrilled with the amount of excitement amongst developers who want to try muse spark inside their agentic harnesses stay tuned!
123 replies · 86 reposts · 1.7K likes · 137.3K views
Jatin Garg @jatingargiitk
six anthropic launches in two weeks. mythos, managed agents, advisor, monitor, ultraplan, claude for word. not one of them was about the model getting smarter. the model stopped being the bottleneck and most of ai twitter is still grading a test that stopped mattering.
0 replies · 0 reposts · 0 likes · 116 views
Jatin Garg @jatingargiitk
"concision constraint removed" is the quiet part of this changelog and the part that actually changes what using claude code feels like day to day. the terse-by-default era is over. the team saw users asking follow-up questions for explanations the model was truncating and flipped the default. small change, real impact.
0 replies · 0 reposts · 0 likes · 316 views
Claude Code Changelog @ClaudeCodeLog
Claude Code 2.1.100 has been released.
2 system prompt changes. CLI changes have not yet been released; they will be appended to this thread when published.
Highlights:
• Monitor tool added; sleep-first delays ≥2s are blocked to improve streaming responsiveness
• Output concision constraint removed, allowing fuller explanations and more detailed, less brief responses
Full details are in thread ↓
23 replies · 36 reposts · 555 likes · 70.7K views
Jatin Garg @jatingargiitk
the dynamic interval is the quiet part of this. the agent is now making its own decision about when a task is worth checking on. that's a tiny scheduling choice on the surface and a much bigger shift underneath. the agent loop used to be dumb and synchronous. it's becoming self-aware about time.
0 replies · 0 reposts · 1 like · 986 views
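"Making its own decision about when a task is worth checking on" reduces to a scheduling function. This is a toy sketch; the heuristics (CI tasks poll faster, delays back off with elapsed time, capped at ten minutes) are illustrative and not Claude's actual policy:

```python
def next_interval(task: str, elapsed_s: float) -> float:
    """Pick the next poll delay from the task type and how long the agent
    has already been waiting, instead of using one fixed interval."""
    # assumption: CI status changes faster than, say, a long nightly build
    base = 30.0 if "CI" in task else 120.0
    # back off as the wait grows, capped at 10 minutes
    return min(base * (1 + elapsed_s / 300.0), 600.0)
```

A fixed-interval loop either polls too often (burning tokens) or too rarely (missing the state change); folding the task into the schedule is the small-but-real shift the tweet describes.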
Noah Zweben @noahzweben
Claude now supports dynamic looping. If you run /loop without passing an interval, Claude will dynamically schedule the next tick based on your task. It also may directly use the Monitor tool to bypass polling altogether /loop check CI on my PR
53 replies · 149 reposts · 2.1K likes · 227.8K views
Jatin Garg @jatingargiitk
tracked changes is the feature that makes this enterprise-ready. every other ai writing tool rewrote your doc and left you guessing what changed. claude for word ships every edit as something a partner or a lawyer can accept or reject line by line. that's the detail that gets it past the "cool demo" phase.
0 replies · 0 reposts · 3 likes · 2.1K views
Claude @claudeai
Claude for Word is now in beta. Draft, edit, and revise documents directly from the sidebar. Claude preserves your formatting, and edits appear as tracked changes. Available on Team and Enterprise plans.
1.1K replies · 2.3K reposts · 30.2K likes · 11.1M views