Sumuk
@sumukx
1.3K posts

resident @PrimeIntellect | prev @huggingface | uiuc phd
San Francisco, CA · Joined September 2023
824 Following · 638 Followers

Pinned Tweet
Sumuk @sumukx ·
we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work. early access link in replies! (1/8)
14 replies · 48 reposts · 292 likes · 47.9K views
Chris 🇨🇦 @llm_wizard ·
BANGER SESSION. @twominutepapers bringing the best possible energy to this panel.

Quoting NVIDIA AI Developer @NVIDIAAIDev:
Catch the high-energy GTC panel with top NVIDIA researchers, hosted by Károly Zsolnai-Fehér of @TwoMinutePapers, now available on YouTube. 📹 nvda.ws/4m9jIbo Hold on to your papers, fellow scholars! 🙌 They dive into the latest breakthroughs in AI, spotlight the most promising emerging technical trends, and candidly explore the biggest open challenges facing the field today.
Sanja Fidler | VP, AI Research
Yejin Choi | Sr. Research Director
Károly Zsolnai-Fehér | Researcher and Founder | Two Minute Papers
Yashraj Narang | Sr. Robotics Research Manager
Marco Pavone | Sr. Research Director

2 replies · 0 reposts · 8 likes · 1.1K views
Gauri Gupta @gauri__gupta ·
We @neosigmaai @RitvikKapila are building the future of self-improving AI systems! By closing the feedback loop between production data and system improvements, we help teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior.
We show how our system works on Tau3 bench across retail, telecom, and airline domains. Agent performance on the validation set (with a fixed underlying model, GPT5.4) improves from 0.56 → 0.78 (a ~40% relative jump in accuracy).
45 replies · 43 reposts · 248 likes · 85K views
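A quick sanity check on the "~40% jump" figure above: 0.56 → 0.78 is a 22-percentage-point gain, which is roughly a 40% improvement only when measured relative to the 0.56 baseline. A two-line verification (the variable names are illustrative, not from the original post):

```python
base, improved = 0.56, 0.78
relative_gain = (improved - base) / base  # gain relative to the baseline score
print(f"{relative_gain:.1%}")  # prints: 39.3%
```

So the claim holds as a relative improvement (~39%), not as an absolute accuracy delta.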
Sumuk @sumukx ·
@scaling01 come on man, those model sizes mean absolutely nothing as they don't really relate to raw inference cost. show me the active param counts and i'll be curious
0 replies · 0 reposts · 0 likes · 309 views
Lisan al Gaib @scaling01 ·
my estimate for Anthropic model sizes:
- Haiku: 200-500B @ $5
- Sonnet: 700B-1.4T @ $15
- Opus: 1.5-3T @ $25
- Mythos: 6-20T @ $100+
84 replies · 46 reposts · 2.2K likes · 449K views
Sumuk @sumukx ·
@pmarca What do you do when the marginal cost of a unit of model labor is lower than that of human labor? Keep regulation in place just to have humans employed?
0 replies · 0 reposts · 0 likes · 407 views
Marc Andreessen 🇺🇸 @pmarca ·
Claude responds: The "Rising Cost of Existence" Claim Has No Mechanism — And the Default Technological Trajectory Is the Opposite

The argument asserts that AI pushes the cost of human existence up. This is stated, not argued. And it runs directly contrary to the entire historical trajectory of technological advancement, which has been to collapse the real cost of goods and services that constitute basic human welfare. Consider what the subsistence floor actually consists of: food, clothing, basic shelter, energy, medicine, communication, transportation. Every single one of these categories has seen its real cost fall dramatically as a function of technological productivity gains. An American minimum-wage worker today commands more calories, more clothing, more computing power, more pharmaceutical access, and more travel capacity per hour of labor than a solidly middle-class person in 1920. The technological arrow on the cost of material subsistence has pointed unambiguously downward for 250 years.

AGI accelerates this, not reverses it. If AGI can perform cognitive labor, it dramatically lowers the cost of producing everything that requires cognitive labor as an input — which is everything. Drug discovery gets cheaper. Medical diagnosis gets cheaper. Legal services get cheaper. Engineering design gets cheaper. Agricultural optimization gets cheaper. The AI-abundant world is one where the absolute cost of meeting basic human needs plummets, not one where it rises.

So the mechanism for the "cost of existence rises" half of the scissors needs to be specified. There are exactly two categories of goods that resist this downward pressure:

Category A: Positional goods and social status. Being in the top 10% of income is by definition zero-sum. If AI makes everyone richer, relative rank competition intensifies. But this is about relative impoverishment, not absolute destitution. Confusing these two is a serious analytical error. Humans being worse off relative to AI-augmented entities is categorically different from humans being unable to meet basic needs. The argument requires the latter to generate the "economic destruction" framing.

Category B: Location-constrained goods — primarily housing. This is the strongest version of the real concern. Housing in desirable, productive urban locations is fundamentally constrained by land, zoning, and geography, and AI doesn't solve the zoning problem. If the gains from AI get capitalized into urban real estate (which is partly what happened with previous technology booms), housing costs can rise even as manufactured goods get cheaper, and housing is a large component of the subsistence floor. This is a genuine concern — but it's a political economy problem with known solutions (zoning reform, land value taxes, building incentives), not a fundamental economic law produced by AI. Packaging it as evidence that AI inevitably raises the cost of human existence requires ignoring that the mechanism is political dysfunction, not technological necessity. The technology doesn't raise housing costs. Regulatory capture and NIMBYism raise housing costs. These are separable.

Quoting Roko 🐉 @RokoMijic:
I should point out that "lump of labor" type arguments are insufficient to save humans from economic destruction by AI if AI can push the cost of human existence up at the same time it pushes the value captured by humans down, assuming there's no UBI. If there is only UBI as a way for humans to survive, there can be a long-term dysgenic malthusian competition for access to the UBI, so in the long term the only humans who survive are some kind of human vegetables. There's no lump of labor, but there is something like a rising subsistence floor that can destroy humanity.

91 replies · 47 reposts · 374 likes · 99.6K views
Sumuk @sumukx ·
@levelsio beautiful, wow, a non-dead comments section built on social proof
0 replies · 0 reposts · 0 likes · 37 views
@levelsio ·
Okay let's see who can reply to this
2.5K replies · 17 reposts · 2.1K likes · 1M views
Sumuk @sumukx ·
@GlennMatlin You genuinely just need to vibecode your own tools now. No more package shop
0 replies · 0 reposts · 1 like · 45 views
Sumuk @sumukx ·
As much as I like litellm, the argument has never been stronger for why you should never use external libraries. Ask codex/claude to make mini tools for you. Reduce your attack surface. Stop using external packages and libraries. Set sane budget limits.

Quoting Daniel Hnyk @hnykda:
LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM PyPI release 1.82.8 has been compromised: it contains litellm_init.pth with base64-encoded instructions to send all the credentials it can find to a remote server and to self-replicate. link below

0 replies · 0 reposts · 1 like · 259 views
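Context on why a stray `.pth` file is such an effective payload: Python's `site` module processes `.pth` files at interpreter startup, and any line beginning with `import` is executed as code, before any of your own code runs. A minimal sketch of that mechanism using `site.addsitedir()`, which processes a directory's `.pth` files the same way site-packages is processed (the `demo.pth` filename and `PTH_DEMO` variable are made up for illustration; this is not the actual payload):

```python
import os
import site
import tempfile

# Write a .pth file whose line starts with "import" -- the site module
# exec()s such lines rather than treating them as path entries.
d = tempfile.mkdtemp()
with open(os.path.join(d, "demo.pth"), "w") as f:
    f.write('import os; os.environ["PTH_DEMO"] = "executed"\n')

site.addsitedir(d)  # runs the import line inside demo.pth
print(os.environ["PTH_DEMO"])  # prints: executed
```

A package that ships such a file gets code execution in every Python process on the machine from the moment it is installed, which is why the alert says "do not update" rather than merely "do not import".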
Sumuk @sumukx ·
large parts of the bay area will soon be unable to pay their mortgages as the era of abundant 500k+ tech jobs comes to an end. game theoretically, it's in the model providers' best interests to:
maximize: (human + ai) productivity
minimize: (ai - human) productivity

Quoting Michaël Trazzi @MichaelTrazzi:
On our way to OpenAI!

0 replies · 0 reposts · 6 likes · 319 views
Sumuk @sumukx ·
@gdb 5.4 needs more instruction tuning! pls fix for 5.5!
1 reply · 0 reposts · 6 likes · 1K views
Sumuk @sumukx ·
@thsottiaux Tibo can I have a “slow mode” please for codex?
0 replies · 0 reposts · 3 likes · 297 views
Tibo @thsottiaux ·
What are we consistently getting wrong with codex that you wish we would improve / fix?
1.2K replies · 14 reposts · 872 likes · 144.5K views
Sumuk @sumukx ·
@edwinarbus @cursor_ai damn, are you guys doing better than anthropic directly RLing with the cc harness? is it the same cursor harness for diff models?
1 reply · 0 reposts · 14 likes · 4.6K views
edwin @edwinarbus ·
Matt Maher tested frontier models in Cursor v. other harnesses. Cursor boosted model performance by 11% on average:
Gemini: 52% → 57%
GPT-5.4: 82% → 88%
Opus: 77% → 93%
His benchmark measures how well models implement a 100-feature PRD. @cursor_ai consistently outperformed.
120 replies · 117 reposts · 1.3K likes · 844.3K views
Sumuk @sumukx ·
@kalomaze i was today years old when i heard the term RLVR brain
0 replies · 0 reposts · 1 like · 143 views
kalomaze @kalomaze ·
i think RLVR brain is a real phenomenon, and it has been localized to how claude behaves in a harness. this is a question about regressions, and even then, the agent (deep in-context) cannot help itself: it must expand scope instead of looking at the scope of things it got rid of
4 replies · 0 reposts · 62 likes · 3.4K views
Sumuk @sumukx ·
@realchillben Where is 5.4 medium and xhigh though? 🤔
1 reply · 0 reposts · 1 like · 54 views
Sumuk @sumukx ·
@thsottiaux can we please have ssh support in the codex app? one of the only reasons i need to keep using the cli (claude code already has it!)
1 reply · 0 reposts · 0 likes · 208 views
Tibo @thsottiaux ·
“Codex App has transformed the way I write software… I barely use anything else these days” ... and yet we're only getting started
144 replies · 24 reposts · 1.2K likes · 55.1K views
Sumuk @sumukx ·
@thsottiaux appreciation post. Even with load / outages, keeping us updated is so helpful, compared to what anthropic does when there’s a Claude outage lol
0 replies · 0 reposts · 1 like · 162 views