jeseem

61 posts

@jeseem

https://t.co/OD8I2wEU00

Joined March 2009
288 Following · 33 Followers
jeseem
jeseem@jeseem·
@Yuchenj_UW It’s a problem everywhere in the Bay Area. Almost no café is open in the evening.
English
0
0
0
36
Yuchen Jin
Yuchen Jin@Yuchenj_UW·
Guys, SF is a magical city. Today, a beautiful Wednesday, I had a 4:30pm meeting with a friend at a café… it was closed. We walked 10 blocks, every café was closed. Finally found Blue Bottle, it closed at 5:30. Can some YC company please build what SF people actually want?
English
155
17
1.3K
134.9K
jeseem
jeseem@jeseem·
@mitchellh Why not use both? I use one to code and the other to review: Opus 4.6 to code and Codex 5.3 to review, switching at times. It really helps, and the review model catches lots of issues.
English
0
0
0
35
Mitchell Hashimoto
Mitchell Hashimoto@mitchellh·
I know this is pretty well established at this point, but Codex 5.3 is a much more effective model than Opus 4.6. I went back and forth on both for a bit, but haven’t touched Opus at all now for a full week. First model to get me off of Opus… ever. Good job Codex team.
English
337
220
5.3K
1.1M
jeseem
jeseem@jeseem·
@karpathy Languages that LLMs are good at will thrive, not the ones developers are good at. LLMs will prefer more concise languages, to keep the attention window short, and languages that inherently protect against coding errors, where compilers catch a lot of issues.
English
0
0
0
23
Andrej Karpathy
Andrej Karpathy@karpathy·
I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest in upgrading legacy code bases in COBOL or etc. In particular, LLMs are *especially* good at translation compared to de-novo generation because 1) the original code base acts as a kind of highly detailed prompt, and 2) as a reference to write concrete tests with respect to. That said, even Rust is nowhere near optimal for LLMs as a target language. What kind of language is optimal? What concessions (if any) are still carved out for humans? Incredibly interesting new questions and opportunities. It feels likely that we'll end up re-writing large fractions of all software ever written many times over.
Thomas Wolf@Thom_Wolf

Shifting structures in a software world dominated by AI. Some first-order reflections (TL;DR at the end):

Reducing software supply chains, the return of software monoliths – When rewriting code and understanding large foreign codebases becomes cheap, the incentive to rely on deep dependency trees collapses. Writing from scratch ¹ or extracting the relevant parts from another library is far easier when you can simply ask a code agent to handle it, rather than spending countless nights diving into an unfamiliar codebase. The reasons to reduce dependencies are compelling: a smaller attack surface for supply chain threats, smaller packaged software, improved performance, and faster boot times. By leveraging the tireless stamina of LLMs, the dream of coding an entire app from bare-metal considerations all the way up is becoming realistic.

End of the Lindy effect – The Lindy effect holds that things which have been around for a long time are there for good reason and will likely continue to persist. It's related to Chesterton's fence: before removing something, you should first understand why it exists, which means removal always carries a cost. But in a world where software can be developed from first principles and understood by a tireless agent, this logic weakens. Older codebases can be explored at will; long-standing software can be replaced with far less friction. A codebase can be fully rewritten in a new language. ² Legacy software can be carefully studied and updated in situations where humans would have given up long ago. The catch: unknown unknowns remain unknown. The true extent of AI's impact will hinge on whether complete coverage of testing, edge cases, and formal verification is achievable. In an AI-dominated world, formal verification isn't optional—it's essential.

The case for strongly typed languages – Historically, programming language adoption has been driven largely by human psychology and social dynamics. A language's success depended on a mix of factors: individual considerations like being easy to learn and simple to write correctly; community effects like how active and welcoming a community was, which in turn shaped how fast its ecosystem would grow; and fundamental properties like provable correctness, formal verification, and striking the right balance between dynamic and static checks—between the freedom to write anything and the discipline of guarding against edge cases and attacks. As the human factor diminishes, these dynamics will shift. Less dependence on human psychology will favor strongly typed, formally verifiable and/or high performance languages.³ These are often harder for humans to learn, but they're far better suited to LLMs, which thrive on formal verification and reinforcement learning environments. Expect this to reshape which languages dominate.

Economic restructuring of open source – For decades, open-source communities have been built around humans finding connection through writing, learning, and using code together. In a world where most code is written—and perhaps more importantly, read—by machines, these incentives will start to break down.⁴ Communities of AIs building libraries and codebases together will likely emerge as a replacement, but such communities will lack the fundamentally human motivations that have driven open source until now. If the future of open-source development becomes largely devoid of humans, alignment of AI models won't just matter—it will be decisive.

The future of new languages – Will AI agents face the same tradeoffs we do when developing or adopting new programming languages? Expressiveness vs. simplicity, safety vs. control, performance vs. abstraction, compile time vs. runtime, explicitness vs. conciseness. It's unclear that they will. In the long term, the reasons to create a new programming language will likely diverge significantly from the human-driven motivations of the past. There may well be an optimal programming language for LLMs—and there's no reason to assume it will resemble the ones humans have converged on.

TL;DR:
- Monoliths return – cheap rewriting kills dependency trees; smaller attack surface, better performance, bare-metal becomes realistic
- Lindy effect weakens – legacy code loses its moat, but unknown unknowns persist; formal verification becomes essential
- Strongly typed languages rise – human psychology mattered for adoption; now formal verification and RL environments favor types over ergonomics
- Open source restructures – human connection drove the community; AI-written/read code breaks those incentives; alignment becomes decisive
- New languages diverge – AI may not share our tradeoffs; optimal LLM programming languages may look nothing like what humans converged on

¹ x.com/mntruell/statu… ² x.com/anthropicai/st… ³ wesmckinney.com/blog/agent-erg… ⁴ github.com/tailwindlabs/t…

English
701
656
8.1K
1.2M
jeseem
jeseem@jeseem·
How much time do you spend in the Claude or Codex planning phase? #Claude #codex
English
0
0
0
16
jeseem
jeseem@jeseem·
@deedydas Used by itself, Claude becomes a tool for faster coding, or for automating coding to some degree. But more work is needed to put the code in the context of the whole project.
English
0
0
0
15
Deedy
Deedy@deedydas·
I'm not saying *everything* is solved. But Opus 4.5 seems to be this giant step function from say ~60% to ~80% of tasks where even L5/6 engineers feel like they mostly check code, or reprompt. Their job is just prompting for 5-10min tasks in 3-4 git worktrees at a time.
English
15
6
198
38.3K
Deedy
Deedy@deedydas·
A few software engineers at some of the best tech cos told me this week "My entire job these days is prompting Cursor or Claude Code with Opus 4.5 to do what I need and sanity checking it." We've crossed some intangible threshold of AI generalizing to "most" software.
Deedy tweet media
English
429
452
5.5K
1M
jeseem
jeseem@jeseem·
@SemiAnalysis_ This is so spot-on. 50% MFU effectively means you are using only 50% of total max capacity. In any other cloud service, no one would accept such performance. Improving MFU for large clusters is one of the most urgent pieces of work needed. It's not an easy problem, but the rewards are clear.
English
0
0
0
25
SemiAnalysis
SemiAnalysis@SemiAnalysis_·
Everyone likes to compare chips using peak FLOPs, but the missing piece of the Performance-per-TCO formula is Model FLOPs Utilization (MFU). MFU is the real-world metric that determines how much compute you actually extract per dollar, and it varies wildly across users, workloads, cluster sizes, and software maturity. Most people don’t realize how sensitive it is: small changes in kernels, sharding strategies, or batch-size decisions can swing utilization by 2–3×.

MFU drops as clusters scale up because communication overhead grows faster than compute. A well-tuned single node might hit 40 to 50%, but at multi-thousand-GPU scale, diminishing returns take over. This is why even elite runs like Llama-3-405B only achieved ~43% BF16 MFU on 16,000 H100s (and that is for a dense model; MoE tends to add irregularity that reduces MFU further), and why poorly tuned workloads routinely sit in the 10 to 20% range. Precision also matters: FP8 MFU is far lower than BF16 MFU, and early-life platforms tend to have lower MFU before their software matures. Big labs do better - OpenAI, Anthropic, Meta, Google all have expert teams and custom kernels that push MFU higher than the average user.
SemiAnalysis tweet media
English
5
11
106
13.5K
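The arithmetic behind MFU figures like the ~43% above can be sketched in a few lines. This is a minimal sketch, not SemiAnalysis's methodology: it uses the standard dense-transformer approximation of ~6 × parameters × tokens FLOPs per training step (forward plus backward), and the step time below is an illustrative value chosen to land near the quoted figure, not a measured one.

```python
def mfu(params: float, tokens_per_step: float, step_time_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Fraction of peak FLOPs a training step actually achieves.

    Uses the ~6 * params * tokens approximation for dense-transformer
    training FLOPs (forward + backward pass).
    """
    achieved_flops_per_s = 6.0 * params * tokens_per_step / step_time_s
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


# Illustrative: a 405B-parameter dense model on 16,000 H100s
# (~989 TFLOP/s peak BF16 each), 16M tokens per step; the 5.7s
# step time is a made-up number that yields roughly 43% MFU.
u = mfu(params=405e9, tokens_per_step=16e6, step_time_s=5.7,
        num_gpus=16_000, peak_flops_per_gpu=989e12)
print(f"MFU = {u:.0%}")
```

The same formula also shows why the quoted 2–3× swings matter: because peak FLOPs are fixed, every second added to the step time by poor kernels or sharding comes straight out of the utilization numerator.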
jeseem
jeseem@jeseem·
@chelseabfinn Thank you. I was looking forward to watching this since I read about the Stanford course.
English
0
0
0
184
Arif Ahmad
Arif Ahmad@arif_ahmad_py·
We need more senior researchers camping out at their posters like this. Managed to catch 10 minutes of Alyosha turning @anand_bhattad’s poster into a pop-up mini lecture. Extra spark after he spotted @jathushan. Other folks in the audience: @HaoLi81 @konpatp @GurushaJuneja.
English
25
144
1.4K
201.5K
jeseem
jeseem@jeseem·
@blelbach I was waiting to try this after seeing the presentation at the last PyTorch conference. Is it supported in PyTorch as well?
English
0
0
0
24
jeseem
jeseem@jeseem·
@GoogleResearch This has definitely captured one of the concepts of human memory retention, imho. Ad orgs have probably used these and similar ideas to capture human attention. Very interesting work.
English
0
0
0
101
Google Research
Google Research@GoogleResearch·
Today at #NeurIPS2025, we present Titans, a new architecture that combines the speed of RNNs with the performance of Transformers. It uses deep neural memory to learn in real-time, effectively scaling to contexts larger than 2 million tokens. More at: goo.gle/3Kd5ojF
Google Research tweet media
English
57
265
1.8K
431.7K
Nouha Dziri
Nouha Dziri@nouhadziri·
Beyond delighted about our #NeurIPS2025 Best Paper Award 🥳🥳😍😍🥇🥇🥇
Nouha Dziri tweet media
English
39
28
809
37.3K
jeseem
jeseem@jeseem·
@dylan522p The change from NL36 to NL72 is interesting. Is there any work on offloading inter-rack traffic and aggregating at the rack level, e.g., by expanding NVIDIA SHARP?
English
0
0
2
1.5K
Dylan Patel
Dylan Patel@dylan522p·
Only NVIDIA, AWS, Google have successfully deployed rack scale architecture
Trn3 is the 2nd after Nvidia with switched scale up topology, which is better for frontier mixture of experts models
AWS Trainium3 course corrects on software
Switch design choices are... interesting
SemiAnalysis@SemiAnalysis_

AWS Trainium3 Deep Dive, A Potential Challenger Approaching, Step-Function Software & System Improvements, “Amazon Basics” GB200 NVL36x2, NL72x2/NL32x2 Scale Up Rack Architecture, Optimized Perf per TCO, Trainium4 newsletter.semianalysis.com/p/aws-trainium…

Coronado, CA 🇺🇸 English
12
22
336
124.9K
jeseem
jeseem@jeseem·
Some interesting presentations and paper talks at NeurIPS, including papers directly relevant to me, like multi-turn agent reasoning. One of the works trains RL on dynamic state data and multi-turn agents. #NeurIPS2025
English
0
0
1
124
jeseem
jeseem@jeseem·
@chrmanning Hi Prof. Manning, are you presenting any session at NeurIPS?
English
0
0
0
357
Christopher Manning
Christopher Manning@chrmanning·
I’m now in San Diego for NeurIPS too! 😆
English
10
6
326
24K
jeseem
jeseem@jeseem·
@Mascobot @a16z This is the best dev server option. Earlier this year, after struggling to find GPUs, I was very tempted to buy a 2× RTX 6000 server. This is way better, especially for training smaller models, or where RL is slow because the environment response is slow.
English
0
0
0
47
Marco Mascorro
Marco Mascorro@Mascobot·
Jensen came to our @a16z's Runtime event and he signed our very first personal GPU AI Workstation Founders Edition (4x RTX 6000 Pro Blackwell) "To a16z Builders of Tomorrow!" - Jensen
Marco Mascorro tweet media
Marco Mascorro@Mascobot

🚨 New: We built @a16z's personal GPU AI Workstation Founders Edition - 4x NVIDIA RTX 6000 PRO Blackwell Max-Q (384GB total VRAM) - 8TB of NVMe PCIe 5.0 storage - AMD Threadripper PRO 7975WX (32 cores, 64 threads) - 256GB ECC DDR5 RAM - 1650Watts at peak (runs on a standard 15Amp/120V circuit). For training, AI research, and deploying models locally. A datacenter-class AI rig you can keep under your desk. We are planning to make a limited number of these a16z AI Workstations. Build guide + how you can make your own 👇

English
30
30
333
78.7K
jeseem
jeseem@jeseem·
Recharging my car and me at the Tesla Diner on my way to NeurIPS.
jeseem tweet media
English
0
0
0
26
jeseem
jeseem@jeseem·
@yunwei37 Thanks for the excellent blog post; it's something I was looking for.
English
0
0
1
200
jeseem
jeseem@jeseem·
@GarrettDrinon The best swing imho was Metsera $MTSR. The entry was on the Novo bid, and the bidding war is still on.
English
1
0
1
270
Garrett Drinon
Garrett Drinon@GarrettDrinon·
continuing to be verrry selective with swing trades the volatility of market oscillations is telling me to wait sometimes it’s the action that carves out a top sometimes it’s just a pause could change on a dime but… my process tells me to shorten the timeframe, be patient, scalp out risk, and take the money when it’s there
English
2
1
44
4.9K
jeseem
jeseem@jeseem·
@ezyang Uneven sharding and FSDP, FSDP and CPU offloading, and device mesh / process group initialization: having these would be great for the podcast.
English
0
0
0
60
Edward Z. Yang
Edward Z. Yang@ezyang·
I've been brainstorming episodes for the next season of PyTorch Developer Podcast:
- DTensor StridedShard, FSDP-TP order
- Redistributing a DTensor
- Prefetching vs Bucketing
- History of FSDP in PyTorch
- Multiprocessing: DataParallel versus DistributedDataParallel
- Monarch
- Parallelism Zoo
- Mixture of Experts and Expert Parallelism
- The Peak Memory Triangle: Activations
- FSDP and CPU Offloading
- Overlap: How to get it (Prefetching, Pipelining, Async TP)
- Differentiable collectives and variance
- Local map: global versus local SPMD
- FSDP vs TP
- Symmetric memory
- Uneven sharding and FSDP
- LocalTensor
- Composable parallelism via DTensor
- Pipeline parallelism
- Functional collectives and wait
- Device mesh; process group initialization
- Distributed checkpointing
- Activation checkpointing
- Placement: Partial reductions
- Implicit versus explicit prefetching
- RNG in a distributed setting
- Distributed optimizers: ZeRO, Shampoo, Muon
- torchtitan
- torchrun / torchx
- Choosing your parallelism from first principles / roofline analysis
- Mixture of Experts: as large as possible, expert routing as fine as possible (the more sparsity the better) by minimizing hidden dim
- GB200
- MXFP8 (1x128, 128x128, transposes)
- Stats of a training job: loss curve, MFU, expert balance
- Bitwise determinism: when you can expect it
- Distributed inference
- RL from an infra perspective
- Basics of observability on jobs
- NCCL timeout
English
21
29
390
21.1K