jeseem

61 posts

@jeseem

https://t.co/OD8I2wEU00

Joined March 2009
288 Following · 33 Followers
jeseem
jeseem@jeseem·
@Yuchenj_UW It’s a problem everywhere in the Bay Area. Almost no café is open in the evening.
English
0
0
0
36
Yuchen Jin
Yuchen Jin@Yuchenj_UW·
Guys, SF is a magical city. Today, a beautiful Wednesday, I had a 4:30pm meeting with a friend at a café… it was closed. We walked 10 blocks, every café was closed. Finally found Blue Bottle, it closed at 5:30. Can some YC company please build what SF people actually want?
English
155
17
1.3K
134.9K
jeseem
jeseem@jeseem·
@mitchellh Why not use both? I use one to code and the other to review: Opus 4.6 to code and Codex 5.3 to review, switching at times. It really helps, and the review model catches lots of issues.
English
0
0
0
35
Mitchell Hashimoto
Mitchell Hashimoto@mitchellh·
I know this is pretty well established at this point, but Codex 5.3 is a much more effective model than Opus 4.6. I went back and forth on both for a bit, but haven’t touched Opus at all now for a full week. First model to get me off of Opus… ever. Good job Codex team.
English
337
220
5.3K
1.1M
jeseem
jeseem@jeseem·
@karpathy Languages that LLMs are good at will thrive, not the ones developers are good at. LLMs will prefer more concise languages, to keep the attention window short, and languages that inherently protect against coding errors, where compilers catch a lot of issues.
English
0
0
0
23
Andrej Karpathy
Andrej Karpathy@karpathy·
I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest in upgrading legacy code bases in COBOL or etc. In particular, LLMs are *especially* good at translation compared to de-novo generation because 1) the original code base acts as a kind of highly detailed prompt, and 2) as a reference to write concrete tests with respect to. That said, even Rust is nowhere near optimal for LLMs as a target language. What kind of language is optimal? What concessions (if any) are still carved out for humans? Incredibly interesting new questions and opportunities. It feels likely that we'll end up re-writing large fractions of all software ever written many times over.
Thomas Wolf@Thom_Wolf

Shifting structures in a software world dominated by AI. Some first-order reflections (TL;DR at the end):

Reducing software supply chains, the return of software monoliths – When rewriting code and understanding large foreign codebases becomes cheap, the incentive to rely on deep dependency trees collapses. Writing from scratch ¹ or extracting the relevant parts from another library is far easier when you can simply ask a code agent to handle it, rather than spending countless nights diving into an unfamiliar codebase. The reasons to reduce dependencies are compelling: a smaller attack surface for supply chain threats, smaller packaged software, improved performance, and faster boot times. By leveraging the tireless stamina of LLMs, the dream of coding an entire app from bare-metal considerations all the way up is becoming realistic.

End of the Lindy effect – The Lindy effect holds that things which have been around for a long time are there for good reason and will likely continue to persist. It's related to Chesterton's fence: before removing something, you should first understand why it exists, which means removal always carries a cost. But in a world where software can be developed from first principles and understood by a tireless agent, this logic weakens. Older codebases can be explored at will; long-standing software can be replaced with far less friction. A codebase can be fully rewritten in a new language. ² Legacy software can be carefully studied and updated in situations where humans would have given up long ago. The catch: unknown unknowns remain unknown. The true extent of AI's impact will hinge on whether complete coverage of testing, edge cases, and formal verification is achievable. In an AI-dominated world, formal verification isn't optional—it's essential.

The case for strongly typed languages – Historically, programming language adoption has been driven largely by human psychology and social dynamics. A language's success depended on a mix of factors: individual considerations like being easy to learn and simple to write correctly; community effects like how active and welcoming a community was, which in turn shaped how fast its ecosystem would grow; and fundamental properties like provable correctness, formal verification, and striking the right balance between dynamic and static checks—between the freedom to write anything and the discipline of guarding against edge cases and attacks. As the human factor diminishes, these dynamics will shift. Less dependence on human psychology will favor strongly typed, formally verifiable and/or high performance languages.³ These are often harder for humans to learn, but they're far better suited to LLMs, which thrive on formal verification and reinforcement learning environments. Expect this to reshape which languages dominate.

Economic restructuring of open source – For decades, open-source communities have been built around humans finding connection through writing, learning, and using code together. In a world where most code is written—and perhaps more importantly, read—by machines, these incentives will start to break down.⁴ Communities of AIs building libraries and codebases together will likely emerge as a replacement, but such communities will lack the fundamentally human motivations that have driven open source until now. If the future of open-source development becomes largely devoid of humans, alignment of AI models won't just matter—it will be decisive.

The future of new languages – Will AI agents face the same tradeoffs we do when developing or adopting new programming languages? Expressiveness vs. simplicity, safety vs. control, performance vs. abstraction, compile time vs. runtime, explicitness vs. conciseness. It's unclear that they will. In the long term, the reasons to create a new programming language will likely diverge significantly from the human-driven motivations of the past. There may well be an optimal programming language for LLMs—and there's no reason to assume it will resemble the ones humans have converged on.

TL;DR:
- Monoliths return – cheap rewriting kills dependency trees; smaller attack surface, better performance, bare-metal becomes realistic
- Lindy effect weakens – legacy code loses its moat, but unknown unknowns persist; formal verification becomes essential
- Strongly typed languages rise – human psychology mattered for adoption; now formal verification and RL environments favor types over ergonomics
- Open source restructures – human connection drove the community; AI-written/read code breaks those incentives; alignment becomes decisive
- New languages diverge – AI may not share our tradeoffs; optimal LLM programming languages may look nothing like what humans converged on

¹ x.com/mntruell/statu… ² x.com/anthropicai/st… ³ wesmckinney.com/blog/agent-erg… ⁴ github.com/tailwindlabs/t…

English
701
656
8.1K
1.2M
jeseem
jeseem@jeseem·
How much time do you spend in the Claude or Codex planning phase? #Claude #codex
English
0
0
0
16
jeseem
jeseem@jeseem·
@deedydas Used by itself, Claude becomes a tool for faster coding, or for automating coding to some degree. But more work is needed to put the code in the context of the whole project.
English
0
0
0
15
Deedy
Deedy@deedydas·
I'm not saying *everything* is solved. But Opus 4.5 seems to be this giant step function from say ~60% to ~80% of tasks where even L5/6 engineers feel like they mostly check code, or reprompt. Their job is just prompting for 5-10min tasks in 3-4 git worktrees at a time.
English
15
6
198
38.3K
Deedy
Deedy@deedydas·
A few software engineers at some of the best tech cos told me this week "My entire job these days is prompting Cursor or Claude Code with Opus 4.5 to do what I need and sanity checking it." We've crossed some intangible threshold of AI generalizing to "most" software.
Deedy tweet media
English
429
452
5.5K
1M
jeseem
jeseem@jeseem·
@SemiAnalysis_ This is so spot-on. 50% MFU effectively means you are using only 50% of total max capacity. In any other cloud service, no one would accept such performance. Improving MFU for large clusters is one of the most urgent pieces of work needed. It's not an easy problem, but the rewards are clear.
English
0
0
0
25
SemiAnalysis
SemiAnalysis@SemiAnalysis_·
Everyone likes to compare chips using peak FLOPs, but the missing piece of the Performance-per-TCO formula is Model FLOPs Utilization (MFU). MFU is the real-world metric that determines how much compute you actually extract per dollar, and it varies wildly across users, workloads, cluster sizes, and software maturity. Most people don’t realize how sensitive it is: small changes in kernels, sharding strategies, or batch-size decisions can swing utilization by 2–3×.

MFU drops as clusters scale up because communication overhead grows faster than compute. A well-tuned single node might hit 40 to 50%, but at multi-thousand-GPU scale, diminishing returns take over. This is why even elite runs like Llama-3-405B only achieved ~43% BF16 MFU on 16,000 H100s (and that is for a dense model; MoE tends to add irregularity that reduces MFU further), and why poorly tuned workloads routinely sit in the 10 to 20% range. Precision also matters: FP8 MFU is far lower than BF16 MFU, and early-life platforms tend to have lower MFU before their software matures. Big labs do better - OpenAI, Anthropic, Meta, Google all have expert teams and custom kernels that push MFU higher than the average user.
SemiAnalysis tweet media
English
5
11
106
13.5K
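The arithmetic behind MFU figures like the ~43% above can be sketched in a few lines. This is a minimal sketch, not SemiAnalysis's methodology: it uses the standard dense-transformer approximation of ~6 × parameters × tokens FLOPs per training step (forward plus backward), and the step time below is an illustrative value chosen to land near the quoted figure, not a measured one.

```python
def mfu(params: float, tokens_per_step: float, step_time_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Fraction of peak FLOPs a training step actually achieves.

    Uses the ~6 * params * tokens approximation for dense-transformer
    training FLOPs (forward + backward pass).
    """
    achieved_flops_per_s = 6.0 * params * tokens_per_step / step_time_s
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


# Illustrative: a 405B-parameter dense model on 16,000 H100s
# (~989 TFLOP/s peak BF16 each), 16M tokens per step; the 5.7s
# step time is a made-up number that yields roughly 43% MFU.
u = mfu(params=405e9, tokens_per_step=16e6, step_time_s=5.7,
        num_gpus=16_000, peak_flops_per_gpu=989e12)
print(f"MFU = {u:.0%}")
```

The same formula also shows why the quoted 2–3× swings matter: because peak FLOPs are fixed, every second added to the step time by poor kernels or sharding comes straight out of the utilization numerator.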
jeseem
jeseem@jeseem·
@chelseabfinn Thank you. I was looking forward to watching this since I read about the Stanford course.
English
0
0
0
184
Arif Ahmad
Arif Ahmad@arif_ahmad_py·
We need more senior researchers camping out at their posters like this. Managed to catch 10 minutes of Alyosha turning @anand_bhattad’s poster into a pop-up mini lecture. Extra spark after he spotted @jathushan. Other folks in the audience: @HaoLi81 @konpatp @GurushaJuneja.
English
25
144
1.4K
201.5K
jeseem
jeseem@jeseem·
@blelbach I was waiting to try this after seeing the presentation at the last PyTorch conference. Is it supported in PyTorch as well?
English
0
0
0
24
jeseem
jeseem@jeseem·
@GoogleResearch This has definitely captured one of the concepts of human memory retention, imho. Ad orgs have probably used these and similar ideas to capture human attention. Very interesting work.
English
0
0
0
101
Google Research
Google Research@GoogleResearch·
Today at #NeurIPS2025, we present Titans, a new architecture that combines the speed of RNNs with the performance of Transformers. It uses deep neural memory to learn in real-time, effectively scaling to contexts larger than 2 million tokens. More at: goo.gle/3Kd5ojF
Google Research tweet media
English
57
265
1.8K
431.7K
Nouha Dziri
Nouha Dziri@nouhadziri·
Beyond delighted about our #NeurIPS2025 Best Paper Award 🥳🥳😍😍🥇🥇🥇
Nouha Dziri tweet media
English
39
28
809
37.3K
jeseem
jeseem@jeseem·
@dylan522p The change from NL36 to NL72 is interesting. Is there any work on offloading inter-rack traffic and aggregating at the rack level, e.g., by expanding NVIDIA SHARP?
English
0
0
2
1.5K
Dylan Patel
Dylan Patel@dylan522p·
Only NVIDIA, AWS, Google have successfully deployed rack scale architecture
Trn3 is the 2nd after Nvidia with switched scale up topology, which is better for frontier mixture of experts models
AWS Trainium3 course corrects on software
Switch design choices are... interesting
SemiAnalysis@SemiAnalysis_

AWS Trainium3 Deep Dive, A Potential Challenger Approaching, Step-Function Software & System Improvements, “Amazon Basics” GB200 NVL36x2, NL72x2/NL32x2 Scale Up Rack Architecture, Optimized Perf per TCO, Trainium4 newsletter.semianalysis.com/p/aws-trainium…

Coronado, CA 🇺🇸 English
12
22
336
124.9K
jeseem
jeseem@jeseem·
Some interesting presentations and paper talks at NeurIPS, including papers directly relevant to me, like multi-turn agent reasoning. One of the works trains RL on dynamic state data and multi-turn agents. #NeurIPS2025
English
0
0
1
124
jeseem
jeseem@jeseem·
@chrmanning Hi Prof. Manning, are you presenting any session at NeurIPS?
English
0
0
0
357
Christopher Manning
Christopher Manning@chrmanning·
I’m now in San Diego for NeurIPS too! 😆
English
10
6
326
24K
jeseem
jeseem@jeseem·
@Mascobot @a16z This is the best dev server option. Earlier this year, after struggling to find GPUs, I was very tempted to buy a 2× RTX 6000 server. This is way better, especially for training smaller models, or where RL is slow because the environment response is slow.
English
0
0
0
47
Marco Mascorro
Marco Mascorro@Mascobot·
Jensen came to our @a16z's Runtime event and he signed our very first personal GPU AI Workstation Founders Edition (4x RTX 6000 Pro Blackwell) "To a16z Builders of Tomorrow!" - Jensen
Marco Mascorro tweet media
Marco Mascorro@Mascobot

🚨 New: We built @a16z's personal GPU AI Workstation Founders Edition - 4x NVIDIA RTX 6000 PRO Blackwell Max-Q (384GB total VRAM) - 8TB of NVMe PCIe 5.0 storage - AMD Threadripper PRO 7975WX (32 cores, 64 threads) - 256GB ECC DDR5 RAM - 1650Watts at peak (runs on a standard 15Amp/120V circuit). For training, AI research, and deploying models locally. A datacenter-class AI rig you can keep under your desk. We are planning to make a limited number of these a16z AI Workstations. Build guide + how you can make your own 👇

English
30
30
333
78.7K
jeseem
jeseem@jeseem·
Recharging my car and me at the Tesla Diner on my way to NeurIPS.
jeseem tweet media
English
0
0
0
26
jeseem
jeseem@jeseem·
@yunwei37 Thanks for the excellent blog post; it's something I was looking for.
English
0
0
1
200
jeseem
jeseem@jeseem·
@GarrettDrinon The best swing imho was Metsera $MTSR. The entry was on the Novo bid, and the bidding war is still on.
English
1
0
1
270
Garrett Drinon
Garrett Drinon@GarrettDrinon·
continuing to be verrry selective with swing trades the volatility of market oscillations is telling me to wait sometimes it’s the action that carves out a top sometimes it’s just a pause could change on a dime but… my process tells me to shorten the timeframe, be patient, scalp out risk, and take the money when it’s there
English
2
1
44
4.9K
jeseem
jeseem@jeseem·
@ezyang Uneven sharding and FSDP, FSDP and CPU offloading, and device mesh / process group initialization: having these would be great for the podcast.
English
0
0
0
60
Edward Z. Yang
Edward Z. Yang@ezyang·
I've been brainstorming episodes for the next season of PyTorch Developer Podcast:
- DTensor StridedShard, FSDP-TP order
- Redistributing a DTensor
- Prefetching vs Bucketing
- History of FSDP in PyTorch
- Multiprocessing: DataParallel versus DistributedDataParallel
- Monarch
- Parallelism Zoo
- Mixture of Experts and Expert Parallelism
- The Peak Memory Triangle: Activations
- FSDP and CPU Offloading
- Overlap: How to get it (Prefetching, Pipelining, Async TP)
- Differentiable collectives and variance
- Local map: global versus local SPMD
- FSDP vs TP
- Symmetric memory
- Uneven sharding and FSDP
- LocalTensor
- Composable parallelism via DTensor
- Pipeline parallelism
- Functional collectives and wait
- Device mesh; process group initialization
- Distributed checkpointing
- Activation checkpointing
- Placement: Partial reductions
- Implicit versus explicit prefetching
- RNG in a distributed setting
- Distributed optimizers: ZeRO, Shampoo, Muon
- torchtitan
- torchrun / torchx
- Choosing your parallelism from first principles / roofline analysis
- Mixture of Experts: as large as possible, expert routing as fine as possible (the more sparsity the better) by minimizing hidden dim
- GB200
- MXFP8 (1x128, 128x128, transposes)
- Stats of a training job: loss curve, MFU, expert balance
- Bitwise determinism: when you can expect it
- Distributed inference
- RL from an infra perspective
- Basics of observability on jobs
- NCCL timeout
English
21
29
390
21.1K