Brian Kurtz

334 posts

@Math4Good

Accelerating inference @positron_ai.

Joined June 2020
154 Following · 47 Followers
Brian Kurtz @Math4Good
@tunguz So you think models will get smaller and denser over time?
0 replies · 0 retweets · 0 likes · 37 views
Brian Kurtz @Math4Good
@FelixCLC_ Put another way: show me your attention perf, not your expert FFN perf.
0 replies · 0 retweets · 0 likes · 37 views
@fclc @FelixCLC_
Rule of thumb: any HW performance claim on LLMs >=100B made at less than ~32K tokens is a waste of your time and everyone else's. Nvidia has brilliant people who know this... Even an explore sub-agent wants a good 8K+ these days.
2 replies · 4 retweets · 51 likes · 4.4K views
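Why the 32K threshold matters: per token, FFN FLOPs are fixed by (active) parameter count, while attention work grows with context length, so a short-context benchmark mostly measures FFN throughput. A back-of-envelope sketch, with illustrative GPT-3-class dimensions that are my assumption, not @fclc's numbers:

```python
# Per-token FLOPs split between FFN and attention as context grows.
# Model dims are illustrative (roughly GPT-3-class); counts use the usual
# 2*MACs approximation, not exact operator tallies.
n_layers, d_model = 96, 12288
# FFN per token: up- and down-projection through a 4x hidden layer.
ffn_flops = n_layers * 2 * (2 * d_model * 4 * d_model)
for seq_len in (512, 8_192, 32_768):
    # Attention per token: QK^T plus attn*V over seq_len keys per layer.
    attn_flops = n_layers * 4 * d_model * seq_len
    share = attn_flops / (attn_flops + ffn_flops)
    print(f"seq={seq_len:>6}: attention is {share:5.1%} of per-token FLOPs")
# ~1% at 512 tokens, ~40% at 32K. (In decode the same scaling shows up as
# KV-cache bytes read per token.) Short-context numbers flatter FFN-heavy HW.
```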
Brian Kurtz @Math4Good
@IanCutress Smokey and the Groq. A whole lot of FFN and not a lot of attention.
0 replies · 0 retweets · 0 likes · 1.1K views
Dr. Ian Cutress @IanCutress
OK, here we go. Same architecture, scaled up. 500 MB SRAM per chip. Jensen saying up to 25% of your datacenter could be Groq. 8-way systems for 4 GB SRAM. Uses Dynamo for attention. Decode only. Working over Ethernet in a special mode to halve latency. Samsung LP4X. Ships in 2H/3Q.
12 replies · 44 retweets · 454 likes · 66.1K views
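The capacity arithmetic behind those figures, as a sketch: 8 chips at 500 MB is the quoted 4 GB per system, and scaling up shows why weight capacity, not compute, becomes the constraint. The 8-bit-weights assumption (1 GB per billion params) is mine, not from the tweet:

```python
# Capacity check on the quoted figures: 500 MB SRAM per chip, 8-way systems.
sram_per_chip_gb = 0.5
system_sram_gb = 8 * sram_per_chip_gb   # 4 GB per system, as in the tweet
print(f"per-system SRAM: {system_sram_gb} GB")

# Chips needed to hold a model's weights entirely on-chip, assuming 8-bit
# weights (1 GB per billion params) -- my assumption, not from the tweet.
for params_b in (8, 120, 1000):
    chips = params_b / sram_per_chip_gb
    print(f"{params_b:>5}B params -> {chips:>6,.0f} chips ({chips / 8:,.0f} systems)")
```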
Brian Kurtz retweeted
Brian Kurtz @Math4Good
@jukan05 So roughly 2TB weight capacity in total? Surprisingly low.
0 replies · 0 retweets · 0 likes · 662 views
Jukan @jukan05
Groq, AI Chip Startup Acquired by NVIDIA, to Ramp Up Production at Samsung Foundry

Groq, the AI chip startup that NVIDIA effectively acquired for approximately $20 billion (around 29 trillion won), has reportedly requested a production increase from Samsung Electronics' foundry (contract manufacturing) division. As demand grows for inference-optimized AI chips capable of maximizing power efficiency, Samsung Foundry is expected to deepen its collaboration with Groq and accelerate improvements to its profitability.

According to industry sources on the 9th, Groq has recently decided to increase its wafer-based production volume at Samsung Foundry from approximately 9,000 wafers to around 15,000 wafers. While last year's production was essentially at the sample chip level — aimed at evaluating whether the chips could be effectively used for AI inference — this year is seen as marking the early stages of full-scale mass production for commercial deployment.

Groq is the AI chip startup that NVIDIA reportedly acquired through an indirect structure in December of last year for approximately $20 billion. Rather than taking direct managerial control, NVIDIA announced it would partner with Groq through a "non-exclusive technology license agreement." Groq CEO Jonathan Ross and other executives are said to have joined NVIDIA following the license deal, tasked with integrating Groq's chip designs into NVIDIA's products. The strategy is understood to have been chosen so that NVIDIA could absorb key talent and achieve an acquisition-equivalent outcome while sidestepping antitrust scrutiny.

The process of advancing AI models is typically divided into two phases: "training" and "inference." Training is the stage in which a model "learns" patterns from large volumes of data, while inference is the process of using a trained model to "derive" predictions or conclusions from new data. Companies like NVIDIA and AMD, which currently dominate the AI chip market, mass-produce chips specialized for training. However, growing concerns over excessive power consumption and high chip costs are driving increasing demand for inference-optimized AI chips capable of running AI models more efficiently.

The prevailing view is that NVIDIA — already dominant in the training chip market — pursued the indirect acquisition of Groq in order to extend its ecosystem into the inference market as well. While the volume Groq has commissioned from Samsung is not large, the analysis is that Samsung Foundry aggressively pursued the order as a foundation for securing future inference chip business. In addition to Groq, Samsung Foundry is also the sole manufacturer of processors for HyperAccel, a domestic inference AI chip startup. Samsung produces AI chips for both Groq and HyperAccel on its 4-nanometer (nm) process node.

A semiconductor industry official noted: "The 4nm process that Samsung Foundry uses to mass-produce Groq's AI chips incorporates a wide range of improvements aimed at enhancing chip performance. Given that the process carries a high unit cost and that 4–5nm demand is the strongest in the industry, winning this business is also meaningful as a reference win to remain competitive against TSMC. With NVIDIA entering the AI chip market and Groq scaling up production, expectations are growing that the inference AI chip market is on the verge of a full-scale breakout."
Meanwhile, market interest in inference-optimized AI chips is intensifying amid reports that NVIDIA plans to unveil an inference-specialized chip at GTC 2026 based on Groq's chip architecture. Industry observers expect NVIDIA to leverage Groq's inference chip design — which uses SRAM (Static RAM) in place of the High Bandwidth Memory (HBM) found in conventional AI chips. Replacing HBM with SRAM in AI chips is said to offer advantages including faster data transfer speeds, improved power efficiency, and lower chip costs.
21 replies · 50 retweets · 463 likes · 95.8K views
Brian Kurtz @Math4Good
@chamath What's the realistic market size, though?
0 replies · 0 retweets · 0 likes · 77 views
Chamath Palihapitiya @chamath
I plan to buy and deploy large fleets around the country when possible. Should pay back and be positive FCF < 2 years…
Teslaconomics @Teslaconomics
I plan on owning my own Tesla Robotaxi fleet one day. And the more I run the numbers, the more I realize this new business could become one of the most powerful income opportunities I've ever seen. This is how I'm thinking about it.

Based on many analyst models and Tesla's long-term vision, a reasonable base-case assumption is about ~$30,000 per year in net profit per Robotaxi to the owner. This is after things like Tesla's platform fee, charging, tires, maintenance, insurance, and cleaning. Of course, the network is still early and Tesla is just beginning to roll this out in pilot programs in a few cities, so there are no official real-world owner earnings yet... but using reasonable assumptions around utilization, pricing per mile, and operating costs, the math starts to get really interesting.

If one Robotaxi can earn around $30,000 per year, here's what a fleet might look like:
• $100,000 per year → about 4 Robotaxis
• $500,000 per year → about 17 Robotaxis
• $1,000,000 per year → about 34 Robotaxis

It may sound a bit crazy at first, but when you break it down, it starts to make more sense. These vehicles could potentially drive 50,000 to 100,000+ miles per year in high-demand areas. If the economics land somewhere around $0.25-$0.50 profit per mile after all costs, you end up right around that ~$30k per vehicle per year range.

And remember, Tesla's Robotaxi network is going to work a lot like Airbnb for cars. You add your vehicle to the network; Tesla handles the software, routing, payments, and rider experience, and takes a platform fee (often modeled around 25-35%). The owner keeps the rest after operating costs.

Another thing that makes this interesting is the expected cost of the vehicles themselves. Tesla has talked about the purpose-built Cybercabs costing roughly $25k-$30k, and Elon told me production is starting in 1 month! If that's even close to reality, a fleet capable of generating around $1 million per year could theoretically cost somewhere around $850k-$1M in vehicles. That ROI is pretty freakin good!

Now to be clear, none of this is guaranteed. I'm just thinking out loud and sharing it with you... a lot still depends on regulations, how fast unsupervised FSD scales, demand in each city, insurance costs, and how Tesla structures the network. But if the system works the way Elon has described it for years, owning a Robotaxi fleet could become one of the most powerful forms of passive income I've ever seen. And I plan on sharing the numbers with everyone on 𝕏 when the day comes.

Personally, that's why I'm paying such close attention. Because one day, owning a fleet of autonomous Teslas working for me 24/7 might be the modern version of owning a rental property, except instead of tenants, you've got robots driving people around all day while you sleep. This next book of Tesla is going to be so exciting!

690 replies · 516 retweets · 6.5K likes · 1.4M views
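The quoted post's arithmetic is internally consistent; here it is as a sketch, using only the post's own assumptions (the $30k/car net, the $0.25-$0.50/mile margin, and the $25k-$30k Cybercab price are its claims, not verified economics):

```python
import math

# Fleet math from the quoted post, using its own (unverified) assumptions.
annual_miles = 75_000        # midpoint of the claimed 50k-100k+ miles/year
profit_per_mile = 0.40       # inside the claimed $0.25-$0.50 range
per_car = annual_miles * profit_per_mile   # -> $30,000/year, the post's figure

for target in (100_000, 500_000, 1_000_000):
    cars = math.ceil(target / per_car)
    print(f"${target:>9,}/yr -> ~{cars} Robotaxis")   # -> 4, 17, 34

# Fleet cost for the $1M/yr case at the claimed ~$27.5k midpoint per Cybercab:
print(f"fleet cost: ${34 * 27_500:,}")   # ~$935k, hence the ~1-year-payback pitch
```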
Brian Kurtz retweeted
Mitesh @mitesh711
We will be the world's first terabyte-plus memory density silicon and will be in production in 2027. Another cool feature Weaver allows us is a configurable silicon SKU for the amount of memory: instead of only one set amount of memory per chip, we can have anywhere from 576GB to 2304GB per chip based on the customer's application, and this can be done at system build-out time.
Ben Pouladian @benitoz
$CRDO Q3 call revealed two things the market is sleeping on:
Weaver gearbox: 10x memory IO density. Positron building a 2TB inference XPU on it for speed!
Lasers are out: ZeroFlap optics 1000x more reliable, half the power. Production ramp Q1 FY27.
Listen Credo:

3 replies · 4 retweets · 23 likes · 2.9K views
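A sketch of what "configurable SKU at system build-out time" could look like. The 576 GB granularity (1-4 units, since 2304 = 4 × 576) is my inference from the quoted endpoints; the tweet only states the range:

```python
# Build-time memory SKU picker. The 576 GB step (1-4 units) is my inference
# from the quoted 576-2304 GB range, not something the tweet states.
BASE_GB = 576
MAX_UNITS = 4   # 4 * 576 GB = 2304 GB, the quoted ceiling

def pick_sku(required_gb: int) -> int:
    """Smallest configurable capacity that covers the requested amount."""
    for units in range(1, MAX_UNITS + 1):
        if units * BASE_GB >= required_gb:
            return units * BASE_GB
    raise ValueError(f"{required_gb} GB exceeds max SKU ({MAX_UNITS * BASE_GB} GB)")

print(pick_sku(1000))   # -> 1152
print(pick_sku(2000))   # -> 2304
```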
Kal Grinberg @KalGrinberg
I spec'd out the mission... Droid is warning me that this is an enormous project. I like the sound of that.
habibi @habibislop
@KalGrinberg @FactoryAI Build a high-fidelity 3d simulation of the earth, outer space, and the moon, and use the codebase for the Apollo 11 Guidance Computer (linked below) to recreate the moon landing within that simulation github.com/chrislgarry/Ap…

3 replies · 2 retweets · 22 likes · 1.2K views
Suhail @Suhail
It feels like someone should make a post-merge git hook that asks the AI model to look at the diff of a merged PR and update the repo's various READMEs and other documentation, so an LLM can write code and reference things faster rather than constantly reading every single line of source code that might be relevant. The agents need their own docs.
63 replies · 9 retweets · 354 likes · 53.8K views
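A minimal sketch of what that hook could look like: a `post-merge` script that grabs the merge's diff and asks a model to propose doc updates. `ask_llm` and the output filename are hypothetical placeholders, not an existing tool or API:

```python
#!/usr/bin/env python3
# Sketch of a .git/hooks/post-merge hook: collect the merged diff and ask a
# model to propose documentation updates for agents.
import subprocess
from pathlib import Path

def merged_diff() -> str:
    # ORIG_HEAD is set by git merge; diff it against the new HEAD.
    return subprocess.run(
        ["git", "diff", "ORIG_HEAD", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stub -- wire up your model API")

if __name__ == "__main__":
    docs = "\n\n".join(p.read_text() for p in Path(".").rglob("*.md"))
    prompt = (
        "Given this merged diff, update the repo's docs so an LLM agent can "
        "navigate the codebase without reading every source file.\n\n"
        f"DIFF:\n{merged_diff()}\n\nCURRENT DOCS:\n{docs}"
    )
    Path("AGENTS-suggested.md").write_text(ask_llm(prompt))
```

Writing to a suggestion file rather than committing doc changes directly would keep a human in the loop.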
Brian Kurtz @Math4Good
@benitoz Groq's compiler is built on the premise of a purely deterministic graph.
0 replies · 0 retweets · 0 likes · 48 views
Ben Pouladian @benitoz
Groq isn't about SRAM. It's the compiler NVIDIA paid $20B for: the hardest compiler ever built. Schedules everything before silicon wakes up. They have the exact IP to fix Groq's weaknesses. Dedicated inference engine. No CoWoS. No HBM. Additive TAM. Custom ASIC & TPU boyz scared.
Ben Pouladian @benitoz
lol WSJ don’t know @insane_analyst GTC gonna be epic 🚀🚀

10 replies · 12 retweets · 139 likes · 23.6K views
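To make "schedules everything before silicon wakes up" concrete: a fully deterministic dataflow compiler can assign every op a fixed cycle at compile time because every latency is known, which is the premise Brian points at above. A toy scheduler, not Groq's actual compiler:

```python
# Toy static scheduler: every op gets a fixed start cycle at compile time,
# assuming known, fixed latencies -- the "purely deterministic graph" premise.
LATENCY = {"load": 4, "matmul": 8, "add": 1, "store": 4}  # made-up cycle counts

def schedule(graph: dict[str, tuple[str, list[str]]]) -> dict[str, int]:
    """graph: node -> (op_kind, [dependencies]). Returns node -> start cycle."""
    start, finish = {}, {}
    for node, (kind, deps) in graph.items():  # assumes topological order
        begin = max((finish[d] for d in deps), default=0)
        start[node] = begin
        finish[node] = begin + LATENCY[kind]
    return start

g = {
    "w":   ("load", []),
    "x":   ("load", []),
    "y":   ("matmul", ["w", "x"]),
    "b":   ("load", []),
    "z":   ("add", ["y", "b"]),
    "out": ("store", ["z"]),
}
print(schedule(g))  # every cycle is known before execution begins
```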
Brian Kurtz retweeted
Joe Fioti @joefioti
Sneak peek at Luminal Inference OS, here running a vllm-style LLM inference server. It's running slowed down for visualization purposes. Compute graphs are compiled and run near-roofline thanks to the Luminal compiler.
4 replies · 8 retweets · 64 likes · 5.8K views
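"Near-roofline" refers to the roofline model: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch with made-up hardware numbers (not Luminal's):

```python
# Roofline model: attainable FLOP/s given a kernel's arithmetic intensity
# (FLOPs per byte moved). Hardware numbers below are illustrative only.
PEAK_FLOPS = 100e12   # 100 TFLOP/s compute ceiling (made up)
PEAK_BW = 2e12        # 2 TB/s memory bandwidth (made up)

def attainable(intensity_flops_per_byte: float) -> float:
    return min(PEAK_FLOPS, PEAK_BW * intensity_flops_per_byte)

for ai in (1, 10, 50, 100):
    frac = attainable(ai) / PEAK_FLOPS
    print(f"AI={ai:>3} FLOP/byte -> {frac:5.1%} of peak")
# Below the ridge point (50 FLOP/byte here) a kernel is bandwidth-bound; a
# compiler hits "near-roofline" by keeping kernels at or above that point,
# or by saturating bandwidth when it can't.
```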
Brian Kurtz @Math4Good
@FactoryAI Love you guys! Curious how you might manage/nudge missions over time. So much can be accomplished with up-front planning, but it's hard to control/track once things are in flight. Curious to see how you can monitor while things are progressing and nudge if needed.
1 reply · 0 retweets · 4 likes · 1.8K views
Factory @FactoryAI
Droids can now pursue goals autonomously over multi-day horizons. You describe what you want, approve the plan, and come back to finished work. We call these Missions.
47 replies · 73 retweets · 829 likes · 370.3K views
Balaji @balajis
AI is amazing for small-TAM custom software. Indeed, the smaller the market, the more amazing it is. Because small markets typically don’t support the costs of software development.
187 replies · 171 retweets · 2.2K likes · 198K views
Dan Hockenmaier @danhockenmaier
This piece shows a profound lack of understanding of how marketplaces work and why they are defensible.

"A competent developer could deploy a functional competitor in weeks, and dozens did, enticing drivers away from DoorDash and Uber Eats by passing 90-95% of the delivery fee through to the driver."

Anyone could have done that at any time in the last ten years. Why was no one able to? Because the hard part has nothing to do with building the app or attracting the drivers. The hard part is building a liquid marketplace with all of the best supply and a massive series of optimizations and investments to drive down prices and delivery times and drive up reliability and quality.

DoorDash and Eats have built this when no one else could, and they will not allow agents to transact on their apps, nor will they have a legal requirement to allow it. But the real story isn't as sensational, so it doesn't get the engagement.
42 replies · 12 retweets · 598 likes · 563.8K views
Citrini @citrini
JUNE 2028. The S&P is down 38% from its highs. Unemployment just printed 10.2%. Private credit is unraveling. Prime mortgages are cracking. AI didn't disappoint. It exceeded every expectation. What happened? citriniresearch.com/p/2028gic
1.9K replies · 4.3K retweets · 27.9K likes · 28.6M views
Jordan Nanos @JordanNanos
8 exaFLOPs = 64 wafers
Cerebras @cerebras
Proud to partner with @G42ai and @mbzuai (Mohamed bin Zayed University of Artificial Intelligence) to deliver a national-scale AI supercomputer in India with 8 exaflops of compute capacity. This cluster is designed to support researchers, startups, enterprises, and government entities, and will serve as a foundational asset under the India AI Mission, accelerating AI innovation tailored to India’s needs.

2 replies · 0 retweets · 12 likes · 5.7K views
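The arithmetic behind the one-liner: the quoted 8 exaFLOPs spread over 64 wafer-scale parts.

```python
# "8 exaFLOPs = 64 wafers": per-wafer share of the quoted capacity.
total_flops = 8e18   # 8 exaFLOPs, from the Cerebras announcement
wafers = 64          # from Jordan's tweet
print(f"{total_flops / wafers / 1e15:.0f} PFLOP/s per wafer")   # -> 125
```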
Brian Kurtz @Math4Good
@gfodor It really cannot. It’s 1000 times bigger and the weights have to be stored on chip, and somehow they have to maintain the right balance of mults and memory. With sparse models you only activate 5% of chips. Weird and crazy.
2 replies · 0 retweets · 2 likes · 443 views
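A sketch of the utilization problem Brian is pointing at: if weights must live on-chip and a sparse MoE activates only ~5% of them per token (his figure), most weight-holding chips idle on any given token. The model and chip sizes below are illustrative assumptions:

```python
# Illustrative MoE-on-SRAM utilization math. The ~5% activation share is from
# the tweet; the model size and 500 MB/chip figure are assumptions.
total_weights_gb = 1000       # ~1T params at 8-bit -> ~1000 GB of weights
sram_per_chip_gb = 0.5
chips_for_weights = total_weights_gb / sram_per_chip_gb   # 2,000 chips
active_fraction = 0.05        # sparse MoE: ~5% of experts fire per token

active = chips_for_weights * active_fraction
print(f"{chips_for_weights:,.0f} chips to hold weights; "
      f"~{active:,.0f} busy per token ({active_fraction:.0%} utilization)")
```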
gfodor.id @gfodor
@Math4Good It will run this fast one day, not sure what your point is
2 replies · 0 retweets · 21 likes · 1.3K views
Brian Kurtz @Math4Good
@JordanNanos @taalas_inc What if you could run huge sparse MoEs twice as fast as the SRAM guys, but in one node? Like K2 huge.
0 replies · 0 retweets · 0 likes · 40 views
Jordan Nanos @JordanNanos
Many have experienced the difference between O(10) tok/s and O(100) tok/s, and compare small/fast with big/slow models.
Some have experienced O(1000) tok/s with Groq, Cerebras.
Now @taalas_inc has demo'd O(10,000) tok/s.
A glimpse of the future.
swyx @swyx
yesterday we chatted with @martin_casado and @sarahdingwang on the pod, and he happened to do basic math™ on the logic of ASICs. today @taalas_inc launched their HC1 ASIC that can inference 17k tok/s. Sure, it's a shitty 3.1 8B today, which is a 1.5-year gap. But read the details on the HC2 this winter, and do the math: this timeline will converge to 0 in the next 2 years. Build accordingly.

2 replies · 2 retweets · 17 likes · 3.3K views