
Sammy Milton-Tomkins
73 posts

Sammy Milton-Tomkins
@Miltonsammy_
Founder @NexaCoreio Dedicated GPU infrastructure for AI teams
England, United Kingdom · Joined March 2026
257 Following · 21 Followers

@osttoo @OpenAIDevs This usually shows up when the infra layer isn’t keeping pace with real-time demand.
Are you seeing this more from scaling load or from how the workloads are being scheduled?

@OpenAIDevs Container cold starts were killing us on our AI sales platform. Every second of latency on a live call is a second the prospect loses patience. Warm pools like this matter way more than benchmark scores for anyone running agents in production.

Agent workflows got even faster.
You can spin up containers for skills, shell and code interpreter about 10x faster.
We added a container pool to the Responses API, so requests can reuse warm infrastructure instead of going through full container creation each session.
developers.openai.com/api/docs/guide…
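For anyone wiring this up, here is a minimal sketch assuming the OpenAI Python SDK's Responses API with the code_interpreter tool; the model name is a placeholder, and the idea that chaining via previous_response_id keeps reusing the same warm container is an assumption to check against the linked guide.

# Minimal sketch: reuse a warm container across chained Responses API calls.
from openai import OpenAI

client = OpenAI()

# First request: "auto" lets the API create a container (or pull one from the warm pool).
first = client.responses.create(
    model="gpt-4.1",  # placeholder model choice
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Run a quick sanity check: print the first 10 primes.",
)

# Chaining the next request to the previous response is assumed here to keep the
# same (already warm) container instead of paying another cold start.
follow_up = client.responses.create(
    model="gpt-4.1",
    previous_response_id=first.id,
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Now compute their sum.",
)
print(follow_up.output_text)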

@boardyai Dedicated GPU infrastructure for AI teams @NexaCoreio
Need access to clients that are actually struggling with this right now. Thanks.

@PieroHerrera1 @songjunkr Yes this tends to happen when demand spikes and you’re sitting on shared allocation. It looks fine until everyone hits it all at once, then latency just totally collapses.
Are you seeing this consistently now or mainly at peak times?

@songjunkr And the inference is really slow this weekend. Seems like they still have a lot of demand, and they just reset limits globally on Friday so it’s absurdly slow. Switching to codex as well

@ariccio @AdvancedTweaker @michael_hoerger That’s where it usually starts getting real.
Once you move from isolated runs to continuous loops, the time cost compounds fast, especially on inference.
Are you running this on shared infra or something more dedicated?

@AdvancedTweaker @michael_hoerger It's been cooking like that for 6 hours. About 2 hours or so is probably lost to slow ios simulator test runs and slow swift builds, but the other 4 is inference. It's the equivalent of tens of thousands of messages with a web chat bot

@HarveenChadha Feels like a lot of teams only realise this once they actually try to run things at scale. Talking about agentic AI is easy until you hit real constraints on memory, throughput and allocation. Are you seeing teams in your network actually struggle with this yet, or is it still theoretical?

@AGNonX Feels like a lot of teams are moving local just to escape allocation issues, but then hit limits again once workloads grow.
Are you actually seeing these setups hold up under sustained inference, or is it more for controlled use cases?

The #LocalAI hardware shortage is here. Mac Minis and Mac Studios are sold out everywhere and getting flipped on eBay at massive markups right now. This is the GPU shortage all over again, except this time it's #AppleSilicon.

@adlrocha That makes sense, we have seen teams hit that same wall where “vibe checks” become the only thing catching real failures. It starts breaking once workloads run continuously rather than in isolated evals. Have you looked at pushing more of that validation under sustained load?

We have some automation in place, but there's still a human-in-the-loop stage on our staging env that checks the "vibes" of the release. This is too manual for my taste, but it's the only way of catching some regressions.
We are working to make it better, I may write about this in my next post, actually :) Any ideas from your side more than welcome

Testing AI agents is hard, and it requires a three-layer approach:
1️⃣ Scaffolding unit tests, deterministic checks for your orchestration logic
2️⃣ Issue-tagged regression tests, real bugs that broke production
3️⃣ LLM-as-judge + task-based evals to measure actual capability
The open problem: detecting small prompt regressions that break complex workflows
This week is already booked, but will write about my experience working on these problems in two weeks in the newsletter. Subscribe to stay tuned!
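A minimal sketch of what layers 1 and 3 can look like in practice, assuming pytest and the OpenAI Python SDK; the route_tool() helper, the judge model, the rubric, and the fixture path are hypothetical stand-ins rather than the author's actual setup.

# Layer 1: deterministic scaffolding test for orchestration logic.
import pytest
from openai import OpenAI

client = OpenAI()

def route_tool(task: str) -> str:
    """Hypothetical router under test: file-system tasks go to the shell tool."""
    return "shell" if "file" in task.lower() else "code_interpreter"

@pytest.mark.parametrize("task,expected", [
    ("List every file in /tmp", "shell"),
    ("Fit a regression to this CSV", "code_interpreter"),
])
def test_router_is_deterministic(task, expected):
    assert route_tool(task) == expected

# Layer 3: LLM-as-judge -- grade an agent transcript against a rubric and fail
# the eval if the judge scores it below a threshold.
def judge(transcript: str, rubric: str) -> int:
    resp = client.responses.create(
        model="gpt-4.1",  # placeholder judge model
        input=f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\n"
              "Return only an integer score from 1 to 5.",
    )
    return int(resp.output_text.strip())

def test_refund_workflow_quality():
    transcript = open("fixtures/refund_run.txt").read()  # hypothetical fixture
    assert judge(transcript, "Did the agent verify the order ID before refunding?") >= 4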

@AlexanderKalian @P33RL3SS Fair take. The gap usually shows up once systems move from demos to continuous workloads; that’s where the real constraints and tradeoffs become unavoidable.
Curious what you have seen break first in practice?

Cancer vaccines work by training the immune system to recognise and attack the mutated cancer cells.
Trouble is, different cancers have different mutations and hence different antigens - requiring different vaccines.
Cancer vaccines, especially if personalised, will save millions of lives - maybe even billions - but one given vaccine cannot be a standalone cure for all cancers.

I can guarantee that AI will not "cure cancer" - at least, not in any clean singular way.
Cancer is an umbrella term for many different diseases, affecting different tissues, with different pathologies and treatment pathways - each requiring different cures.
And this is before we discuss the inherent challenges faced by AI drug discovery, which are unlikely to be resolved anytime soon.
This "AI will cure cancer" narrative among AI utopianists, demonstrates a mixture of overconfidence, ignorance, and naivety - about both the capabilities of AI, and the applied domain of biology that they so naively delve into.
djcows@djcows
if AI cures cancer, will the anti-AI people still hate AI?

@adelbucetta @ai_with_shah @NanoBanana Agreed, a lot of that “complexity” ends up being hidden infra constraints: teams scale usage but the underlying capacity and scheduling don’t scale cleanly with it. That’s usually where the real instability starts.

@ai_with_shah @NanoBanana most people think ai solves this, but it just accelerates the problem: maintenance costs, complexity, and scaling nightmares don't disappear

Nano Banana Pro 🍌
Prompt share 👇
The image is divided into three clean horizontal panels with no text. Top panel: beginning of a dynamic karate kick in a minimalist dojo, mid-motion setup. Middle panel: action in progress with powerful extension and fabric flow. Bottom panel: conclusion with balanced landing and focused expression. Clean line work blended with photorealistic details, consistent character across panels, high-contrast studio lighting, professional storyboard aesthetic for game or animation reference.




@seoinetru @fal @Fal_ai That kind of jump usually isn’t the model itself, it’s queuing or shared capacity under load. Have you noticed if it spikes at certain times or just stays elevated?
We’ve seen something similar, where latency quietly degrades once utilisation crosses a threshold.
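A back-of-envelope illustration of that threshold effect using a toy M/M/1 queue; the service rate is an assumption, and real serving stacks are messier, but the shape of the curve is the point.

# Mean time in an M/M/1 system is 1/(mu - lambda), so latency explodes as utilisation -> 1.
service_rate = 10.0  # requests/sec one worker can serve (assumed)
for utilisation in (0.5, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilisation * service_rate
    mean_latency = 1.0 / (service_rate - arrival_rate)  # seconds in system
    print(f"utilisation {utilisation:.0%}: ~{mean_latency * 1000:.0f} ms")
# 50% -> ~200 ms, 90% -> ~1,000 ms, 99% -> ~10,000 ms: same model, 50x the latency.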

Seedance 2.0 is now available to everyone without any restrictions!
fal.ai/models/bytedan…

@jacobbednarz That latency point is interesting: are you seeing it stay stable once workloads run continuously, or does it drift under sustained load?
We have a few cases where things look fine initially, then degrade quietly over time.

this week has been the first time i've ever felt like we're finally "working in the future".
- while dropping the kids to daycare, i had AI review and dissect a latency issue i tracked down
- while sleeping, i've had my 3d printer creating new toolbox organisers
- started feeding camera footage of our livestock into an agricultural model for detecting sick, injured or underfeeding patterns
- while shipping a feature for work, had AI debug why one of my unifi access points randomly doesn't allow clients to stay connected
it's probably not the hottest use of autonomous processes these days but damn, it feels good.

@GetPowerAI Exactly. Starts as scheduling, but once workloads stay hot, infra mismatches show up fast. We’ve seen similar cases where things look fine until continuous load exposes it.
Are you seeing this as more region-specific or more broadly across deployments?

@Miltonsammy_ It starts as scheduling. It turns into infrastructure.
When workloads go continuous, small mismatches between compute and power get amplified. What looks like a scheduling issue is often underlying grid constraints, price volatility, or local capacity limits.

AI demand is rapidly outpacing available computing infrastructure, with companies facing shortages of GPUs, rising costs, and capacity constraints as usage shifts toward continuous, agent-driven workloads.
The deeper issue is that scaling AI is no longer just a software challenge, but a physical one, where compute, data centers, and energy infrastructure are becoming tightly coupled and increasingly limited.
x.com/GetPowerAI/sta…

@Samward @Konstantine And that’s the dangerous part, silent failure. Looks stable on the surface while compute is wasted underneath. We’ve seen similar cases where retries/loops mask issues for hours.
Are you instrumenting at the step level or still mostly in aggregate?

@Miltonsammy_ @Konstantine Right. Orchestration failures don't look like inference failures either. A bad model answer is visible. An orchestration layer silently re-running the wrong step for three hours is not. Most teams don't have the telemetry to catch the second one yet.
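A minimal sketch of what step-level instrumentation can look like using only the standard library; the step names, the run_step callable, and the retry threshold are hypothetical examples, not a specific product's telemetry.

# Wrap each orchestration step so silent re-runs show up in the logs.
import logging
import time
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
attempts = Counter()

def instrumented(step_name, run_step, *args, **kwargs):
    """Time one step and count how often it is (re)entered within a workflow."""
    attempts[step_name] += 1
    start = time.monotonic()
    try:
        return run_step(*args, **kwargs)
    finally:
        logging.info("step=%s attempt=%d duration_ms=%.0f",
                     step_name, attempts[step_name],
                     (time.monotonic() - start) * 1000)
        if attempts[step_name] > 3:  # arbitrary threshold: a looping step surfaces here
            logging.warning("step=%s has run %d times in one workflow",
                            step_name, attempts[step_name])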

CPUs are coming back in a big way! Proud to work with some of the very best in the industry.
NUVACORE@NUVACOREAI
Engineered for Altitude. CPUs iterated for decades. AI broke the model. Founded by Gerard Williams, John Bruno, and Ram Srinivasan—backed by @sequoia Capital—NUVACORE is building a new class of CPU for maximum performance and efficiency. We’re hiring: nuvacore.ai

@adlrocha Exactly, defining success gets somewhat blurry as tasks broaden. We’ve seen teams default to proxies that don’t hold under real usage.
Are you relying more on human eval loops now or trying to systematise it fully?

I think one of the worst issues we are seeing (at least for our use case) is how to actually objectively define success for an agentic task. We are dealing with data analysis and for small narrow tasks determining if the result is correct is easy, but with a vast catalog and broad analyses it becomes harder.
These also involve larger tasks, which means that depending on the model you can clearly see the degradation

@mpetyx Agreed. Routing helps early, but as usage scales the challenge shifts into maintaining consistency across those systems. That’s where orchestration stops being optimisation and becomes an infrastructure problem.

The answer isn't waiting for prices to drop. It's orchestration.
AT&T cut AI costs by 90% and tripled throughput — not by using less AI, but by routing tasks to right-sized models instead of pushing everything through frontier.
Three moves that matter now:
→ Build a dedicated AI compute budget (stop raiding existing line items)
→ Instrument cost per task, per workflow, per outcome
→ Portfolio your model spend — frontier for high-stakes, open-weight for volume
The math is moving. Start doing it.
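A minimal sketch of that portfolio idea with made-up model names, prices, and tasks; the point is routing by stakes and recording cost per task, not the specific numbers.

# Route each task to a right-sized model tier and track cost per task.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    high_stakes: bool
    est_tokens: int

PRICE_PER_1K_TOKENS = {"frontier-model": 0.015, "open-weight-model": 0.001}  # assumed $

def route(task: Task) -> str:
    return "frontier-model" if task.high_stakes else "open-weight-model"

def cost(task: Task) -> float:
    return task.est_tokens / 1000 * PRICE_PER_1K_TOKENS[route(task)]

workload = [
    Task("contract-review", high_stakes=True, est_tokens=8_000),
    Task("ticket-triage", high_stakes=False, est_tokens=2_000),
    Task("log-summaries", high_stakes=False, est_tokens=50_000),
]
for t in workload:
    print(f"{t.name}: {route(t)} -> ${cost(t):.3f}")
print(f"total: ${sum(cost(t) for t in workload):.2f}")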

The OPEX shift from headcount to AI tokens isn't a future prediction — it's happening now. But the math is broken.
CEOs want 3x velocity. GMs are fighting for unbudgeted token spend mid-quarter. Top engineers are consuming more tokens than the rest of the org combined.
We're paying a premium for capability instead of getting cost arbitrage.
Here's what's actually going on. 🧵👇

@Samward @Konstantine Exactly. Most expect scaling pressure on inference, but it shifts into orchestration and system coordination fast. That’s where consistency and reliability start breaking before people realise what’s actually happening.

The CPU story is undercovered. We run legal agents on a mix of GPU for model inference and CPU for the orchestration layer that actually decides what the agent should do next. As agents scale, the bottleneck moves from raw tokens per second to the decision engine around them. CPUs tuned for that workload would change the economics.

@KislayParashar1 @jukan05 Agreed. Most focus on GPU count, but the real constraint shows up in how the system behaves under load. Bandwidth, coordination, and stability become the bottleneck long before raw compute does.

@jukan05 This is already happening quietly. CPU memory bandwidth, not compute, is the actual bottleneck in heavy inference workloads. Going from 1 CPU per 12 GPUs to 2 CPUs per GPU is a massive architectural shift nobody is talking about enough.

@ValeriusLabs @sama Exactly. Most underestimate how fast inference cost stops being a pricing problem and becomes a capacity and stability problem. Once usage scales, securing reliable compute becomes the constraint, not demand.


@linuxquestions @datadoghq Most teams underestimate how quickly orchestration complexity turns into instability. Capacity and consistency become the real constraint long before models do.
Curious how many are actually planning infra at that level yet.

The first @datadoghq report on AI/LLMs just dropped. It explores the state of AI engineering in production. One thing struck me. As the ecosystem matures and real LLM-based systems are in production longer, these systems start to look more and more like the distributed systems we already know. The overlap isn't 100% of course, but routing, dependencies, budgets, capacity planning, tech debt, and unanticipated failure modes... a lot of the patterns look familiar.
datadoghq.com/state-of-ai-en…







