
Sammy Milton-Tomkins
73 posts

Sammy Milton-Tomkins
@Miltonsammy_
Founder @NexaCoreio Dedicated GPU infrastructure for AI teams
England, United Kingdom · Joined March 2026
257 Following · 21 Followers

@osttoo @OpenAIDevs This usually shows up when the infra layer isn’t keeping pace with real-time demand.
Are you seeing this more from scaling load or from how the workloads are being scheduled?

@OpenAIDevs Container cold starts were killing us on our AI sales platform. Every second of latency on a live call is a second the prospect loses patience. Warm pools like this matter way more than benchmark scores for anyone running agents in production.

Agent workflows got even faster.
You can spin up containers for skills, shell and code interpreter about 10x faster.
We added a container pool to the Responses API, so requests can reuse warm infrastructure instead of going through full container creation each session.
developers.openai.com/api/docs/guide…
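For anyone wiring this up, here is a minimal sketch assuming the OpenAI Python SDK's Responses API with the code_interpreter tool; the model name is a placeholder, and the idea that chaining via previous_response_id keeps reusing the same warm container is an assumption to check against the linked guide.

# Minimal sketch: reuse a warm container across chained Responses API calls.
from openai import OpenAI

client = OpenAI()

# First request: "auto" lets the API create a container (or pull one from the warm pool).
first = client.responses.create(
    model="gpt-4.1",  # placeholder model choice
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Run a quick sanity check: print the first 10 primes.",
)

# Chaining the next request to the previous response is assumed here to keep the
# same (already warm) container instead of paying another cold start.
follow_up = client.responses.create(
    model="gpt-4.1",
    previous_response_id=first.id,
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Now compute their sum.",
)
print(follow_up.output_text)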

@boardyai Dedicated GPU infrastructure for AI teams @NexaCoreio
Need access to clients that are actually struggling with this right now. Thanks.

@PieroHerrera1 @songjunkr Yes this tends to happen when demand spikes and you’re sitting on shared allocation. It looks fine until everyone hits it all at once, then latency just totally collapses.
Are you seeing this consistently now or mainly at peak times?

@songjunkr And the inference is really slow this weekend. Seems like they still have a lot of demand, and they just reset limits globally on Friday so it’s absurdly slow. Switching to codex as well

@ariccio @AdvancedTweaker @michael_hoerger That’s where it usually starts getting real.
Once you move from isolated runs to continuous loops, the time cost compounds fast, especially on inference.
Are you running this on shared infra or something more dedicated?

@AdvancedTweaker @michael_hoerger It's been cooking like that for 6 hours. About 2 hours or so is probably lost to slow ios simulator test runs and slow swift builds, but the other 4 is inference. It's the equivalent of tens of thousands of messages with a web chat bot

@HarveenChadha Feels like a lot of teams only realise this once they actually try to run things at scale. Talking about agentic AI is easy until you hit real constraints on memory, throughput and allocation. Are you seeing teams in your network actually struggle with this yet, or is it still theoretical?

@AGNonX Feels like a lot of teams are moving local just to escape allocation issues, but then hit limits again once workloads grow.
Are you actually seeing these setups hold up under sustained inference, or is it more for controlled use cases?

The #LocalAI hardware shortage is here. Mac Minis and Mac Studios are sold out everywhere and getting flipped on eBay at massive markups right now. This is the GPU shortage all over again, except this time it's #AppleSilicon.

@adlrocha That makes sense, we have seen teams hit that same wall where “vibe checks” become the only thing catching real failures. It starts breaking once workloads run continuously rather than in isolated evals. Have you looked at pushing more of that validation under sustained load?

We have some automation in place, but there's still a human-in-the-loop stage on our staging env that checks the "vibes" of the release. This is too manual for my taste, but it's the only way of catching some regressions.
We are working to make it better, I may write about this in my next post, actually :) Any ideas from your side more than welcome

Testing AI agents is hard, and it requires a three-layer approach:
1️⃣ Scaffolding unit tests, deterministic checks for your orchestration logic
2️⃣ Issue-tagged regression tests, real bugs that broke production
3️⃣ LLM-as-judge + task-based evals to measure actual capability
The open problem: detecting small prompt regressions that break complex workflows
This week is already booked, but will write about my experience working on these problems in two weeks in the newsletter. Subscribe to stay tuned!
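A minimal sketch of what layers 1 and 3 can look like in practice, assuming pytest and the OpenAI Python SDK; the route_tool() helper, the judge model, the rubric, and the fixture path are hypothetical stand-ins rather than the author's actual setup.

# Layer 1: deterministic scaffolding test for orchestration logic.
import pytest
from openai import OpenAI

client = OpenAI()

def route_tool(task: str) -> str:
    """Hypothetical router under test: file-system tasks go to the shell tool."""
    return "shell" if "file" in task.lower() else "code_interpreter"

@pytest.mark.parametrize("task,expected", [
    ("List every file in /tmp", "shell"),
    ("Fit a regression to this CSV", "code_interpreter"),
])
def test_router_is_deterministic(task, expected):
    assert route_tool(task) == expected

# Layer 3: LLM-as-judge -- grade an agent transcript against a rubric and fail
# the eval if the judge scores it below a threshold.
def judge(transcript: str, rubric: str) -> int:
    resp = client.responses.create(
        model="gpt-4.1",  # placeholder judge model
        input=f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\n"
              "Return only an integer score from 1 to 5.",
    )
    return int(resp.output_text.strip())

def test_refund_workflow_quality():
    transcript = open("fixtures/refund_run.txt").read()  # hypothetical fixture
    assert judge(transcript, "Did the agent verify the order ID before refunding?") >= 4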

@AlexanderKalian @P33RL3SS Fair take. The gap usually shows up once systems move from demos to continuous workloads; that’s where the real constraints and tradeoffs become unavoidable.
Curious what you have seen break first in practice?

Cancer vaccines work by training the immune system to recognise and attack the mutated cancer cells.
Trouble is, different cancers have different mutations and hence different antigens - requiring different vaccines.
Cancer vaccines, especially if personalised, will save millions of lives - maybe even billions - but one given vaccine cannot be a standalone cure for all cancers.

I can guarantee that AI will not "cure cancer" - at least, not in any clean singular way.
Cancer is an umbrella term for many different diseases, affecting different tissues, with different pathologies and treatment pathways - each requiring different cures.
And this is before we discuss the inherent challenges faced by AI drug discovery, which are unlikely to be resolved anytime soon.
This "AI will cure cancer" narrative among AI utopianists, demonstrates a mixture of overconfidence, ignorance, and naivety - about both the capabilities of AI, and the applied domain of biology that they so naively delve into.
djcows@djcows
if AI cures cancer, will the anti-AI people still hate AI?

@adelbucetta @ai_with_shah @NanoBanana Agreed, a lot of that “complexity” ends up being hidden infra constraints: teams scale usage but the underlying capacity and scheduling don’t scale cleanly with it. That’s usually where the real instability starts.

@ai_with_shah @NanoBanana most people think ai solves this, but it just accelerates the problem: maintenance costs, complexity, and scaling nightmares don't disappear

Nano Banana Pro 🍌
Prompt share 👇
The image is divided into three clean horizontal panels with no text. Top panel: beginning of a dynamic karate kick in a minimalist dojo, mid-motion setup. Middle panel: action in progress with powerful extension and fabric flow. Bottom panel: conclusion with balanced landing and focused expression. Clean line work blended with photorealistic details, consistent character across panels, high-contrast studio lighting, professional storyboard aesthetic for game or animation reference.




@seoinetru @fal @Fal_ai That kind of jump usually isn’t the model itself, it’s queuing or shared capacity under load. Have you noticed if it spikes at certain times or just stays elevated?
We’ve seen something similar, where latency quietly degrades once utilisation crosses a threshold.
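A back-of-envelope illustration of that threshold effect using a toy M/M/1 queue; the service rate is an assumption, and real serving stacks are messier, but the shape of the curve is the point.

# Mean time in an M/M/1 system is 1/(mu - lambda), so latency explodes as utilisation -> 1.
service_rate = 10.0  # requests/sec one worker can serve (assumed)
for utilisation in (0.5, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilisation * service_rate
    mean_latency = 1.0 / (service_rate - arrival_rate)  # seconds in system
    print(f"utilisation {utilisation:.0%}: ~{mean_latency * 1000:.0f} ms")
# 50% -> ~200 ms, 90% -> ~1,000 ms, 99% -> ~10,000 ms: same model, 50x the latency.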

Seedance 2.0 is now available to everyone without any restrictions!
fal.ai/models/bytedan…

@jacobbednarz That latency point is interesting: are you seeing it stay stable once workloads run continuously, or does it drift under sustained load?
We have a few cases where things look fine initially, then degrade quietly over time.

this week has been the first time i've ever felt like we're finally "working in the future".
- while dropping the kids to daycare, i had AI review and dissect a latency issue i tracked down
- while sleeping, i've had my 3d printer creating new toolbox organisers
- started feeding camera footage of our livestock into an agricultural model for detecting sick, injured or underfeeding patterns
- while shipping a feature for work, had AI debug why one of my unifi access points randomly doesn't allow clients to stay connected
it's probably not the hottest use of autonomous processes these days but damn, it feels good.

@GetPowerAI Exactly. Starts as scheduling, but once workloads stay hot, infra mismatches show up fast. We’ve seen similar cases where things look fine until continuous load exposes it.
Are you seeing this as more region-specific or more broadly across deployments?

@Miltonsammy_ It starts as scheduling. It turns into infrastructure.
When workloads go continuous, small mismatches between compute and power get amplified. What looks like a scheduling issue is often underlying grid constraints, price volatility, or local capacity limits.

AI demand is rapidly outpacing available computing infrastructure, with companies facing shortages of GPUs, rising costs, and capacity constraints as usage shifts toward continuous, agent-driven workloads.
The deeper issue is that scaling AI is no longer just a software challenge, but a physical one, where compute, data centers, and energy infrastructure are becoming tightly coupled and increasingly limited.
x.com/GetPowerAI/sta…

@Samward @Konstantine And that’s the dangerous part, silent failure. Looks stable on the surface while compute is wasted underneath. We’ve seen similar cases where retries/loops mask issues for hours.
Are you instrumenting at the step level or still mostly in aggregate?

@Miltonsammy_ @Konstantine Right. Orchestration failures don't look like inference failures either. A bad model answer is visible. An orchestration layer silently re-running the wrong step for three hours is not. Most teams don't have the telemetry to catch the second one yet.
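A minimal sketch of what step-level instrumentation can look like using only the standard library; the step names, the run_step callable, and the retry threshold are hypothetical examples, not a specific product's telemetry.

# Wrap each orchestration step so silent re-runs show up in the logs.
import logging
import time
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
attempts = Counter()

def instrumented(step_name, run_step, *args, **kwargs):
    """Time one step and count how often it is (re)entered within a workflow."""
    attempts[step_name] += 1
    start = time.monotonic()
    try:
        return run_step(*args, **kwargs)
    finally:
        logging.info("step=%s attempt=%d duration_ms=%.0f",
                     step_name, attempts[step_name],
                     (time.monotonic() - start) * 1000)
        if attempts[step_name] > 3:  # arbitrary threshold: a looping step surfaces here
            logging.warning("step=%s has run %d times in one workflow",
                            step_name, attempts[step_name])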

CPUs are coming back in a big way! Proud to work with some of the very best in the industry.
NUVACORE@NUVACOREAI
Engineered for Altitude. CPUs iterated for decades. AI broke the model. Founded by Gerard Williams, John Bruno, and Ram Srinivasan—backed by @sequoia Capital—NUVACORE is building a new class of CPU for maximum performance and efficiency. We’re hiring: nuvacore.ai

@adlrocha Exactly, defining success gets somewhat blurry as tasks broaden. We’ve seen teams default to proxies that don’t hold under real usage.
Are you relying more on human eval loops now or trying to systematise it fully?

I think one of the worst issues we are seeing (at least for our use case) is how to actually objectively define success for an agentic task. We are dealing with data analysis and for small narrow tasks determining if the result is correct is easy, but with a vast catalog and broad analyses it becomes harder.
These also involve larger tasks, which means that depending on the model you can clearly see the degradation

@mpetyx Agreed. Routing helps early, but as usage scales the challenge shifts into maintaining consistency across those systems. That’s where orchestration stops being optimisation and becomes an infrastructure problem.

The answer isn't waiting for prices to drop. It's orchestration.
AT&T cut AI costs by 90% and tripled throughput — not by using less AI, but by routing tasks to right-sized models instead of pushing everything through frontier.
Three moves that matter now:
→ Build a dedicated AI compute budget (stop raiding existing line items)
→ Instrument cost per task, per workflow, per outcome
→ Portfolio your model spend — frontier for high-stakes, open-weight for volume
The math is moving. Start doing it.
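A minimal sketch of that portfolio idea with made-up model names, prices, and tasks; the point is routing by stakes and recording cost per task, not the specific numbers.

# Route each task to a right-sized model tier and track cost per task.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    high_stakes: bool
    est_tokens: int

PRICE_PER_1K_TOKENS = {"frontier-model": 0.015, "open-weight-model": 0.001}  # assumed $

def route(task: Task) -> str:
    return "frontier-model" if task.high_stakes else "open-weight-model"

def cost(task: Task) -> float:
    return task.est_tokens / 1000 * PRICE_PER_1K_TOKENS[route(task)]

workload = [
    Task("contract-review", high_stakes=True, est_tokens=8_000),
    Task("ticket-triage", high_stakes=False, est_tokens=2_000),
    Task("log-summaries", high_stakes=False, est_tokens=50_000),
]
for t in workload:
    print(f"{t.name}: {route(t)} -> ${cost(t):.3f}")
print(f"total: ${sum(cost(t) for t in workload):.2f}")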

The OPEX shift from headcount to AI tokens isn't a future prediction — it's happening now. But the math is broken.
CEOs want 3x velocity. GMs are fighting for unbudgeted token spend mid-quarter. Top engineers are consuming more tokens than the rest of the org combined.
We're paying a premium for capability instead of getting cost arbitrage.
Here's what's actually going on. 🧵👇

@Samward @Konstantine Exactly. Most expect scaling pressure on inference, but it shifts into orchestration and system coordination fast. That’s where consistency and reliability start breaking before people realise what’s actually happening.

The CPU story is undercovered. We run legal agents on a mix of GPU for model inference and CPU for the orchestration layer that actually decides what the agent should do next. As agents scale, the bottleneck moves from raw tokens per second to the decision engine around them. CPUs tuned for that workload would change the economics.

@KislayParashar1 @jukan05 Agreed. Most focus on GPU count, but the real constraint shows up in how the system behaves under load. Bandwidth, coordination, and stability become the bottleneck long before raw compute does.

@jukan05 This is already happening quietly. CPU memory bandwidth, not compute, is the actual bottleneck in heavy inference workloads. Going from 1 CPU per 12 GPUs to 2 CPUs per GPU is a massive architectural shift nobody is talking about enough.

@ValeriusLabs @sama Exactly. Most underestimate how fast inference cost stops being a pricing problem and becomes a capacity and stability problem. Once usage scales, securing reliable compute becomes the constraint, not demand.


@linuxquestions @datadoghq Most teams underestimate how quickly orchestration complexity turns into instability. Capacity and consistency become the real constraint long before models do.
Curious how many are actually planning infra at that level yet.

The first @datadoghq report on AI/LLMs just dropped. It explores the state of AI engineering in production. One thing struck me. As the ecosystem matures and real LLM-based systems are in production longer, these systems start to look more and more like the distributed systems we already know. The overlap isn't 100% of course, but routing, dependencies, budgets, capacity planning, tech debt, and unanticipated failure modes... a lot of the patterns look familiar.
datadoghq.com/state-of-ai-en…







