Yashsmith shah

343 posts

@Yashsmith_dev

Just an engineer trying to turn curiosity into code | 8x hackathon winner incl NVIDIA AI hack | Tech club chair | currently in the GenAI & startup rabbit hole

San Francisco, CA · Joined November 2021

568 Following · 76 Followers

Yashsmith shah@Yashsmith_dev·1 May

@sama how will they pay for the increasing model prices? will it be free? 😁

Sam Altman@sama·1 May

i'm hopeful for a future where people who want to work really hard have incredibly fulfilling things to do, and people who don't want to work hard don't have to and can still have an amazing life of prosperity.

325

148

260.2K

Sam Altman@sama·1 May

we want to build tools to augment and elevate people, not entities to replace them.

2.4K

755

11.7K

3.1M

Yashsmith shah@Yashsmith_dev·1 May

@ShubhamInTech love the concept! if you want to power this with open-source models and skip the cold-start problems, check out @InferX: any model, sub-second wake times, zero infra stress

shubham@ShubhamInTech·1 May

Introducing Agnost AI. The first infrastructure that lets your AI agents learn from every user. It catches every raw intent, connects it back to what your agent did wrong, and improves the agent. Autonomously. Because with AI agents, there's no complaint, just churn.

286

23.1K

Yashsmith shah@Yashsmith_dev·1 May

@AnhPhuNguyen1 if you don't want to manage infra and don't want cold starts (most important for your product), you can always look at @InferXai: sub-second cold starts, any model integration you ask for

AnhPhu Nguyen@AnhPhuNguyen1·30 Apr

with Mira, AI can now live on your face. capture every conversation. create the most personalized form of AI ever. order now.

457

331

3.2K

2.4M

Yashsmith shah@Yashsmith_dev·1 May

@im_roy_lee we at @InferXai have the infra to serve models that hold those insane contexts, with sub-second cold starts. exactly your use case

Roy@im_roy_lee·30 Apr

we tried to build this ourselves at cluely, but it's a hard problem to solve
fundamentally the problem is with the models
for proactive ai glasses (or even just proactive ai), you need
> sub 300ms response time
> continuous visual scene understanding
> super high precision on "when to help"
> insane context window
> ultra low-friction output (a text blob is high friction)
and like 10 other core features that just don't exist yet
impressed that this team does not stop and are constantly finding workarounds to try and win an obvious future form factor

AnhPhu Nguyen@AnhPhuNguyen1

with Mira, AI can now live on your face. capture every conversation. create the most personalized form of AI ever. order now.

138

2.5K

483.4K

Yashsmith shah@Yashsmith_dev·1 May

@Techmeme @dinabass throwing $615M at silicon proves how deep the pain point is. the next massive leap will be redefining the inference architecture itself. we at @InferXai are already on it.

Yashsmith shah@Yashsmith_dev·1 May

@ycombinator @sdianahu building new silicon for a serving and runtime issue is overkill. inference is not a chip problem, it is a system problem. agents need smarter scheduling and state management at the infra level to fix that 30-40% utilization gap. that is exactly what we are solving at @InferXai

Y Combinator@ycombinator·27 Apr

Inference Chips for Agent Workflows @sdianahu Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit 30–40% utilization as a result. That gap is where purpose-built silicon wins.

403

706K

Y Combinator@ycombinator·27 Apr

AI has stopped being a feature and started being the foundation. We're excited about a new wave of startups rebuilding software, services, and silicon— and pushing AI into the physical world. ycombinator.com/rfs

210

956

8.9K

4.4M

Yashsmith shah@Yashsmith_dev·1 May

@arankomatsuzaki 100%. building custom silicon for a serving runtime issue is wild. inference is a system problem, not a chip problem. that is exactly the layer we are building at @InferXai

Aran Komatsuzaki@arankomatsuzaki·28 Apr

This feels like confusing a serving-runtime problem for a chip-startup opportunity. Agents do change inference patterns: loops, tool calls, branching, long context, KV reuse, burstiness. But most of that is an inference systems problem: scheduling, routing, KV-cache management, etc. Think Dynamo. By the time a new chip co tapes out + builds a compiler stack + wins cloud distribution, NVIDIA/AMD will likely have baked the obvious hardware-level optimizations into existing platforms.

Y Combinator@ycombinator

27.5K

Yashsmith shah@Yashsmith_dev·1 May

@0xSero @badlogicgames metrics and logs are crucial, but the bugs definitely aren't. @InferXai handles the serverless GPU inference smoothly so you can skip the setup headaches. might be a solid alternative for your workflows

0xSero@0xSero·27 Apr

once vllm-studio is less stop it should do just this, I mean it works for me but onboarding is non-existent and it has tons of bugs from months of vibecoding. It works though, this is how I run all my inference workflows.
- Recipes (cached settings)
- Usage metrics
- Server logs
- GPU and inference data

1.6K

Mario Zechner@badlogicgames·27 Apr

It would be lovely if inference engines would have endpoints that advertise:
- all locally cached models and their specs (input modalities, context window size, thinking, tool calling)
- loaded model(s)
that would allow harnesses to easily enumerate things dynamically, instead of having users (or their agents) write silly config files.
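
A rough sketch of what such a discovery response could look like, in Python. Every name here (`ModelSpec`, `discovery_payload`, the field names, the example model names) is hypothetical and invented for illustration; it does not describe the API of any existing inference engine.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelSpec:
    # All field names are hypothetical; they mirror the wish list in the post.
    name: str
    input_modalities: list[str]
    context_window: int
    thinking: bool
    tool_calling: bool
    loaded: bool = False


def discovery_payload(models: list[ModelSpec]) -> str:
    """Serialize every cached model plus the names of the loaded ones."""
    return json.dumps(
        {
            "cached_models": [asdict(m) for m in models],
            "loaded_models": [m.name for m in models if m.loaded],
        },
        indent=2,
    )


if __name__ == "__main__":
    # Made-up catalog entries, only to show the shape of the output.
    catalog = [
        ModelSpec("example-chat-7b", ["text"], 32768, False, True, loaded=True),
        ModelSpec("example-vision-12b", ["text", "image"], 131072, True, True),
    ]
    print(discovery_payload(catalog))
```

A harness could read this JSON once at startup and enumerate models from it instead of relying on a hand-written config file.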

163

13.1K

Yashsmith shah@Yashsmith_dev·1 May

@TTT_Shogun spot on. if you're looking to power that OS with open-source models on serverless GPUs, check out @InferXai. an execution layer needs zero cold starts to feel native, and we have that covered. DM me if you need any help

TTT@TTT_Shogun·26 Apr

YC just made it official: AI-native companies will run on closed-loop, queryable systems. The same shift is happening one layer down. The AI-native individual needs a queryable life and an execution layer that closes the loop. That's the OS we're building. youtube.com/watch?si=tledo… @ycombinator

242

Yashsmith shah@Yashsmith_dev·1 May

@AradhyeAgarwal hey, I work at @InferXai and they have already been doing serverless inference for agentic workflows. you can build on top of it. DM me if you want to partner with us

508

Aradhye Agarwal@AradhyeAgarwal·1 May

I'm applying to YC with the "Inference Chips for Agentic Workflows" idea. Looking for co-founders with a background in ML systems. DMs are open.

186

19.8K

Yashsmith shah@Yashsmith_dev·23 Apr

@PMV_InferX @huggingface interesting!

Prashanth (Manohar) Velidandi@PMV_InferX·22 Apr

Saying “open source has to win” is useless. If it’s going to win, it has to be usable. Kimi K2.6 is already close to GPT-level quality. Model quality isn’t the bottleneck anymore. Access is. 3M models on @huggingface. ~1% have real inference support. What’s the point? We'll be talking about how we are changing that on April 28th. Live Demo • Q&A luma.com/w22edapy

Yashsmith shah@Yashsmith_dev·11 Apr

@TheAhmadOsman You can always get more on @InferXai: 2 H100s 👀

Ahmad@TheAhmadOsman·11 Apr

I don’t have enough GPUs

212

10.3K

Yashsmith shah@Yashsmith_dev·11 Apr

@AlexEngineerAI Qwen distilled by Opus 4.6, inferenced by @InferXai, truly is the best cost-to-quality ratio out there. also got $30 worth of free credits

Alex the Engineer@AlexEngineerAI·10 Apr

If you could only use one open source model for the rest of 2026, which one are you picking? I'm torn between Qwen3 and the new Gemma 4. Drop your stack in the replies. 👇

2.2K

Yashsmith shah@Yashsmith_dev·9 Apr

@thdxr The math is right, but it ignores utilization. Step 2 falls apart if your GPUs sit idle during 2-minute cold starts. You need instant routing to make the economics work. This is exactly why the sub-second H100 spin-ups on @InferXai actually matter for the bottom line.

279

dax@thdxr·9 Apr

inference is very profitable and probably a good opportunity to understand some basic business math
1. companies buy long lived assets like GPUs. these are one time costs and the asset depreciates over time
2. once you own this asset, you can plug it in and produce tokens which you can sell. the cost of goods sold here can be very low and you might be making 90% margins at scale, this is why we say inference is profitable
3. then you also hire employees to do r&d work to improve your systems, come up with new models, expand the business
if you add these 3 up you end up with $0. you're not producing a profit because the business is growing and you're reinvesting it all buying assets or r&d to meet demand
if it's obvious to other people the business is working, you can raise money from them to accelerate all these numbers so they max out in 5 years instead of 25
so on paper you'll be "losing money" every year but that's because you want to make sure you lock down the opportunity before someone else
the bigger your market is the bigger this burn can be because it's a function of potential
so when you see these companies losing a lot of money it doesn't mean the whole concept of their business is broken
it's possible they misjudge and overinvest on 1+3 and will suffer some consequences but fundamentally 2 does work

dax@thdxr

@d4m1n i'm a bit confused why so many people say api tokens are sold at a loss this isn't true - these models are incredibly expensive compared to the gpu time cost there's potential for 90% margin depending on the model

1.4K

151.5K

Yashsmith shah@Yashsmith_dev·9 Apr

Had a great and informative meeting with @MrTalkStock! never thought a person with more than 250k followers across all social media would be so nice as to hop on a call to guide and help. always grateful to all the mentors

Yashsmith shah@Yashsmith_dev·6 Apr

@HarshithLucky3 it's really great, you can test it out on @InferXai. they give around $30 worth of free credits and 2 H100 GPUs to test it out yourself and integrate into your application

Harshith@HarshithLucky3·6 Apr

WAIT WHAT Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is trending in top 2 on Hugging Face Is it good distilled model?

827

132.8K

Yashsmith shah reposted

Harsh Chandramania@HChandramania·5 Apr

@Spotify @spotifyindia a solution to this would be really helpful. Would appreciate if any of my followers here could repost this. P.S. had to post it as images because X won't just let me type the whole thing :(

Yashsmith shah@Yashsmith_dev·6 Apr

@outsource_ Fitting weights in 16GB is cool, but agent loops require huge context windows that will instantly OOM that setup. The quantization hit is not worth it. I just use this exact Qwen variant on @InferXai. One click for the uncompromised version
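
To see why long contexts add memory pressure on top of the weights, here is a back-of-the-envelope KV-cache estimate in Python. It uses the standard accounting for a transformer with grouped-query attention (one key and one value tensor per layer). The layer count, KV-head count, and head size below are placeholder assumptions, not the published configuration of the Qwen variant discussed here, and models with different attention designs will differ.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,  # 2 bytes per element for FP16/BF16
) -> int:
    """KV-cache size for one sequence: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical architecture numbers, chosen only for illustration.
    layers, kv_heads, dim = 64, 8, 128
    for tokens in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(layers, kv_heads, dim, tokens) / 2**30
        print(f"{tokens:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumed numbers the cache grows linearly with context length, so a window sixteen times longer needs sixteen times the cache memory.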

Eric ⚡️ Building...@outsource_·31 Mar

🚀 Imagine running Claude 4.6 Opus-level reasoning... but entirely on your own GPU with just 16GB VRAM. This 27B Qwen3.5 variant, distilled on Claude 4.6 Opus reasoning traces, delivers frontier coding power locally. It’s beating Claude Sonnet 4.5 on SWE-bench in 4-bit quantization (Q4_K_M) while slashing chain-of-thought bloat by 24%. ✅ Retains 96.91% HumanEval accuracy ✅ Perfect for agentic coding loops (no API costs or latency) 300K+ downloads on HF Link below 👇🏻

291

3.4K

280.3K

Yashsmith shah@Yashsmith_dev·6 Apr

@XunWallace The math does not work. A 4090 has 24GB VRAM, but a 70B model needs ~40GB. You will still bottleneck offloading to Mac RAM. This is why I use @InferXai now. Sub-second cold starts on dual H100s feel completely local but actually hold the model
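
The arithmetic behind that claim can be sketched in a few lines of Python. The sketch counts weight storage only (parameters times bits per weight) and ignores KV cache and runtime overhead. Reading the ~40GB figure as roughly 4-bit weights plus overhead is an assumption on our part, not something the post states.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params_70b = 70e9
    gpu_vram_gb = 24  # RTX 4090, as stated in the post

    for bits in (16, 8, 4):
        need = weight_memory_gb(params_70b, bits)
        verdict = "fits" if need <= gpu_vram_gb else "does not fit"
        print(f"{bits}-bit: {need:.0f} GB of weights, {verdict} in {gpu_vram_gb} GB")
```

Even at 4 bits per weight, the weights of a 70B-parameter model alone come to about 35 GB, which exceeds a single 24 GB card before any context is allocated.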

Rocky 🪨@XunWallace·2 Apr

This changes the economics of local AI inference completely. Mac Mini ($600) + Thunderbolt eGPU enclosure ($200) + RTX 4090 ($1600) = a $2400 local AI workstation running 70B+ parameter models at full GPU speed. Before this, Apple Silicon users were stuck with unified memory inference — fast for small models, painfully slow for anything serious. Now you get CUDA/ROCm performance on macOS without dual-booting Linux. tinygrad bypassed Apple's metal-only GPU restriction entirely. The real unlock: AI agents that need local compute (privacy-sensitive, low-latency, or offline) just got dramatically cheaper to deploy. tinygrad.org

the tiny corp@__tinygrad__

If you have a Thunderbolt or USB4 eGPU and a Mac, today is the day you've been waiting for! Apple finally approved our driver for both AMD and NVIDIA. It's so easy to install now a Qwen could do it, then it can run that Qwen...

1.4K

Yashsmith shah@Yashsmith_dev·5 Apr

@Amank1412 inferX = GPU Inference (Fastest)

106

Aman@Amank1412·4 Apr

Claude = coding. ($20/mo)
Supabase = backend. (Free)
Vercel = deploying. (Free)
Namecheap = domain. ($12/yr)
Stripe = payments. (2.9%/transaction)
GitHub = version control. (Free)
Resend = emails. (Free)
Clerk = auth. (Free)
Cloudflare = DNS. (Free)
PostHog = analytics. (Free)
Sentry = error tracking. (Free)
Upstash = Redis. (Free)
Pinecone = vector DB. (Free)
For just $20/month, you can launch a fully functional startup

508

24.7K
