Yashsmith shah

343 posts

@Yashsmith_dev

Just an engineer trying to turn curiosity into code | 8x hackathon winner incl NVIDIA AI hack | Tech club chair | currently in the GenAI & startup rabbit hole

San Francisco, CA · Joined November 2021
568 Following · 76 Followers
Yashsmith shah@Yashsmith_dev·
@sama how will they pay for the increasing model prices? will it be free? 😁
Sam Altman@sama·
i'm hopeful for a future where people who want to work really hard have incredibly fulfilling things to do, and people who don't want to work hard don't have to and can still have an amazing life of prosperity.
Sam Altman@sama·
we want to build tools to augment and elevate people, not entities to replace them.
Yashsmith shah@Yashsmith_dev·
@ShubhamInTech love the concept!!! if you want to power this with open source models and skip the cold start problems, check out @InferX: any model, sub-second wake times, zero infra stress
shubham@ShubhamInTech·
Introducing Agnost AI. The first infrastructure that lets your AI agents learn from every user. It catches every raw intent, connects it back to what your agent did wrong, and improves the agent. Autonomously. Because with AI agents, there's no complaint, just churn.
Yashsmith shah@Yashsmith_dev·
@AnhPhuNguyen1 if you don't want to manage infra and don't want cold starts (most important for your product), you can always look at @InferXai: sub-second cold starts, any model integration you ask for
AnhPhu Nguyen@AnhPhuNguyen1·
with Mira, AI can now live on your face. capture every conversation. create the most personalized form of AI ever. order now.
Yashsmith shah@Yashsmith_dev·
@im_roy_lee we at @InferXai have the infra to serve models that hold those insane contexts with sub-second cold starts, exactly for your use case
Roy@im_roy_lee·
we tried to build this ourselves at cluely, but it's a hard problem to solve
fundamentally the problem is with the models
for proactive ai glasses (or even just proactive ai), you need
> sub 300ms response time
> continuous visual scene understanding
> super high precision on "when to help"
> insane context window
> ultra low-friction output (a text blob is high friction)
and like 10 other core features that just don't exist yet
impressed that this team does not stop and are constantly finding workarounds to try and win an obvious future form factor
AnhPhu Nguyen@AnhPhuNguyen1

with Mira, AI can now live on your face. capture every conversation. create the most personalized form of AI ever. order now.

Yashsmith shah@Yashsmith_dev·
@Techmeme @dinabass throwing $615M at silicon proves how deep the pain point is. the next massive leap will be redefining the inference architecture itself. we at @InferXai are already on it.
Yashsmith shah@Yashsmith_dev·
@ycombinator @sdianahu building new silicon for a serving and runtime issue is overkill. inference is not a chip problem, it is a system problem. agents need smarter scheduling and state management at the infra level to fix that 30-40% utilization gap. that is exactly what we are solving at @InferXai
Y Combinator@ycombinator·
Inference Chips for Agent Workflows @sdianahu Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit 30–40% utilization as a result. That gap is where purpose-built silicon wins.
Y Combinator@ycombinator·
AI has stopped being a feature and started being the foundation. We're excited about a new wave of startups rebuilding software, services, and silicon— and pushing AI into the physical world. ycombinator.com/rfs
Yashsmith shah@Yashsmith_dev·
@arankomatsuzaki 100%. building custom silicon for a serving runtime issue is wild. inference is a system problem, not a chip problem. that is exactly the layer we are building at @InferXai
Aran Komatsuzaki@arankomatsuzaki·
This feels like confusing a serving-runtime problem for a chip-startup opportunity. Agents do change inference patterns: loops, tool calls, branching, long context, KV reuse, burstiness. But most of that is an inference systems problem: scheduling, routing, KV-cache management, etc. Think Dynamo. By the time a new chip co tapes out + builds a compiler stack + wins cloud distribution, NVIDIA/AMD will likely have baked the obvious hardware-level optimizations into existing platforms.
Y Combinator@ycombinator

Inference Chips for Agent Workflows @sdianahu Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit 30–40% utilization as a result. That gap is where purpose-built silicon wins.
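One way to picture the "routing plus KV-cache management" layer mentioned in this thread is a router that sends each agent step to the worker already holding the longest matching prompt prefix. The sketch below is only an illustration under that assumption: `pick_worker`, the worker names, and the token lists are hypothetical, and it is not the algorithm of Dynamo, InferX, or any other system named here.

```python
from typing import Dict, List, Optional, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(
    prompt: Sequence[int],
    cached_prefixes: Dict[str, List[Sequence[int]]],
) -> Optional[str]:
    """Return the worker whose cached prefixes overlap the prompt the most.

    Ties go to the alphabetically first worker name; returns None when
    there are no workers at all.
    """
    best_worker: Optional[str] = None
    best_len = -1
    for worker in sorted(cached_prefixes):
        longest = max(
            (shared_prefix_len(prompt, p) for p in cached_prefixes[worker]),
            default=0,
        )
        if longest > best_len:
            best_worker, best_len = worker, longest
    return best_worker


# Hypothetical state: two workers holding KV cache from earlier agent steps.
workers = {
    "gpu-a": [[1, 2, 3]],
    "gpu-b": [[1, 2, 3, 4, 5]],
}
chosen = pick_worker([1, 2, 3, 4, 9], workers)
```

Here `chosen` is `"gpu-b"`, because four of the new prompt's tokens are already cached there versus three on `gpu-a`. A real scheduler would also weigh queue depth and memory pressure, which this sketch leaves out.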

Yashsmith shah@Yashsmith_dev·
@0xSero @badlogicgames metrics and logs are crucial, but the bugs definitely aren't. @InferXai handles the serverless GPU inference smoothly so you can skip the setup headaches. might be a solid alternative for your workflows
0xSero@0xSero·
once vllm-studio is less stop it should do just this, I mean it works for me but onboarding is non-existent and it has tons of bugs from months of vibecoding. It works though, this is how I run all my inference workflows.
- Recipes (cached settings)
- Usage metrics
- Server logs
- GPU and inference data
Mario Zechner@badlogicgames·
It would be lovely if inference engines would have endpoints that advertise:
- all locally cached models and their specs (input modalities, context window size, thinking, tool calling)
- loaded model(s)
that would allow harnesses to easily enumerate things dynamically, instead of having users (or their agents) write silly config files.
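A rough sketch of what such a discovery response could look like, assuming a JSON payload with one entry per cached model plus a list of loaded ones. Every name here (`ModelSpec`, `advertise`, the field names, the example models) is hypothetical and does not describe any existing inference engine's API.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ModelSpec:
    """Hypothetical per-model record covering the specs listed in the tweet."""
    name: str
    input_modalities: List[str]
    context_window: int
    supports_thinking: bool
    supports_tool_calling: bool
    loaded: bool


def advertise(models: List[ModelSpec]) -> str:
    """Serialize cached models and the subset currently loaded as JSON."""
    payload = {
        "cached": [asdict(m) for m in models],
        "loaded": [m.name for m in models if m.loaded],
    }
    return json.dumps(payload, indent=2)


# Made-up catalog, only to show the shape of the response.
catalog = [
    ModelSpec("example-model-a", ["text"], 32768, True, True, loaded=True),
    ModelSpec("example-model-b", ["text", "image"], 8192, False, True, loaded=False),
]
response = advertise(catalog)
```

A harness could parse `response` once at startup and build its model picker from it, which is the "enumerate things dynamically" part of the request.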
Yashsmith shah@Yashsmith_dev·
@TTT_Shogun spot on. if you're looking to power that OS with open-source models on serverless GPUs, check out @InferXai. an execution layer needs zero cold starts to feel native, and we have that covered. DM me if you need any help
TTT@TTT_Shogun·
YC just made it official: AI-native companies will run on closed-loop, queryable systems. The same shift is happening one layer down. The AI-native individual needs a queryable life and an execution layer that closes the loop. That's the OS we're building. youtube.com/watch?si=tledo… @ycombinator
Yashsmith shah@Yashsmith_dev·
@AradhyeAgarwal hey, I work at @InferXai and we have already been doing serverless inference for agentic workflows. you can build on top of it, DM me if you want to partner with us
Aradhye Agarwal@AradhyeAgarwal·
I'm applying to YC with the "Inference Chips for Agentic Workflows" idea. Looking for co-founders with a background in ML systems. DMs are open.
Prashanth (Manohar) Velidandi
Saying “open source has to win” is useless. If it’s going to win, it has to be usable. Kimi K2.6 is already close to GPT-level quality. Model quality isn’t the bottleneck anymore. Access is. 3M models on @huggingface. ~1% have real inference support. What’s the point? We are talking about how we are changing that on April 28th. Live Demo • Q&A luma.com/w22edapy
Ahmad@TheAhmadOsman·
I don’t have enough GPUs
Yashsmith shah@Yashsmith_dev·
@AlexEngineerAI Qwen distilled from Opus 4.6, served on @InferXai, truly is the best cost-to-quality ratio out there. also got $30 worth of free credits
Alex the Engineer@AlexEngineerAI·
If you could only use one open source model for the rest of 2026, which one are you picking? I'm torn between Qwen3 and the new Gemma 4. Drop your stack in the replies. 👇
Yashsmith shah@Yashsmith_dev·
@thdxr The math is right, but it ignores utilization. Step 2 falls apart if your GPUs sit idle during 2-minute cold starts. You need instant routing to make the economics work. This is exactly why the sub-second H100 spin-ups on @InferXai actually matter for the bottom line.
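The utilization argument can be made concrete with a back-of-the-envelope calculation: a GPU costs the same per hour whether it is serving tokens or waiting, so the cost per token scales with one over utilization. The hourly price, throughput, and utilization figures below are made-up illustration values, not numbers from the thread or from any provider.

```python
def cost_per_million_tokens(
    gpu_usd_per_hour: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """GPU cost attributed to each million tokens actually served.

    utilization is the fraction of wall-clock time the GPU spends
    producing tokens (1.0 means never idle).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical GPU: $2.00 per hour, 1000 tokens/s while busy.
always_busy = cost_per_million_tokens(2.0, 1000, 1.0)
mostly_idle = cost_per_million_tokens(2.0, 1000, 0.35)
```

With these assumed inputs the fully utilized GPU comes to roughly $0.56 per million tokens and the 35%-utilized one to roughly $1.59, so idle time alone nearly triples the cost of goods sold.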
dax@thdxr·
inference is very profitable and probably a good opportunity to understand some basic business math
1. companies buy long lived assets like GPUs. these are one time costs and the asset depreciates over time
2. once you own this asset, you can plug it in and produce tokens which you can sell. the cost of goods sold here can be very low and you might be making 90% margins at scale, this is why we say inference is profitable
3. then you also hire employees to do r&d work to improve your systems, come up with new models, expand the business
if you add these 3 up you end up with $0. you're not producing a profit because the business is growing and you're reinvesting it all buying assets or r&d to meet demand
if it's obvious to other people the business is working, you can raise money from them to accelerate all these numbers so they max out in 5 years instead of 25
so on paper you'll be "losing money" every year but that's because you want to make sure you lock down the opportunity before someone else
the bigger your market is the bigger this burn can be because it's a function of potential
so when you see these companies losing a lot of money it doesn't mean the whole concept of their business broken
it's possible they misjudge and overinvest on 1+3 and will suffer some consequences but fundamentally 2 does work
dax@thdxr

@d4m1n i'm a bit confused why so many people say api tokens are sold at a loss this isn't true - these models are incredibly expensive compared to the gpu time cost there's potential for 90% margin depending on the model

Yashsmith shah@Yashsmith_dev·
Had a great and informative meeting with @MrTalkStock! never thought a person with more than 250k followers across all social media would be so nice to hop on a call to guide and help. always grateful to all the mentors
Yashsmith shah@Yashsmith_dev·
@HarshithLucky3 it's really great, you can test it out on @InferXai, they give around $30 worth of free credits and 2 H100 GPUs to test it out yourself and integrate into your application
Harshith@HarshithLucky3·
WAIT WHAT Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is trending in top 2 on Hugging Face. Is it a good distilled model?
Yashsmith shah retweeted
Harsh Chandramania@HChandramania·
@Spotify @spotifyindia a solution to this would be really helpful. Would appreciate if any of my followers here could repost this. P.S. had to post it as images because X won't just let me type the whole thing :(
Yashsmith shah@Yashsmith_dev·
@outsource_ Fitting weights in 16GB is cool, but agent loops require huge context windows that will instantly OOM that setup. The quantization hit is not worth it. I just use this exact Qwen variant on @InferXai. One click for the uncompromised version
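The context-window concern comes down to KV-cache arithmetic: for standard attention, every token in the window stores one key and one value vector per layer. The sketch below uses a hypothetical configuration (64 layers, 8 KV heads, head dimension 128, 16-bit cache values); these are not the real figures for the Qwen variant in question, and architectures that do not keep a full per-token cache in every layer would need a different estimate.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    tokens: int,
    bytes_per_value: int = 2,
) -> int:
    """KV-cache size for one sequence under standard (grouped-query) attention.

    The leading factor of 2 covers the key tensor plus the value tensor
    stored in each layer.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


GIB = 2 ** 30

# Hypothetical model configuration, chosen only to illustrate the scaling.
short_ctx = kv_cache_bytes(64, 8, 128, tokens=4_096) / GIB
long_ctx = kv_cache_bytes(64, 8, 128, tokens=32_768) / GIB
```

Under these assumptions a 4,096-token window needs 1 GiB of cache and a 32,768-token window needs 8 GiB, on top of the weights. Whether that overflows a 16GB card depends on the real architecture and on how much memory the quantized weights leave free.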
Eric ⚡️ Building...@outsource_·
🚀 Imagine running Claude 4.6 Opus-level reasoning... but entirely on your own GPU with just 16GB VRAM. This 27B Qwen3.5 variant, distilled on Claude 4.6 Opus reasoning traces, delivers frontier coding power locally. It’s beating Claude Sonnet 4.5 on SWE-bench in 4-bit quantization (Q4_K_M) while slashing chain-of-thought bloat by 24%. ✅ Retains 96.91% HumanEval accuracy ✅ Perfect for agentic coding loops (no API costs or latency) 300K+ downloads on HF Link below 👇🏻
Yashsmith shah@Yashsmith_dev·
@XunWallace The math does not work. A 4090 has 24GB VRAM, but a 70B model needs ~40GB. You will still bottleneck offloading to Mac RAM. This is why I use @InferXai now. Sub-second cold starts on dual H100s feel completely local but actually hold the model
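The memory claim is easy to check: weight storage is parameter count times bits per weight, divided by eight. The tweet does not say which precision its ~40GB figure assumes; 4-bit weights (35 GB) plus some runtime overhead is one plausible reading. The helper names below are hypothetical, and the estimate ignores KV cache and activations.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit, before any KV cache or activations."""
    return weight_memory_gb(params_billions, bits_per_weight) <= vram_gb


four_bit = weight_memory_gb(70, 4)      # 35.0 GB
sixteen_bit = weight_memory_gb(70, 16)  # 140.0 GB
on_4090 = fits_in_vram(70, 4, vram_gb=24)
```

Even at 4 bits per weight, a 70B model's 35 GB of weights exceed the 24GB on a single RTX 4090, so `on_4090` is `False` and some layers would have to be offloaded.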
Rocky 🪨@XunWallace·
This changes the economics of local AI inference completely. Mac Mini ($600) + Thunderbolt eGPU enclosure ($200) + RTX 4090 ($1600) = a $2400 local AI workstation running 70B+ parameter models at full GPU speed. Before this, Apple Silicon users were stuck with unified memory inference — fast for small models, painfully slow for anything serious. Now you get CUDA/ROCm performance on macOS without dual-booting Linux. tinygrad bypassed Apple's metal-only GPU restriction entirely. The real unlock: AI agents that need local compute (privacy-sensitive, low-latency, or offline) just got dramatically cheaper to deploy. tinygrad.org
the tiny corp@__tinygrad__

If you have a Thunderbolt or USB4 eGPU and a Mac, today is the day you've been waiting for! Apple finally approved our driver for both AMD and NVIDIA. It's so easy to install now a Qwen could do it, then it can run that Qwen...

Aman@Amank1412·
Claude = coding. ($20/mo)
Supabase = backend. (Free)
Vercel = deploying. (Free)
Namecheap = domain. ($12/yr)
Stripe = payments. (2.9%/transaction)
GitHub = version control. (Free)
Resend = emails. (Free)
Clerk = auth. (Free)
Cloudflare = DNS. (Free)
PostHog = analytics. (Free)
Sentry = error tracking. (Free)
Upstash = Redis. (Free)
Pinecone = vector DB. (Free)
For just $20/month, you can launch a fully functional startup