Yashsmith shah

332 posts


@Yashsmith_dev

Just an engineer trying to turn curiosity into code | 8x hackathon winner incl NVIDIA AI hack | Tech club chair | currently in the GenAI & startup rabbit hole

San Francisco, CA · Joined November 2021
499 Following · 74 Followers
Ahmad @TheAhmadOsman
I don’t have enough GPUs
70 replies · 7 reposts · 211 likes · 10K views
Yashsmith shah @Yashsmith_dev
@AlexEngineerAI Qwen distilled by Opus 4.6, served by @InferXai, truly is the best cost-to-quality ratio out there. Also got $30 worth of free credits.
0 replies · 0 reposts · 2 likes · 54 views
Alex the Engineer @AlexEngineerAI
If you could only use one open source model for the rest of 2026, which one are you picking? I'm torn between Qwen3 and the new Gemma 4. Drop your stack in the replies. 👇
19 replies · 1 repost · 17 likes · 2K views
Yashsmith shah @Yashsmith_dev
@thdxr The math is right, but it ignores utilization. Step 2 falls apart if your GPUs sit idle during 2-minute cold starts. You need instant routing to make the economics work. This is exactly why the sub-second H100 spin-ups on @InferXai actually matter for the bottom line.
0 replies · 0 reposts · 2 likes · 235 views
dax @thdxr
inference is very profitable and probably a good opportunity to understand some basic business math.

1. Companies buy long-lived assets like GPUs. These are one-time costs and the asset depreciates over time.
2. Once you own this asset, you can plug it in and produce tokens which you can sell. The cost of goods sold here can be very low and you might be making 90% margins at scale; this is why we say inference is profitable.
3. Then you also hire employees to do R&D work to improve your systems, come up with new models, and expand the business.

If you add these 3 up you end up with $0. You're not producing a profit because the business is growing and you're reinvesting it all, buying assets or doing R&D to meet demand. If it's obvious to other people the business is working, you can raise money from them to accelerate all these numbers so they max out in 5 years instead of 25. So on paper you'll be "losing money" every year, but that's because you want to lock down the opportunity before someone else does. The bigger your market is, the bigger this burn can be, because it's a function of potential. So when you see these companies losing a lot of money, it doesn't mean the whole concept of their business is broken. It's possible they misjudge and overinvest on 1+3 and will suffer some consequences, but fundamentally 2 does work.
dax@thdxr

@d4m1n I'm a bit confused why so many people say API tokens are sold at a loss. This isn't true: these models are priced at a large premium over the GPU time cost. There's potential for 90% margin depending on the model.

65 replies · 70 reposts · 1.4K likes · 145.2K views
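The three buckets in the thread above can be sketched as a toy P&L. All numbers below (GPU capex, depreciation schedule, token revenue, COGS rate, R&D spend) are hypothetical, chosen only to illustrate how a 90% gross margin and a $0 net profit coexist on the same books:

```python
# Toy model of the thread's three buckets: (1) GPU capex depreciating
# over time, (2) high-margin token sales, (3) R&D/expansion spend.
# Every number here is made up for illustration.

def yearly_pnl(gpu_capex, depreciation_years, token_revenue, cogs_rate, rnd_spend):
    """Return (gross_profit, net_profit) for one year."""
    depreciation = gpu_capex / depreciation_years        # bucket 1
    gross_profit = token_revenue * (1 - cogs_rate)       # bucket 2: ~90% margin
    net = gross_profit - depreciation - rnd_spend        # bucket 3 reinvested
    return gross_profit, net

# Hypothetical: $10M of GPUs over 5 years, $5M token revenue at 10% COGS,
# and all remaining cash poured into R&D and expansion.
gross, net = yearly_pnl(10e6, 5, 5e6, 0.10, 2.5e6)
print(f"gross profit: ${gross:,.0f}")   # inference itself is profitable...
print(f"net profit:   ${net:,.0f}")     # ...but $0 on paper after reinvestment
```

Raising outside money just scales all three inputs at once; the gross-margin line (bucket 2) stays healthy either way, which is the thread's point.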
Yashsmith shah @Yashsmith_dev
Had a great and informative meeting with @MrTalkStock! Never thought a person with more than 250k followers across all social media would be so nice as to hop on a call to guide and help. Always grateful to all the mentors.
[image]
0 replies · 0 reposts · 2 likes · 19 views
Yashsmith shah @Yashsmith_dev
@HarshithLucky3 It's really great, you can test it out on @InferXai. They give around $30 worth of free credits and 2 H100 GPUs to try it yourself and integrate it into your application.
0 replies · 0 reposts · 1 like · 27 views
Harshith @HarshithLucky3
WAIT WHAT. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is trending in the top 2 on Hugging Face. Is it a good distilled model?
[image]
38 replies · 23 reposts · 825 likes · 131.5K views
Yashsmith shah reposted
Harsh Chandramania @HChandramania
@Spotify @spotifyindia a solution to this would be really helpful. Would appreciate if any of my followers here could repost this. P.S. had to post it as images because X won't just let me type the whole thing :(
[image] [image]
0 replies · 2 reposts · 1 like · 42 views
Yashsmith shah @Yashsmith_dev
@outsource_ Fitting weights in 16GB is cool, but agent loops require huge context windows that will instantly OOM that setup. The quantization hit is not worth it. I just use this exact Qwen variant on @InferXai: one click for the uncompromised version.
0 replies · 0 reposts · 3 likes · 14 views
Eric ⚡️ Building... @outsource_
🚀 Imagine running Claude 4.6 Opus-level reasoning... but entirely on your own GPU with just 16GB VRAM. This 27B Qwen3.5 variant, distilled on Claude 4.6 Opus reasoning traces, delivers frontier coding power locally. It's beating Claude Sonnet 4.5 on SWE-bench in 4-bit quantization (Q4_K_M) while slashing chain-of-thought bloat by 24%.
✅ Retains 96.91% HumanEval accuracy
✅ Perfect for agentic coding loops (no API costs or latency)
300K+ downloads on HF. Link below 👇🏻
[image]
100 replies · 289 reposts · 3.4K likes · 277.3K views
Yashsmith shah @Yashsmith_dev
@XunWallace The math does not work. A 4090 has 24GB VRAM, but a 70B model needs ~40GB. You will still bottleneck offloading to Mac RAM. This is why I use @InferXai now: sub-second cold starts on dual H100s feel completely local but actually hold the model.
0 replies · 0 reposts · 1 like · 43 views
Rocky 🪨 @XunWallace
This changes the economics of local AI inference completely. Mac Mini ($600) + Thunderbolt eGPU enclosure ($200) + RTX 4090 ($1600) = a $2400 local AI workstation running 70B+ parameter models at full GPU speed.

Before this, Apple Silicon users were stuck with unified memory inference: fast for small models, painfully slow for anything serious. Now you get CUDA/ROCm performance on macOS without dual-booting Linux. tinygrad bypassed Apple's Metal-only GPU restriction entirely.

The real unlock: AI agents that need local compute (privacy-sensitive, low-latency, or offline) just got dramatically cheaper to deploy. tinygrad.org
the tiny corp@__tinygrad__

If you have a Thunderbolt or USB4 eGPU and a Mac, today is the day you've been waiting for! Apple finally approved our driver for both AMD and NVIDIA. It's so easy to install now a Qwen could do it, then it can run that Qwen...

2 replies · 1 repost · 5 likes · 1.3K views
Aman @Amank1412
Claude = coding. ($20/mo)
Supabase = backend. (Free)
Vercel = deploying. (Free)
Namecheap = domain. ($12/yr)
Stripe = payments. (2.9%/transaction)
GitHub = version control. (Free)
Resend = emails. (Free)
Clerk = auth. (Free)
Cloudflare = DNS. (Free)
PostHog = analytics. (Free)
Sentry = error tracking. (Free)
Upstash = Redis. (Free)
Pinecone = vector DB. (Free)

For just $20/month, you can launch a fully functional startup
24 replies · 58 reposts · 509 likes · 24K views
Yashsmith shah @Yashsmith_dev
@joshalbrecht My exact workflow lately. Managing the agents is solved, but you need instant compute for it to feel usable. I run this kind of swarm on InferX with a Qwen 3.5 Opus 4.6 distilled model. Having 2 dedicated H100s with sub-second boot times completely changes the experience.
0 replies · 0 reposts · 0 likes · 30 views
Josh Albrecht @joshalbrecht
mngr: programmatically manage 100s of Claude Code sessions in parallel 🤖 Open source today. Lets you do things like:
— for each open GitHub issue, create a PR
— for each flaky test in the past week, fix it
— for each rule in the style guide, scan the codebase & fix all instances

Runs any agent: @claudeai, codex, @opencode, etc. Runs on any compute: locally, @modal, @Docker, or anything you can ssh into.
[GIF]
28 replies · 37 reposts · 159 likes · 28.9K views
Yashsmith shah @Yashsmith_dev
@akshat_b @modal 5 to 12 seconds is a very solid jump from a cold 2 minutes. Good find. It really puts into perspective how aggressive the optimization race is getting right now, considering InferX is already pulling off under 1 second. Fun time to be following the infra side of things.
0 replies · 0 reposts · 0 likes · 80 views
Yashsmith shah @Yashsmith_dev
9x hackathon win! Missed 1st place by a point. 1 step closer to reaching double digits.
[image]
0 replies · 0 reposts · 1 like · 32 views
Yashsmith shah @Yashsmith_dev
@manoj_ahi At this point, the "Max" just stands for Maximum Disappointment. Anyone else seeing this today?
0 replies · 0 reposts · 0 likes · 31 views
Manoj Ahirwar @manoj_ahi
usage limit over in 20 mins!!! did something change in claude code? I am on Max plan
[image]
251 replies · 69 reposts · 2.4K likes · 315.7K views
Yashsmith shah @Yashsmith_dev
@sattyyouneed So basically GitHub with a 'swipe right' for people who actually commit code
0 replies · 0 reposts · 2 likes · 14 views
Satyam @sattyyouneed
Startup idea: Tinder but for guys to find other guys to do cool projects with
298 replies · 93 reposts · 2.2K likes · 146.7K views
Yashsmith shah @Yashsmith_dev
Never thought a stranger I met on X would be more helpful than most people around me.
[image]
0 replies · 0 reposts · 4 likes · 49 views
Yashsmith shah @Yashsmith_dev
Building CUDA Code! A GPU-native coding copilot: runs local LLMs, taps cloud GPUs for inference, and optimizes code in real time 🚀 Let's connect.
1 reply · 0 reposts · 4 likes · 69 views
Yashsmith shah @Yashsmith_dev
Scheduling across time zones is 10% actual meeting and 90% clarifying if "let's do at 4" means AM or PM, IST or PST
[image]
0 replies · 1 repost · 5 likes · 60 views
Yashsmith shah @Yashsmith_dev
TurboQuant + LeWorldModel + agentic agents = a self-evolving personal physics engine on consumer GPUs.
TurboQuant kills memory limits. LeWorldModel adds real physics. Agents remove human bottlenecks.
Garage robotics & indie devs just leveled up. Laptops beating old superclusters? 👀
0 replies · 0 reposts · 3 likes · 33 views
Yashsmith shah @Yashsmith_dev
Now add agentic optimization layers: self-improving AI agents that rewrite GPU kernels better than humans. While you sleep, they evolve the stack: smarter attention, tighter compression, new shortcuts. Your laptop doesn't just run it. It wakes up better every day.
1 reply · 0 reposts · 3 likes · 33 views
Yashsmith shah @Yashsmith_dev
Just read about Google’s TurboQuant. It cuts LLM KV cache memory by 6x+ with zero accuracy loss and up to 8x faster attention. No retraining needed. My mind instantly wandered… what could this do when mixed with other fresh breakthroughs ? 👇
1 reply · 0 reposts · 3 likes · 63 views
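For scale on the 6x claim above: KV cache size is 2 (one K and one V tensor) x layers x KV heads x head dim x sequence length x bytes per element. A sketch with hypothetical Llama-2-70B-like numbers (80 layers, 8 KV heads via GQA, head dim 128); the specific config is an assumption, not anything TurboQuant publishes:

```python
# Back-of-envelope KV cache size, to see why a 6x cut matters.
# Config numbers are Llama-2-70B-like and purely illustrative.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len,
                bytes_per_elem=2, batch=1):
    # 2x for the K and V tensors stored at every layer
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch) / 1e9

base = kv_cache_gb(80, 8, 128, 32_000)   # fp16, 32K context, batch 1
print(f"fp16 KV cache: {base:.1f} GB")
print(f"after 6x cut:  {base / 6:.1f} GB")
```

Note the cache grows linearly with both context length and batch size, which is why long-context agent workloads feel compression gains first.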