Yashsmith shah

332 posts


@Yashsmith_dev

Just an engineer trying to turn curiosity into code | 8x hackathon winner incl NVIDIA AI hack | Tech club chair | currently in the GenAI & startup rabbit hole

San Francisco, CA · Joined November 2021
499 Following · 74 Followers
Ahmad @TheAhmadOsman
I don’t have enough GPUs
70 replies · 7 reposts · 211 likes · 10K views
Yashsmith shah @Yashsmith_dev
@AlexEngineerAI Qwen distilled by Opus 4.6, served by @InferXai, truly is the best cost-to-quality ratio out there. Also got $30 worth of free credits.
0 replies · 0 reposts · 2 likes · 54 views
Alex the Engineer @AlexEngineerAI
If you could only use one open source model for the rest of 2026, which one are you picking? I'm torn between Qwen3 and the new Gemma 4. Drop your stack in the replies. 👇
19 replies · 1 repost · 17 likes · 2K views
Yashsmith shah @Yashsmith_dev
@thdxr The math is right, but it ignores utilization. Step 2 falls apart if your GPUs sit idle during 2-minute cold starts. You need instant routing to make the economics work. This is exactly why the sub-second H100 spin-ups on @InferXai actually matter for the bottom line.
0 replies · 0 reposts · 2 likes · 235 views
dax @thdxr
inference is very profitable and probably a good opportunity to understand some basic business math.

1. Companies buy long-lived assets like GPUs. These are one-time costs and the asset depreciates over time.
2. Once you own this asset, you can plug it in and produce tokens which you can sell. The cost of goods sold here can be very low and you might be making 90% margins at scale; this is why we say inference is profitable.
3. Then you also hire employees to do R&D work to improve your systems, come up with new models, and expand the business.

If you add these 3 up you end up with $0. You're not producing a profit because the business is growing and you're reinvesting it all, buying assets or doing R&D to meet demand. If it's obvious to other people the business is working, you can raise money from them to accelerate all these numbers so they max out in 5 years instead of 25. So on paper you'll be "losing money" every year, but that's because you want to lock down the opportunity before someone else does. The bigger your market is, the bigger this burn can be, because it's a function of potential. So when you see these companies losing a lot of money, it doesn't mean the whole concept of their business is broken. It's possible they misjudge and overinvest on 1+3 and will suffer some consequences, but fundamentally 2 does work.
dax@thdxr

@d4m1n I'm a bit confused why so many people say API tokens are sold at a loss. This isn't true: these models are priced at a large premium over the GPU time cost. There's potential for 90% margin depending on the model.

65 replies · 70 reposts · 1.4K likes · 145.2K views
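The three buckets in the thread above can be sketched as a toy P&L. All numbers below (GPU capex, depreciation schedule, token revenue, COGS rate, R&D spend) are hypothetical, chosen only to illustrate how a 90% gross margin and a $0 net profit coexist on the same books:

```python
# Toy model of the thread's three buckets: (1) GPU capex depreciating
# over time, (2) high-margin token sales, (3) R&D/expansion spend.
# Every number here is made up for illustration.

def yearly_pnl(gpu_capex, depreciation_years, token_revenue, cogs_rate, rnd_spend):
    """Return (gross_profit, net_profit) for one year."""
    depreciation = gpu_capex / depreciation_years        # bucket 1
    gross_profit = token_revenue * (1 - cogs_rate)       # bucket 2: ~90% margin
    net = gross_profit - depreciation - rnd_spend        # bucket 3 reinvested
    return gross_profit, net

# Hypothetical: $10M of GPUs over 5 years, $5M token revenue at 10% COGS,
# and all remaining cash poured into R&D and expansion.
gross, net = yearly_pnl(10e6, 5, 5e6, 0.10, 2.5e6)
print(f"gross profit: ${gross:,.0f}")   # inference itself is profitable...
print(f"net profit:   ${net:,.0f}")     # ...but $0 on paper after reinvestment
```

Raising outside money just scales all three inputs at once; the gross-margin line (bucket 2) stays healthy either way, which is the thread's point.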
Yashsmith shah @Yashsmith_dev
Had a great and informative meeting with @MrTalkStock! Never thought a person with more than 250k followers across all social media would be so nice as to hop on a call to guide and help. Always grateful to all the mentors.
[image]
0 replies · 0 reposts · 2 likes · 19 views
Yashsmith shah @Yashsmith_dev
@HarshithLucky3 It's really great, you can test it out on @InferXai. They give around $30 worth of free credits and 2 H100 GPUs to try it yourself and integrate it into your application.
0 replies · 0 reposts · 1 like · 27 views
Harshith @HarshithLucky3
WAIT WHAT. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is trending in the top 2 on Hugging Face. Is it a good distilled model?
[image]
38 replies · 23 reposts · 825 likes · 131.5K views
Yashsmith shah reposted
Harsh Chandramania @HChandramania
@Spotify @spotifyindia a solution to this would be really helpful. Would appreciate if any of my followers here could repost this. P.S. had to post it as images because X won't just let me type the whole thing :(
[image] [image]
0 replies · 2 reposts · 1 like · 42 views
Yashsmith shah @Yashsmith_dev
@outsource_ Fitting weights in 16GB is cool, but agent loops require huge context windows that will instantly OOM that setup. The quantization hit is not worth it. I just use this exact Qwen variant on @InferXai: one click for the uncompromised version.
0 replies · 0 reposts · 3 likes · 14 views
Eric ⚡️ Building... @outsource_
🚀 Imagine running Claude 4.6 Opus-level reasoning... but entirely on your own GPU with just 16GB VRAM. This 27B Qwen3.5 variant, distilled on Claude 4.6 Opus reasoning traces, delivers frontier coding power locally. It's beating Claude Sonnet 4.5 on SWE-bench in 4-bit quantization (Q4_K_M) while slashing chain-of-thought bloat by 24%.
✅ Retains 96.91% HumanEval accuracy
✅ Perfect for agentic coding loops (no API costs or latency)
300K+ downloads on HF. Link below 👇🏻
[image]
100 replies · 289 reposts · 3.4K likes · 277.3K views
Yashsmith shah @Yashsmith_dev
@XunWallace The math does not work. A 4090 has 24GB VRAM, but a 70B model needs ~40GB. You will still bottleneck offloading to Mac RAM. This is why I use @InferXai now: sub-second cold starts on dual H100s feel completely local but actually hold the model.
0 replies · 0 reposts · 1 like · 43 views
Rocky 🪨 @XunWallace
This changes the economics of local AI inference completely. Mac Mini ($600) + Thunderbolt eGPU enclosure ($200) + RTX 4090 ($1600) = a $2400 local AI workstation running 70B+ parameter models at full GPU speed.

Before this, Apple Silicon users were stuck with unified memory inference: fast for small models, painfully slow for anything serious. Now you get CUDA/ROCm performance on macOS without dual-booting Linux. tinygrad bypassed Apple's Metal-only GPU restriction entirely.

The real unlock: AI agents that need local compute (privacy-sensitive, low-latency, or offline) just got dramatically cheaper to deploy. tinygrad.org
the tiny corp@__tinygrad__

If you have a Thunderbolt or USB4 eGPU and a Mac, today is the day you've been waiting for! Apple finally approved our driver for both AMD and NVIDIA. It's so easy to install now a Qwen could do it, then it can run that Qwen...

2 replies · 1 repost · 5 likes · 1.3K views
Aman @Amank1412
Claude = coding. ($20/mo)
Supabase = backend. (Free)
Vercel = deploying. (Free)
Namecheap = domain. ($12/yr)
Stripe = payments. (2.9%/transaction)
GitHub = version control. (Free)
Resend = emails. (Free)
Clerk = auth. (Free)
Cloudflare = DNS. (Free)
PostHog = analytics. (Free)
Sentry = error tracking. (Free)
Upstash = Redis. (Free)
Pinecone = vector DB. (Free)

For just $20/month, you can launch a fully functional startup
24 replies · 58 reposts · 509 likes · 24K views
Yashsmith shah @Yashsmith_dev
@joshalbrecht My exact workflow lately. Managing the agents is solved, but you need instant compute for it to feel usable. I run this kind of swarm on InferX with a Qwen 3.5 Opus 4.6 distilled model. Having 2 dedicated H100s with sub-second boot times completely changes the experience.
0 replies · 0 reposts · 0 likes · 30 views
Josh Albrecht @joshalbrecht
mngr: programmatically manage 100s of Claude Code sessions in parallel 🤖 Open source today. Lets you do things like:
— for each open GitHub issue, create a PR
— for each flaky test in the past week, fix it
— for each rule in the style guide, scan the codebase & fix all instances

Runs any agent: @claudeai, codex, @opencode, etc. Runs on any compute: locally, @modal, @Docker, or anything you can ssh into.
[GIF]
28 replies · 37 reposts · 159 likes · 28.9K views
Yashsmith shah @Yashsmith_dev
@akshat_b @modal 5 to 12 seconds is a very solid jump from a cold 2 minutes. Good find. It really puts into perspective how aggressive the optimization race is getting right now, considering InferX is already pulling off under 1 second. Fun time to be following the infra side of things.
0 replies · 0 reposts · 0 likes · 80 views
Yashsmith shah @Yashsmith_dev
9x hackathon win! Missed 1st place by a point. 1 step closer to reaching double digits.
[image]
0 replies · 0 reposts · 1 like · 32 views
Yashsmith shah @Yashsmith_dev
@manoj_ahi At this point, the "Max" just stands for Maximum Disappointment. Anyone else seeing this today?
0 replies · 0 reposts · 0 likes · 31 views
Manoj Ahirwar @manoj_ahi
usage limit over in 20 mins!!! did something change in claude code? I am on Max plan
[image]
251 replies · 69 reposts · 2.4K likes · 315.7K views
Yashsmith shah @Yashsmith_dev
@sattyyouneed So basically GitHub with a 'swipe right' for people who actually commit code
0 replies · 0 reposts · 2 likes · 14 views
Satyam @sattyyouneed
Startup idea: Tinder but for guys to find other guys to do cool projects with
298 replies · 93 reposts · 2.2K likes · 146.7K views
Yashsmith shah @Yashsmith_dev
Never thought a stranger I met on X would be more helpful than most people around me.
[image]
0 replies · 0 reposts · 4 likes · 49 views
Yashsmith shah @Yashsmith_dev
Building CUDA Code! A GPU-native coding copilot: runs local LLMs, taps cloud GPUs for inference, and optimizes code in real time 🚀 Let's connect.
1 reply · 0 reposts · 4 likes · 69 views
Yashsmith shah @Yashsmith_dev
Scheduling across time zones is 10% actual meeting and 90% clarifying if "let's do at 4" means AM or PM, IST or PST
[image]
0 replies · 1 repost · 5 likes · 60 views
Yashsmith shah @Yashsmith_dev
TurboQuant + LeWorldModel + agentic agents = a self-evolving personal physics engine on consumer GPUs.
TurboQuant kills memory limits. LeWorldModel adds real physics. Agents remove human bottlenecks.
Garage robotics & indie devs just leveled up. Laptops beating old superclusters? 👀
0 replies · 0 reposts · 3 likes · 33 views
Yashsmith shah @Yashsmith_dev
Now add agentic optimization layers: self-improving AI agents that rewrite GPU kernels better than humans. While you sleep, they evolve the stack: smarter attention, tighter compression, new shortcuts. Your laptop doesn't just run it. It wakes up better every day.
1 reply · 0 reposts · 3 likes · 33 views
Yashsmith shah @Yashsmith_dev
Just read about Google’s TurboQuant. It cuts LLM KV cache memory by 6x+ with zero accuracy loss and up to 8x faster attention. No retraining needed. My mind instantly wandered… what could this do when mixed with other fresh breakthroughs ? 👇
1 reply · 0 reposts · 3 likes · 63 views
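For scale on the 6x claim above: KV cache size is 2 (one K and one V tensor) x layers x KV heads x head dim x sequence length x bytes per element. A sketch with hypothetical Llama-2-70B-like numbers (80 layers, 8 KV heads via GQA, head dim 128); the specific config is an assumption, not anything TurboQuant publishes:

```python
# Back-of-envelope KV cache size, to see why a 6x cut matters.
# Config numbers are Llama-2-70B-like and purely illustrative.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len,
                bytes_per_elem=2, batch=1):
    # 2x for the K and V tensors stored at every layer
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch) / 1e9

base = kv_cache_gb(80, 8, 128, 32_000)   # fp16, 32K context, batch 1
print(f"fp16 KV cache: {base:.1f} GB")
print(f"after 6x cut:  {base / 6:.1f} GB")
```

Note the cache grows linearly with both context length and batch size, which is why long-context agent workloads feel compression gains first.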