RDP
@Utpalodhi
7.8K posts

Solutions Architect and an off day Philosopher, Photographer, History Buff 🏔 Likes/RTs ≠ Endorsement

Andheri E, Mumbai · Joined July 2011
1.7K Following · 294 Followers
RDP retweeted
Ahmad @TheAhmadOsman
You don't pick an Inference Engine. You pick a Hardware Strategy, and the Engine follows.

Inference Engines Breakdown (Cheat Sheet at the bottom)

> llama.cpp
runs anywhere: CPU, GPU, Mac, weird edge boxes
best when VRAM is tight and RAM is plenty
hybrid offload, GGUF, ultimate portability
not built for serious multi-node scale

> MLX
Apple Silicon weapon
unified memory = "fits" bigger models than VRAM would allow, but also slower than GPUs
clean dev stack (Python/Swift/C++), sits on Metal (and expanding beyond)
now supports CUDA + distributed too
great for Mac-first workflows, not prod serving

> ExLlamaV2
single RTX box go brrr
EXL2 quant, fast local inference
perfect for 1/2/3/4 GPU setups (4090/3090)
not meant for clusters or non-CUDA

> ExLlamaV3
same idea, but bigger ambition
multi-GPU, MoE, EXL3 quant
consumer rigs pretending to be datacenters
still CUDA-first, still rough edges depending on model

> vLLM
default answer for prod serving
continuous batching, KV cache magic
tensor / pipeline / data parallel
runs on CUDA + ROCm (and some CPUs)
this is your "serve 100s of users" engine

> SGLang
vLLM but more systems-brained
routing, disaggregation, long-context scaling
expert parallel for MoE
built for ugly workloads at scale
lives on top of CUDA / ROCm clusters
this is infra nerd territory

> TensorRT-LLM
maximum NVIDIA performance
FP8/FP4, CUDA graphs, insane throughput
multi-node, multi-GPU, fully optimized
pure CUDA stack, zero portability

(And underneath all of it: Transformers → model architecture layer → CUDA / ROCm / TT-Metal → compute layer)

What actually happens under the hood:
> Transformers defines the model
> CUDA / ROCm executes it
> TT-Metal (if you're insane) lets you write the kernel yourself
The Inference Engine is just the orchestrator (simplified)

When running LLMs locally, the bottleneck isn't just "VRAM size". It isn't even the model. It's:
- memory bandwidth (the real limiter)
- KV cache (explodes with long context)
- interconnect (PCIe vs NVLink vs RDMA)
- scheduler quality (batching + engine design)
- runtime overhead (activations, graphs, etc)
(and your compute stack decides all of this)

P.S. Unified Memory is way slower than VRAM

Cheat Sheet / Rules of Thumb
> laptop / edge / weird hardware → llama.cpp
> Mac workflows → MLX
> 1–4 RTX GPUs → ExLlamaV2/V3
> general serving → vLLM
> complex infra / long context / MoE → SGLang
> NVIDIA max performance → TensorRT-LLM
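A rough back-of-envelope for why the KV cache "explodes with long context": a minimal TypeScript sketch, assuming a dense Llama-3-8B-like shape (32 layers, 8 KV heads via GQA, head dim 128) and FP16 KV entries; the shape is an illustrative assumption, not taken from the tweet.

```
// kvBytes = 2 (K and V) * layers * kvHeads * headDim * bytesPerElem * contextLen
function kvCacheBytes(layers: number, kvHeads: number, headDim: number,
                      contextLen: number, bytesPerElem = 2): number {
  return 2 * layers * kvHeads * headDim * bytesPerElem * contextLen;
}

const gib = (b: number) => (b / 1024 ** 3).toFixed(2) + " GiB";

// ~128 KiB of KV per token at this shape, so context length dominates:
console.log(gib(kvCacheBytes(32, 8, 128, 8_192)));   // "1.00 GiB"  at 8K context
console.log(gib(kvCacheBytes(32, 8, 128, 128_000))); // "15.63 GiB" at 128K context
```

That is on top of the weights themselves, which is why long context, rather than parameter count alone, is often what blows past VRAM.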
37 replies · 60 reposts · 748 likes · 108.7K views
RDP @Utpalodhi
@PGTruthteller @Nher_who All the opposition parties in the country are finished; God forbid the BJP folks do something bad to you.
1 reply · 0 reposts · 0 likes · 4 views
P G @PGTruthteller
@Nher_who Bro, relax a little 😆😆😆 Looks like you're more troubled than Mamt@ is 🤣🤣🤣
0 replies · 0 reposts · 12 likes · 795 views
Nehr_who? @Nher_who
Arre bsdk, the 1st election was conducted in 1952. How come 1952-2026 = 1400 years 😭
[image attached]
27 replies · 67 reposts · 623 likes · 20.5K views
RDP retweeted
Anika M @Anika_Breaths
You asked for it Bengal ….
58 replies · 260 reposts · 594 likes · 10.7K views
RDP @Utpalodhi
@nxhaaa19 Senior solution architect here, GPT 5.5 actually writes better code than Opus, if you actually know what better code looks like
0 replies · 0 reposts · 0 likes · 10 views
neha @nxhaaa19
Claude Opus 4.7 vs GPT-5.5, what nobody is telling you:
- Opus 4.7 wins on coding. It's not close.
- GPT-5.5 wins on terminal workflows and math
- Both have 1M token context windows
- Opus 4.7 sees images at 3x higher resolution
- GPT-5.5 uses 72% fewer tokens for the same task
- Opus 4.7 is better at fixing real GitHub issues
- GPT-5.5 is faster and cheaper to run at scale
They're not competing. They're optimized for different work.
107 replies · 14 reposts · 266 likes · 22.3K views
RDP retweeted
Nehr_who? @Nher_who
Honestly it's not fair to give all the credit to the EC alone. The SC allowed the state to go for elections despite the pending approval of 27 lakh voters who had reached the court. Give them credit too.
58 replies · 470 reposts · 2.2K likes · 19.4K views
RDP retweeted
Kunal Pawar @kunalpawar330
Free & Fair Elections🔥
[image attached]
25 replies · 372 reposts · 1.5K likes · 56.4K views
RDP retweeted
PunsterX @PunsterX
The only opposition remaining in India now is: Rahul Gandhi, Dhruv Rathee and Kunal Kamra.
141 replies · 787 reposts · 6.8K likes · 67.1K views
RDP retweeted
Ivan Fioravanti ᯅ @ivanfioravanti
Hermes Agent + Obsidian is a match made in heaven! I'm dumping all my info, meetings, and ideas into Hermes and I get them back well formatted, whenever I need them. This is so incredible! Thanks @NousResearch for this piece of AI magic!
12 replies · 7 reposts · 257 likes · 11.5K views
RDP retweeted
Harveen Singh Chadha @HarveenChadha
Indian IT services companies got competition they were not expecting
[image attached]
29 replies · 113 reposts · 768 likes · 36.8K views
RDP retweeted
Richard Palethorpe @jichiep
LocalVQE v1.1 is out: a tiny ~1M parameter model that does echo suppression and noise cancellation in real time. V1.1 comes with big improvements to audio quality over V1.
2 replies · 19 reposts · 156 likes · 8.3K views
RDP retweeted
Oliver Prompts @oliviscusAI
Someone just open-sourced a free alternative to SoundSource for macOS. It's called FineTune. Per-app volume sliders, 4x boost, multi-device output, 10-band EQ, and AutoEQ headphone correction... all from your menu bar. 100% Open Source.
[image attached]
6 replies · 47 reposts · 718 likes · 48.2K views
RDP retweeted
Warp @warpdotdev
Warp's agent now has an `auto (open-weights)` model. This will route you to frontier-level open weight models based on the complexity of your task. High performance at a reduced token cost compared to "genius" mode.
1 reply · 5 reposts · 33 likes · 1.7K views
RDP retweeted
AJ @ItsmeAjayKV
Qwen3.6-35B-A3B (TQ3_4S ~4bpw) on RTX 3060 (12GB) via llama.cpp-tq3 (TurboQuant):
• ~619 t/s prompt (4K ctx)
• ~60 t/s generation (128K ctx)
• fits in ~12.4GB VRAM
128K context with usable decode speed on a single 3060 is kind of wild
[image attached]
25 replies · 33 reposts · 372 likes · 27.8K views
RDP retweeted
Ahmad Awais @MrAhmadAwais
how did we make kimi k2.6 nearly beat opus 4.7

"open source models are bad at coding" is not a model issue, it's a coding agent harness problem! if you're using Claude Code to run open source models you ain't gonna make it. they don't want open models to win.

i had a weird week. the same model (deepseek v4 pro) ran our internal eval and beat opus 6/10, and kimi k2.6 was almost there at 5/10. same prompts, same checkpoints, same temperature. the only thing that moved was which upstream the gateway picked.

i think a lot of "open model bad at coding" is actually "open model on cold cache." when you say a model wins an eval, you're really saying (model + provider + cache state) wins an eval. the harness is the part that decides the second two.

context: i've been working on @CommandCodeAI's open-source path (kimi k2.6, deepseek v4 pro, glm, qwen), pushing billions of tokens through a fanout of inference providers (building the $1/mo Go plan for the Command Code agent). by the end kimi k2.6 was hitting 5/10 against opus 4.7 and deepseek v4 pro 6/10 on the harder tool-heavy slices. four small plumbing changes did most of it. none of them touched the model.

a few things i learned that feel general:

1/ the biggest single win was one http header.

closed models have prompt caching as a product. open models don't. what they have is prefix cache: the inference server keeps the last N forward passes warm in a GPU's KV memory, and a request that shares a prefix with a recent one skips re-prefilling. it's compute-time, not product-tier, and it evaporates the second your request lands on a different node.

a coding agent is pure prefix-cache exploitation. a good system prompt: ~10k tokens, constant. tool list: constant. conversation: append-only. every turn except the very first should be a near-total cache hit. should be.

what we found: through lots of r&d and data wrangling, consecutive turns of the same conversation were getting load-balanced to different gpu pods. each one had to re-prefill our ~10k-token prefix from scratch. ttft was 6-8s. the model wasn't slow. the cache was being stolen from us by the load balancer.

the fix is one line. working across several providers and their load balancing, it amounts to a soft pin: same value, same pod (best effort). we already had a stable session id in the cli. we forwarded it. ttft dropped from 6-8s to under 1s on cached turns.

and it matters for evals too: when the prefix cache is warm, the model spends its whole budget on the new tokens. cold prefill on a small open model eats latency that on opus is invisible because anthropic's fleet has product-tier caching baked in. closed models eat the cost silently. open models eat it loudly, and then we blame the model.

think of it like being in an hour-long meeting and having a coherent understanding of where we are in the conversation. if you had to re-brief yourself from notes every time you spoke, you'd be slow and forgetful. the model is the same.

at Command Code we're building the best coding agent harness for open source models (and closed too!). such a nice fix for our users: it saves them money, makes them faster, and makes the model look better for free.
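A minimal sketch of what that "one http header" soft pin could look like, assuming a fetch-based client: the header name (`x-session-affinity`), gateway URL, and request shape below are illustrative assumptions, not the thread's actual code, and the exact affinity hint is provider-specific.

```
// Hypothetical sketch: forward a stable per-conversation session id so the
// load balancer can keep routing the conversation to the same pod (best
// effort), keeping the ~10k-token prefix warm in that pod's prefix cache.
async function chatTurn(sessionId: string, messages: unknown[]): Promise<Response> {
  return fetch("https://gateway.example.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${process.env.GATEWAY_API_KEY}`,
      "x-session-affinity": sessionId, // assumed header name; same value every turn
    },
    body: JSON.stringify({ model: "kimi-k2-6", messages }),
  });
}
```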
2/ canonical model id is the load-bearing abstraction.

we route the same model, say `kimi-k2-6`, through up to three providers in priority order: providerOne (p1, lowest p50), then providerTwo (p2), then providerThree (p3). each one wants a different slug: `moonshotai/Kimi-K2-Instruct`, `moonshot/kimi-k2-6`, `@moonshot/kimi-k2-6`. each wants a different request shape (p2 wants `providerOptions.gateway`, p3 wants headers, p1 wants its own auth header). the temptation is to fork the request shape per provider all the way through the agent loop. don't.

we keep one canonical id (`kimi-k2-6`) flowing through the entire request: billing, telemetry, evals, fallback. the slug translation happens at exactly one boundary: `getProviderModelId(provider, canonicalId)`, called inside `buildOSSLanguageModel`, called at the moment we hand bytes to the sdk.

this matters most on fallback. when p1 503s mid-stream and we walk to p2, we re-apply `applyEntryOptions(params, entry, canonicalId)`: gateway options get rebuilt for the new entry, the message array stays untouched, the canonical id is unchanged. in our usage logs every kimi call is `kimi-k2-6` regardless of which gpu actually served it. evals don't lie about which model you tested. if you've ever tried to debug "did this turn go to p2 or p3," you know why this matters. the caching ergonomics in this are state-of-the-art engineering. i've had the most fun as an engineer building this.

3/ capability flags need per-provider negotiation.

we ask the gateway for `zeroDataRetention: true` and `disallowPromptTraining: true`. these are request-level, not per-upstream: the gateway is all-or-nothing on each. if any provider in our whitelist (`order`) lacks a flag, the gateway refuses with `NoNonTrainingProvidersError` and the request just dies.

the initial code path hardcoded both flags on. that broke for half our models because, e.g., novita is no-training but not zdr, and fireworks is zdr but not no-training (these change month to month; check the current capability table, not this list). the gateway refused everything.

the fix was to drop each flag independently based on whether anyone in the whitelist lacks it (`buildGatewayOptions`, same file). the request goes through with whichever guarantees the intersection of providers actually supports. you don't get to ask for a property that doesn't exist on the set you're routing over.

this is the same shape as the tool-input repair stuff from last week: when you hit a wall, the question to ask is "what's the minimal contract this set of upstreams can actually honor in common," not "how do i force my preferred contract through." not so happy with this yet, still working out the quirks; i want to make the ZDR vs cost trade-off easy for our users.

4/ the funniest bug was a thinking-mode regression.

deepseek v4 pro, multi-turn through p2, started 400ing every continuation: "the reasoning_content in the thinking mode must be passed back to the api."

what was happening: the gateway-side converter applies r1's reasoning-stripping logic to v4. r1 returns `reasoning_content` and you have to echo it back. v4 doesn't, and you don't. the converter strips it, the upstream rejects the absence, and every multi-turn dies on turn 2.

the fix lives in `getReasoningProviderOptions`: deepseek v4 pro now runs as a non-thinking model end-to-end. the converter has nothing to drop. multi-turn works. we lose reasoning, which costs us a couple of points on hard one-shot puzzles. but coding agents are not one-shot puzzles; they're 40-turn loops where turn-2-doesn't-400 matters more than chain-of-thought on turn-1.

later we figured out how to teach the open models what's wrong in their toolCalls while repairing the calls at runtime; yesterday i made a detailed thread on this, if you're curious.
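Going back to point 2, a minimal sketch of the "one canonical id, slug translation at exactly one boundary" idea; the provider names, slug table, and function bodies below are illustrative assumptions, not the actual Command Code source.

```
// One canonical id flows through billing, telemetry, evals, and fallback;
// the provider-specific slug is resolved only at the SDK boundary.
type Provider = "p1" | "p2" | "p3";

const SLUGS: Record<string, Partial<Record<Provider, string>>> = {
  "kimi-k2-6": {
    p1: "moonshotai/Kimi-K2-Instruct",
    p2: "moonshot/kimi-k2-6",
    p3: "@moonshot/kimi-k2-6",
  },
};

// The only place a slug ever appears (the thread's getProviderModelId).
function getProviderModelId(provider: Provider, canonicalId: string): string {
  const slug = SLUGS[canonicalId]?.[provider];
  if (!slug) throw new Error(`no slug for ${canonicalId} on ${provider}`);
  return slug;
}

// On fallback (p1 -> p2), only the provider changes; the messages and the
// canonical id are untouched, so logs and evals stay unambiguous.
function buildRequest(provider: Provider, canonicalId: string, messages: unknown[]) {
  return { canonicalId, model: getProviderModelId(provider, canonicalId), messages };
}
```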
zoom out: four things, none of them about the model.
- keep the prefix cache hot across a conversation, across providers, for open models (the job of a harness, not the model or provider).
- one abstraction (canonical id at the request layer, slug translation at the sdk boundary) so fallbacks are invisible to billing and evals.
- one filter (drop capability flags independently against per-upstream support sets) so the gateway doesn't refuse on a property mismatch, as sketched below.
- one workaround (disable thinking on a single provider prefix) for an upstream sdk bug that breaks multi-turn.

plus: teach the model how to fix its tool calls in deterministic ways instead of just sending it errors, which otherwise results in silent failure.

the model didn't get smarter. the harness stopped throwing away its work between turns.

a closed-model harness can be lazy about all four of these because anthropic and openai eat the cost server-side: their caching is built in, their model id is unambiguous, their capability flags are consistent, their tool contracts have been pretrained on. an open-model harness can't be lazy about any of them, and if it's lazy about one, the model "loses" an eval or vibe check it would otherwise have won.

deepseek v4 pro now beats opus 4.7 6/10 on our internal evals. kimi k2.6 hits 5/10. nothing about the weights moved.

imo if your open model is "bad at coding," in most cases you're using the wrong coding agent harness: one that doesn't care about your model and is super generic across hundreds of models, or one that only cares about the closed models with product-tier caching. you can try all these fixes, they're live.
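A sketch of that flag-dropping filter (the `buildGatewayOptions` idea from point 3), assuming a small per-provider capability table; the table values and option names are illustrative, not a current capability list.

```
// Only request a guarantee every whitelisted provider can honor; otherwise the
// gateway would refuse the whole request (NoNonTrainingProvidersError).
interface ProviderCaps { zeroDataRetention: boolean; disallowPromptTraining: boolean }

// Hypothetical snapshot; real support changes month to month.
const CAPS: Record<string, ProviderCaps> = {
  novita: { zeroDataRetention: false, disallowPromptTraining: true },
  fireworks: { zeroDataRetention: true, disallowPromptTraining: false },
};

function buildGatewayOptions(order: string[]) {
  const allSupport = (k: keyof ProviderCaps) => order.every((p) => CAPS[p]?.[k]);
  return {
    order,
    // Each flag is dropped independently when any provider in the whitelist lacks it.
    ...(allSupport("zeroDataRetention") ? { zeroDataRetention: true } : {}),
    ...(allSupport("disallowPromptTraining") ? { disallowPromptTraining: true } : {}),
  };
}

// With both providers whitelisted, neither flag survives the intersection, so
// the request still goes through instead of dying on a property mismatch.
console.log(buildGatewayOptions(["novita", "fireworks"]));
```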
Ahmad Awais @MrAhmadAwais

how did we make deepseek outperform opus 4.7?

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem.

context: spent two days looking at billions of tokens in @CommandCodeAI (the open-source ai cli) using deepseek. i ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run, every shellCommand and readFile call bouncing back with a raw zod issues blob, the model unable to recover because the error wasn't in a form it could read. by the end deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random, they're a small, finite, compositional set.

across deepseek-flash, deepseek v4 pro, glm, and qwen, the same four mistakes repeat almost exactly:
- sending `null` for an optional field instead of omitting it
- emitting `["a","b"]` as a json *string* instead of an actual array
- wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected (`"foo"` instead of `["foo"]`)

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time.

2/ the funniest failure mode is also the most revealing.

deepseek-flash, when asked to edit or write a file, sometimes emits the path as a *markdown auto-link*:

filePath: "/Users/x/proj/[notes.md](http://notes.md)"

our writeFile tool obediently tried creating files literally named `[notes.md](http://notes.md)` until we caught it.

this is not a hallucination. it's the post-training chat distribution leaking through the tool boundary: the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where the link text equals the url-without-protocol; real markdown like `[click](https://x.com)` passes through untouched. this is also conditioning on their own tools during RL, which were different from all the other tools we write and of course can't predict.

"tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level, `pathString()` instead of `z.string()`, and the leak is plugged for every path field at once.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair.

my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately: writeFile content that *happened* to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test.

then i made it less greedy:
- parse the input as-is. if it succeeds, ship it. valid inputs are never touched.
- on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies.
- parse again. on success, log `tool_input_repaired:${toolName}`. on failure, log `tool_input_invalid:${toolName}` and return a model-readable retry message.
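A compact TypeScript sketch of that validate-then-repair loop, assuming a zod schema per tool; only two of the four catalogued repairs are shown, and names like `repairInput` are illustrative rather than the actual Command Code implementation.

```
import { z } from "zod";

type Repair = (value: unknown) => unknown;

// Ordered repairs: parse a stringified JSON array before wrapping bare strings,
// so '["a","b"]' does not become ['["a","b"]'].
const repairs: Repair[] = [
  (v) => {
    if (typeof v === "string" && v.trim().startsWith("[")) {
      try { return JSON.parse(v); } catch { return v; }
    }
    return v;
  },
  (v) => (typeof v === "string" ? [v] : v), // bare string where an array was expected
];

function repairInput<S extends z.ZodTypeAny>(schema: S, raw: unknown, toolName: string) {
  const first = schema.safeParse(raw);
  if (first.success) return first.data;            // valid inputs are never touched

  const candidate: any = structuredClone(raw);
  for (const issue of first.error.issues) {
    const key = issue.path[0];                      // repair only where the schema complained
    if (key === undefined) continue;
    for (const repair of repairs) {
      const fixed = repair(candidate[key]);
      if (fixed !== candidate[key]) { candidate[key] = fixed; break; }
    }
  }

  const second = schema.safeParse(candidate);
  console.log(second.success
    ? `tool_input_repaired:${toolName}`
    : `tool_input_invalid:${toolName}`);
  return second.success ? second.data : null;       // null -> send a readable retry message
}

// Example: the model sent a stringified array where `paths` expects string[].
const readFiles = z.object({ paths: z.array(z.string()) });
repairInput(readFiles, { paths: '["a.ts","b.ts"]' }, "readFiles");
```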
the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths the schema actually disagreed at. the validator is doing the work of localizing the bug for you. it's the same shape as cheap-then-careful everywhere else: try the fast path, fall back on evidence.

(this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.)

4/ shape invariants and relational invariants need different fixes.

the four repairs above all handle shape problems: wrong type, missing key, wrong container. but read_file had a *relational* invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling `readFile({ absolutePath, limit: 30 })` and getting an `ERROR:` back. you can't fix this with input repair, because each field is independently valid; the bug is in the relationship between them.

so i taught the function the model's intent instead. `limit` alone → `offset = 0`. `offset` alone → `limit = 2000` (matches the common read-tool default). then surfaced the decision back to the model in the result: "Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit." no `Error:` prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong.

transparency over silent magic wins big. repair where you can. extend semantics where you can't. surface the choice either way.

zoom out: a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions.

four small repairs (i'm sure more will follow, as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be.

deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals. imo "skill issue" applies to the harness more often than the model.
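And a sketch of the relational default from point 4; the defaults and note wording follow the thread, while the surrounding types and function name are assumptions.

```
// readFile's relational invariant: offset and limit should travel together.
// Instead of erroring, fill the missing half and tell the model what we picked.
interface ReadFileArgs { absolutePath: string; offset?: number; limit?: number }

function normalizeReadArgs(args: ReadFileArgs): { args: Required<ReadFileArgs>; note?: string } {
  const { absolutePath } = args;
  let note: string | undefined;

  if (args.offset !== undefined && args.limit === undefined) {
    note = "Note: limit was not provided; defaulted to 2000 lines. " +
           "To read more or fewer lines, retry with both offset and limit.";
  } else if (args.limit !== undefined && args.offset === undefined) {
    note = "Note: offset was not provided; defaulted to 0.";
  }

  // limit alone -> offset = 0; offset alone -> limit = 2000 (common read-tool default).
  const offset = args.offset ?? 0;
  const limit = args.limit ?? 2000;

  // No `Error:` prefix on the note, so the TUI does not paint it red and the
  // model can self-correct on the next turn if the guess was wrong.
  return { args: { absolutePath, offset, limit }, note };
}

// deepseek's typical call from the thread: limit without offset.
console.log(normalizeReadArgs({ absolutePath: "/Users/x/proj/main.ts", limit: 30 }));
```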

12 replies · 20 reposts · 145 likes · 14.2K views
RDP retweeted
Ben Davis @davis7
This is very late, but I'm finally done with my 5.5 vid
- use low reasoning
- the name sucks
- it's fast
- best code I've ever seen a model write came from this model
- openai's new pre-training is amazing
- price looks worse than it is
- over-sensitive to every little thing in its context window
- feels wildly different compared to 5.4
- turn reasoning off. try it. turn the reasoning off. do it.
23 replies · 14 reposts · 425 likes · 70.4K views
RDP retweeted
0xSero @0xSero
I started using GPT-5.5 on low/no reasoning because of Ben and since then:
1. I can activate fast mode all day without running out of credits
2. Time to task completion is 10% of what it was with thinking
3. The model feels significantly more like a Claude model
4. Cheap AF
Quoting Ben Davis @davis7 (tweet above)
22 replies · 12 reposts · 460 likes · 42.4K views
RDP retweeted
Rahul Gandhi @RahulGandhi
Assam and Bengal are clear cases of the election being stolen by the BJP with the support of the EC. We agree with Mamata ji. More than 100 seats were stolen in Bengal. We have seen this playbook before: Madhya Pradesh. Haryana. Maharashtra. Lok Sabha 2024, etc. Election theft, institution theft - what other option is even left now!
8.4K replies · 9.4K reposts · 36.4K likes · 1.7M views
RDP retweeted
Nehr_who? @Nher_who
This is how they treat women. This is the mindset of an average orange zombie. Unfortunately the cancer has reached Bengal.
[image attached]
94 replies · 395 reposts · 2.1K likes · 33.4K views
RDP retweeted
Akshat Shrivastava @Akshat_World
We are likely to have harder times in India (economically speaking). The politicians have figured out the magic formula for winning elections, even with negative economic progress.

Economy, jobs, and living standards used to be the focal point of elections. Now, no more. There is no political pressure anymore to pursue good economics. Honestly, not many would bother even if we had de-facto direct taxation at 50%. You can come cry on Twitter. And go away :)

Poor economics = few opportunities = fewer jobs = lower wealth. The writing is on the wall. The sooner you see it, the sooner you can rework your investment strategies.
154 replies · 658 reposts · 3.5K likes · 98.6K views