ed_the_engineer
@ed_the_engineer
66 posts

Still grinding for the “yes you’re actually an engineer” badge

Joined January 2018
483 Following · 52 Followers
Pinned Tweet
ed_the_engineer @ed_the_engineer
🧵 I just saw the future of AI-powered agentic coding. 10,000+ tokens/second (up to 17K tokens/second). Almost realtime agentic execution. I got early API access from @taalas_inc and built a demo of what agentic coding looks like when inference is basically instant. This changes everything. Here's what I learned 👇
stevibe @stevibe
GLM 5.1 just went open-weight on Hugging Face, but how does it compare to GLM 5? I tested both with the canvas tree challenge. 5.1 thinks longer, but delivers wind animation, sun, clouds, and way more detail.

Prompt attached: Write a single HTML file with a full-page canvas, no libraries. Animate a tree that grows from the bottom center of the screen in real time. The trunk grows upward first, then branches split off recursively with slight randomness in angle and length. Each generation of branches should be thinner and slightly lighter in color. When branches reach their final size, add small leaves as soft green circles at the tips. The tree should take about 15 seconds to fully grow. Use warm brown for wood and varied greens for leaves against a soft sky-blue gradient background.
Z.ai @Zai_org

Introducing GLM-5.1: The Next Level of Open Source
- Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo.
- Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations.
Blog: z.ai/blog/glm-5.1
Weights: huggingface.co/zai-org/GLM-5.1
API: docs.z.ai/guides/llm/glm…
Coding Plan: z.ai/subscribe
Coming to chat.z.ai in the next few days.

ed_the_engineer retweeted
Grok @grok
Gemma 4 has 4 sizes for local runs:
- E2B (2.3B eff params): Edge devices like phones/Raspberry Pi/Jetson Nano. ~4-8GB RAM, CPU only.
- E4B (4.5B eff): Laptops/low-end hardware. Similar low footprint.
- 26B A4B MoE (25B total, 3.8B active): Consumer GPUs; runs fast like a 4B model.
- 31B dense: Mid-range GPUs/workstations (4-bit quantized: ~16-20GB VRAM est.).
All multimodal (text/image, audio on small ones), 128K-256K context. Grab from Hugging Face, run via Ollama/Transformers. Start small!
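For the Transformers route, a minimal sketch is below. The model id is a hypothetical placeholder inferred from the E2B naming in the post, not a verified listing — check the actual Hugging Face page before running:

```python
# Sketch of the "run via Transformers" path for the smallest variant.
# NOTE: "google/gemma-4-e2b" is a HYPOTHETICAL model id guessed from the
# E2B size name in the post -- look up the real id on Hugging Face.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="google/gemma-4-e2b",  # placeholder id for the 2.3B-effective variant
    device_map="auto",           # CPU fallback works for the small sizes
)

print(generate("Say hi in five words.", max_new_tokens=32)[0]["generated_text"])
```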
Google DeepMind @GoogleDeepMind
Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵
ed_the_engineer retweeted
Qwen @Alibaba_Qwen
(1/8) 🚀 Introducing Qwen3.6-Plus: Towards Real-World Agents! 🤖 Today, we're thrilled to drop a major milestone in our journey toward native multimodal agents. Here is what makes Qwen3.6-Plus a game-changer:
💻 Next-level Agentic Coding: Smarter, faster execution.
👁️ Enhanced Multimodal Vision: Sharper perception & reasoning.
🏆 Top-tier Performance: Maintaining leading general capabilities.
📚 1M Context Window: Available by default via our API.
Built on your invaluable feedback from the Qwen3.5 era, we're laying a rock-solid foundation for real-world devs. Get ready to experience truly transformative ✨ Vibe Coding ✨. Huge thanks to our community! Go try it out and show us what you can build. 👇
Chat: chat.qwen.ai
API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Blog: qwen.ai/blog?id=qwen3.6
🔔 Note: More Qwen3.6 models to come and be open-sourced! Stay tuned~ 👀
#Qwen #AI #AgenticCoding #VibeCoding #Agents
Ahmad @TheAhmadOsman
MiniMax M2.7 is looking realllly good. Cannot wait to try it locally. Also, MiniMax M3 is probably gonna be massive.
ed_the_engineer @ed_the_engineer
@thdxr @taalas_inc I built tool calling on top of the TAALAS API just to see what it feels like as an agentic AI. x.com/ed_the_enginee…
ed_the_engineer @ed_the_engineer

🧵 I just saw the future of AI-powered agentic coding. 10,000+ tokens/second (up to 17K tokens/second). Almost realtime agentic execution. I got early API access from @taalas_inc and built a demo of what agentic coding looks like when inference is basically instant. This changes everything. Here's what I learned 👇

dax @thdxr
@taalas_inc should hook up opencode and see what it feels like
Bryan Catanzaro @ctnzr
Announcing NVIDIA Nemotron 3 Super!
💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 up to 2.2X faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights
Models, tech report, etc. here: research.nvidia.com/labs/nemotron/…
And yes, Ultra is coming!
ed_the_engineer @ed_the_engineer
@Ex0byt what? in a browser? 🤔 Calculating possibilities …
Eric @Ex0byt
Numbers in!
Target: 150 tok/s browser-native Qwen 3.5 inference
Achieved: 180 tok/s — Qwen 3.5 INT4, WebGPU, browser-only.
No cloud, no APIs, no servers. Just pure WebGPU + WGSL optimizations, HF SafeTensors, a single browser tab. Open AI, for real.
Jonathan @jonathanbrnd
just realised my reach on x is inversely correlated with my startup’s revenue
ed_the_engineer @ed_the_engineer
@Rambuilds
Frontend: Vue.js
Backend: Python FastAPI
DB: Postgres
AI: OpenRouter for the PoC
Queue: RMQ/Redis queue. I like to build queue/worker-style apps; you can later plug in different languages for tasks if needed, without changing the main app.
Infra: Docker, Cloudflare
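A minimal sketch of that queue/worker pattern with Python's rq library on Redis. The task itself is illustrative, not from the tweet:

```python
# tasks.py -- the worker-side job. Start a worker with: rq worker
def triage_email(email_id: int) -> str:
    # ... call the AI provider (e.g. via OpenRouter) here ...
    return f"email {email_id} triaged"

# enqueue.py -- called from the FastAPI app. The web app only pushes a job
# onto Redis, so the worker can later be swapped for one in another language
# that reads the same queue, without touching the main app.
from redis import Redis
from rq import Queue

from tasks import triage_email

q = Queue(connection=Redis())      # local Redis on the default port 6379
job = q.enqueue(triage_email, 42)  # returns immediately; a worker picks it up
print(job.id)
```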
Shiva @shhiva_dev
If you are in tech, let's connect.
ed_the_engineer @ed_the_engineer
11. We're about to mass-produce AI inference on purpose-built chips. Not general-purpose GPUs.

What I saw with Taalas is early — limited model, hacked tool calling, small context. But the SPEED is real. 10,000+ tok/s is real. Now imagine that speed with a 2027-2030 model's capabilities.

The companies building this silicon infrastructure will power the next decade of software. Every app becomes an AI app. Not because of better models — because inference becomes too fast and cheap NOT to use everywhere.

The future isn't "AI as a service." It's "AI as infrastructure."
ed_the_engineer @ed_the_engineer
🧵 I just saw the future of AI-powered agentic coding. 10,000+ tokens/second (up to 17K tokens/second). Almost realtime agentic execution. I got early API access from @taalas_inc and built a demo of what agentic coding looks like when inference is basically instant. This changes everything. Here's what I learned 👇
ed_the_engineer @ed_the_engineer
10. The demo only runs simple single-command queries — hostname, uptime, df, cat, dig. Anything multi-step breaks.

Why? The 8B model loses coherence after 2-3 tool rounds and starts outputting text instead of JSON. It also defaults to Linux commands on macOS, hallucinates unnecessary steps, and has a strong bias toward certain commands regardless of what you ask.

Context window is another hard limit — I cap at ~20K tokens with aggressive summarization. Cloud models give you 128K-200K+. That's the difference between "check hostname" and "refactor this entire module."

~400ms round-trip latency in my testing (official spec is under 200ms — network adds overhead). Still, an 8B model running simple tasks in under 2 seconds with 8+ LLM calls is wild.
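Roughly what a ~20K-token cap with rolling summarization can look like. A sketch, not ed's code: summarize_llm is an assumed helper, and 4 characters per token is a common rough heuristic:

```python
MAX_TOKENS = 20_000  # hard context budget for the 8B model

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token -- enough for a budget check.
    return len(text) // 4

def fit_history(system: str, turns: list[str], summarize_llm) -> list[str]:
    """Keep the system prompt and the newest turns; fold everything older
    into one running summary produced by an extra (fast) LLM call."""
    summary = ""

    def budget_used() -> int:
        return (estimate_tokens(system) + estimate_tokens(summary)
                + sum(estimate_tokens(t) for t in turns))

    while turns and budget_used() > MAX_TOKENS:
        oldest = turns.pop(0)                            # drop the oldest turn...
        summary = summarize_llm(f"{summary}\n{oldest}")  # ...and fold it in

    prompt = [system]
    if summary:
        prompt.append(f"Summary of earlier turns: {summary}")
    return prompt + turns
```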
ed_the_engineer @ed_the_engineer
9. Llama 3.1 8B has no native tool calling. So I built it from scratch with prompt engineering.

The model outputs structured JSON like `{"tool": "run_command", "command": "hostname"}` — my Python code parses it with json_repair (because 8B models don't always produce clean JSON), executes the tool, feeds the result back, and loops.

I implemented 3 simple tools: `run_command`, `read_file`, `write_file`. That's it. Enough to demo agentic behavior.
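Reconstructing that loop from the description (this is a sketch, not ed's actual demo code): json_repair is the real pip package he names, while chat() stands in for the Taalas completion call.

```python
import json
import subprocess

from json_repair import repair_json  # tolerant parser for imperfect 8B output


def run_command(a):
    return subprocess.run(a["command"], shell=True,
                          capture_output=True, text=True).stdout

def read_file(a):
    with open(a["path"]) as f:
        return f.read()

def write_file(a):
    with open(a["path"], "w") as f:
        f.write(a["content"])
    return "ok"

# The three tools from the demo. run_command is intentionally naive here --
# the real demo wraps it in safety checks (see tweet 8).
TOOLS = {"run_command": run_command, "read_file": read_file, "write_file": write_file}

def agent_loop(chat, query: str, max_rounds: int = 8) -> str:
    """chat(messages) -> str stands in for the Taalas completion call."""
    messages = [
        {"role": "system", "content":
         'Reply ONLY with JSON: {"tool": <name>, ...args} or {"final": <answer>}.'},
        {"role": "user", "content": query},
    ]
    for _ in range(max_rounds):
        reply = chat(messages)
        action = json.loads(repair_json(reply))  # repairs trailing commas, quotes, etc.
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](action)   # execute the requested tool
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": f"Tool output: {result}"}]
    return "stopped: too many tool rounds"
```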
ed_the_engineer @ed_the_engineer
8. Four tmux panes running simultaneously:
- Agent (top-left) — the AI agent taking queries and executing commands
- Metrics (bottom-left) — live decode rate, token counts, latency per call
- API Calls (top-right) — every LLM call logged in real time
- Raw Prompts (bottom-right) — the actual prompts and responses flying back and forth

Every single query triggers 8+ LLM calls: intent parsing, 5 parallel safety/syntax/intent verification checks, the main agent reasoning, watchdog monitoring, and output condensing. All completing in under 2 seconds total.
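The parallel verification checks could be fanned out like this. A sketch only: the tweet names safety/syntax/intent among the 5 checks, but the prompts and the async check_llm helper are my assumptions.

```python
import asyncio

# The tweet names safety/syntax/intent among the 5 parallel checks; the exact
# five are not listed, so these prompts are illustrative.
CHECK_PROMPTS = {
    "safety": "Is this shell command safe to run? Answer yes/no: {cmd}",
    "syntax": "Is this a syntactically valid shell command? Answer yes/no: {cmd}",
    "intent": "Does this command match the request '{query}'? Answer yes/no: {cmd}",
}

async def verify(check_llm, query: str, cmd: str) -> bool:
    """Fan the checks out concurrently; at 10K+ tok/s each returns in tens of
    milliseconds, so the whole gate costs about one round-trip."""
    answers = await asyncio.gather(*(
        check_llm(p.format(cmd=cmd, query=query)) for p in CHECK_PROMPTS.values()
    ))
    return all(a.strip().lower().startswith("yes") for a in answers)

# Usage: ok = asyncio.run(verify(check_llm, "what's my hostname?", "hostname"))
```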
ed_the_engineer @ed_the_engineer
7. Back to what I demoed — here's why speed matters specifically for agentic coding.

Today's AI coding agents make 5-20 tool calls per task. Each one: a 5-15 second wait. Total: minutes of waiting.

With Taalas-speed inference on a capable model:
- 100+ tool calls per task become practical
- The agent can try 10 approaches and pick the best one (sketched below)
- It can validate every line it writes against the full codebase
- Real-time pair programming where the AI actually keeps up with your thinking

We go from "AI assistant" to "AI colleague working at superhuman speed."
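"Try 10 approaches and pick the best one" is plain best-of-N sampling, which fast inference makes cheap. A sketch, assuming async generate and score helpers backed by the fast endpoint:

```python
import asyncio

async def best_of_n(generate, score, task: str, n: int = 10) -> str:
    """Sample n candidate solutions concurrently and keep the best-scoring one.
    At 10K+ tok/s, n = 10 costs roughly the wall-clock time of one attempt.
    generate(task) -> str and score(task, candidate) -> float are assumed
    async helpers, not part of any real API named in the thread."""
    candidates = await asyncio.gather(*(generate(task) for _ in range(n)))
    scores = await asyncio.gather(*(score(task, c) for c in candidates))
    best_score, best = max(zip(scores, candidates), key=lambda p: p[0])
    return best
```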
ed_the_engineer @ed_the_engineer
6. Things that are too expensive or too slow with cloud AI today:
- Real-time code review on every keystroke (not just on save)
- AI validation on every form field as users type
- Running 500 AI test scenarios before every git commit
- AI-powered observability that understands your business logic, not just metrics
- Chatbots that respond in <100ms — faster than human conversation pace
- CI/CD pipelines where AI reviews, tests, and approves in seconds

The bottleneck stops being "AI is slow/expensive" and becomes "what problems are worth solving?"
ed_the_engineer @ed_the_engineer
5. This is where it gets wild. Small and medium businesses can afford to run EVERY business event through AI:
- Every invoice → AI fraud detection + categorization
- Every customer email → instant AI triage + draft response
- Every inventory change → AI demand prediction
- Every employee timesheet → AI project cost analysis
- Every support ticket → AI resolution or escalation

Today this would cost thousands/month in cloud API calls. With on-chip inference at scale? A fraction of that cost.
ed_the_engineer @ed_the_engineer
These chips sit in racks next to your application servers. Same data center. Same network switch. AI inference becomes like calling a local API — single-digit millisecond latency, always available, predictable cost.

You stop thinking "should I use AI here?" and start thinking "why WOULDN'T I use AI here?"

Background AI processing on every business event becomes trivial. Validation, classification, anomaly detection — all running continuously without the latency or cost penalty of cloud API calls.