keysersoze

754 posts

@Surajdotdot7

Building things with AI: Claude Code | agents | automation. Sharing what I learn — for people who want to build, not just watch

Chennai · Joined August 2019
808 Following · 105 Followers
keysersoze
keysersoze@Surajdotdot7·
@jerrod_lew The setup is the real work. Once the system is locked, you're just running output. Same thing in fashion video production — months building the pipeline, now $0.63/video at scale. The hard part is always infrastructure, not the creativity after.
0
0
0
12
Jerrod Lew
Jerrod Lew@jerrod_lew·
Creating social media carousels with Claude Design. Once you have that design system set up, it's just full creativity ahead!
13
11
184
9.1K
keysersoze
keysersoze@Surajdotdot7·
@RoundtableSpace Built video pipelines at 8k+ assets/month, fully automated. The clip/edit piece is real. The 24/7 part is where it gets hard — rate limits, quality gates, knowing when to kill a run because output degraded. Claude is good. The orchestration around it is the actual work.
0
0
0
9
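A minimal sketch of the "quality gates / kill a run when output degrades" pattern this reply describes — not the author's actual pipeline; generate and score are placeholder hooks for whatever the real system uses:

```python
from collections import deque

def run_batch(jobs, generate, score, window=50, min_quality=0.8):
    """Run jobs through `generate`; abort if rolling quality degrades."""
    recent = deque(maxlen=window)           # rolling pass/fail window
    results = []
    for job in jobs:
        output = generate(job)              # one generation call (model/video step)
        ok = score(output) >= min_quality   # project-specific QC check
        recent.append(ok)
        results.append((job, output, ok))
        # kill the run once the rolling pass rate drops below 80%
        if len(recent) == window and sum(recent) / window < 0.8:
            raise RuntimeError(f"quality gate tripped after {len(results)} jobs")
    return results
```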
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
CLAUDE CAN NOW CLIP, EDIT, SCHEDULE, AND POST CONTENT FOR YOU 24/7
18
14
200
59.8K
keysersoze
keysersoze@Surajdotdot7·
@0xSero Same pattern in production Claude pipelines. Extended thinking on structured agent steps — tool selection, JSON output, routing — adds latency with zero quality gain. Reserve it for genuine multi-step reasoning. Basic loops don't need it and it shows in the benchmarks.
0
0
0
16
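Roughly what "reserve extended thinking for genuine multi-step reasoning" looks like in practice: a small router that only requests a thinking budget on reasoning-heavy steps. The step names are made up, and the thinking-parameter shape follows Anthropic's extended-thinking option as an assumption, not a reference:

```python
# Hypothetical step names; verify the thinking config against current docs.
REASONING_STEPS = {"plan_shots", "resolve_conflicts", "debug_failed_run"}

def request_kwargs(step_name: str) -> dict:
    if step_name in REASONING_STEPS:
        # reasoning-heavy step: allow a thinking budget
        return {"model": "claude-sonnet-4-5", "max_tokens": 8192,
                "thinking": {"type": "enabled", "budget_tokens": 4096}}
    # structured step (tool selection, JSON output, routing): no thinking,
    # lower latency and cost for the same quality
    return {"model": "claude-sonnet-4-5", "max_tokens": 1024}
```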
0xSero
0xSero@0xSero·
Do you want to increase Qwen3.6-35B's performance significantly? Turn off thinking for basic agent and all coding tasks. You should try it if you have the VRAM.
0xSero tweet media
39
19
477
18.8K
keysersoze
keysersoze@Surajdotdot7·
@kimmonismus "No end to the rainbow" is the line builders should focus on. The China timeline is geopolitics. The fact that the capability ceiling keeps moving is what changes what you can ship — 6 months ago my current pipeline wasn't economically viable to build.
0
0
3
887
Chubby♨️
Chubby♨️@kimmonismus·
Dario Amodei: China will have a replica of Mythos capabilities within 12 months. He also says: “There’s no end to the rainbow. There’s just the rainbow. We don’t see anything slowing down." For anyone who doubted that China’s Mythos is lagging far behind: Dario believes the opposite!
Chubby♨️ tweet media
26
27
346
17.8K
keysersoze
keysersoze@Surajdotdot7·
@RoundtableSpace The cost math is real — we run $0.63/video vs $3k+ traditional shoots at 8k videos/month. What the tutorial skips: QC at scale, frame drift, client rejection cycles. That part doesn't fit in 16 minutes. It takes months of prod failures.
0
0
0
10
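For scale, the numbers in the reply work out roughly like this (back-of-the-envelope only, not the author's actual accounting):

```python
videos_per_month = 8_000
ai_cost_per_video = 0.63
traditional_cost_per_video = 3_000                              # low end of a traditional shoot
monthly_ai_cost = videos_per_month * ai_cost_per_video          # ≈ $5,040 / month
ratio = traditional_cost_per_video / ai_cost_per_video          # ≈ 4,760x cheaper per video
print(monthly_ai_cost, round(ratio))
```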
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
THIS GUY JUST DROPPED A 16 MIN TUTORIAL ON USING GEMINI 3.1 + SEEDANCE 2.0 TO BUILD CINEMATIC $10K WEBSITES
89
740
8.5K
639.3K
keysersoze
keysersoze@Surajdotdot7·
@birdabo 93% on simulated patients. Real patients bring incomplete histories, contradictory symptoms, weird edge cases. Every production pipeline I've built breaks exactly at that gap. That's the actual test — not MedQA.
0
0
0
36
sui ☄️
sui ☄️@birdabo·
🚨CHINA’S MEDICAL LLMs ARE NOW LIVE IN HOSPITALS. there are 42 LLM-powered doctors and nurses across 21 specialties in a hospital at Tsinghua. they ran around 10k+ simulated patients through it in just days and hit 93.06% accuracy on MedQA. this usually would take doctors years to process. and this isn’t just a research paper btw. Hainan Boao opened China’s first fully AI-native hospital recently, along with DeepSeek medical LLMs already running in 260+ real hospitals across China. while everyone else publishes benchmarks, China is treating actual patients with it. insane. China seems to be aggressively pushing medical AI in real hospitals faster than most countries.
39
121
712
41.6K
keysersoze
keysersoze@Surajdotdot7·
@RoundtableSpace 4.6 beating 4.7 on complex tasks doesn't surprise me. Running multi-step Claude Code pipelines at scale — newer version ≠ better on sustained long-context work. Benchmark categories rarely map to what actually breaks at step 40 of a 50-step agentic run.
0
0
1
27
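One way to make "breaks at step 40 of a 50-step run" visible instead of silent is to validate every step's output before it becomes the next step's input. A sketch, with all names hypothetical:

```python
def run_chain(steps, state):
    """steps: list of (name, fn, validate) tuples; all placeholders."""
    for i, (name, fn, validate) in enumerate(steps, start=1):
        state = fn(state)
        if not validate(state):
            # fail at the step that degraded, not ten steps later
            raise RuntimeError(f"step {i} ({name}) produced invalid output")
    return state
```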
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
ARENA AI LEADERBOARD: OPUS 4.7 VS 4.6
- Opus 4.7 ranks #1 or #2 in most categories (text, coding, expert, hard prompts, instruction following, creative writing)
- Opus 4.6 beats 4.7 on longer queries, complex tasks, and domain-specific areas (business, science, software)
- Community split: 4.7 stronger on short tasks but 50% more expensive and loses on hard stuff
Crazy improvements.
0xMarioNawfal tweet media
12
1
71
48.1K
keysersoze
keysersoze@Surajdotdot7·
@kimmonismus Running this math at micro scale already. Replaced what used to need a video shoot team with a $0.63/video pipeline. The payroll→infrastructure trade isn't a Meta story. It's every company running the numbers right now.
0
0
0
101
Chubby♨️
Chubby♨️@kimmonismus·
The Meta layoffs investors had been bracing for are coming, with roughly 8,000 jobs cut starting May 20, about 10% of its 79,000-person workforce. Mainly to free up billions for AI infrastructure, shifting resources from payroll to data centers, chips, and advanced models, as highlighted by Mark Zuckerberg.
Chubby♨️ tweet media
14
6
120
8.5K
keysersoze
keysersoze@Surajdotdot7·
@viktoroddy 18 mins to build it. Weeks to get it working inside an actual brand system with 5 years of design decisions already baked in. The demo is never the hard part.
1
0
2
1.1K
Viktor Oddy
Viktor Oddy@viktoroddy·
Claude Design is insane. ❤️‍🔥Just recorded an 18-min tutorial on how to build animated, award-winning websites with Claude Design + Opus 4.7!
98
480
6.1K
342.1K
keysersoze
keysersoze@Surajdotdot7·
@DavidOndrej1 Why can't they expand and get new compute from Nvidia or AWS, or is it more of a political problem?
3
0
6
1.4K
David Ondrej
David Ondrej@DavidOndrej1·
Dwarkesh was right. Anthropic is running out of compute.
31
9
526
35.1K
keysersoze
keysersoze@Surajdotdot7·
@ZypZapCommunity I’m merging the Creative Director and Video Planner into one agent to slash token usage and latency. Instead of chatting, one agent now sets the shot order (1st/last frames). This flows to JSON -> Prompt Director optimization -> Gen -> Editor for stitching. Much leaner.
1
0
0
17
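A guess at what the merged agent's JSON hand-off could look like (field names are illustrative, not the author's actual schema):

```python
shot_plan = {
    "product_id": "sku-1042",
    "shots": [
        {"order": 1, "first_frame": "hero_front.png", "last_frame": "hero_side.png",
         "duration_s": 3, "intent": "establish product"},
        {"order": 2, "first_frame": "detail_fabric.png", "last_frame": "detail_zip.png",
         "duration_s": 2, "intent": "texture close-up"},
    ],
}
# downstream: prompt director optimizes per-shot prompts -> generation -> editor stitches
```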
ZypZap
ZypZap@ZypZapCommunity·
@Surajdotdot7 That agent stack sounds wild for turning product shots into clips. Does the creative director actually decide transitions or just shot order?
1
0
1
8
keysersoze
keysersoze@Surajdotdot7·
I understand, but what I mean is I am building an agentic product that turns e-commerce catalogs (5 images) into clips and performs video editing. It has 5 specialized agents: creative director, video planner, prompt director (Kling), video editor, and QA analyst. It runs perfectly on Sonnet or Opus, but when I tried Qwen 3.5, Gemma 4, and other local models that look strong on benchmarks, they often fail to do what they should. I'm tired of seeing benchmarks that don't hold up in production-ready settings. I keep seeing high failure rates on agentic tasks, so I want to build a benchmark that tests real use cases, so we can see what is working and what isn't. The whole point of LLMs is to be useful to us, not to show a lot of percentage numbers we never use anyway. That's why I like a benchmark like @bridgemindai 's, where he tests real use cases. I'm going to build something similar, testing simple to complex agentic products so people can see where they can use a model and how it holds up.
POM@peterom·
@Surajdotdot7 There are benchmarks included for agentic tool use
1
0
2
85
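A minimal sketch of the "test with real use cases" benchmark idea: score a model on whether the agentic task actually completes, and how often it completes without retries. run_agent and validate are placeholders for the product under test:

```python
def benchmark(tasks, run_agent, validate, max_retries=2):
    rows = []
    for task in tasks:
        attempts, done = 0, False
        while attempts <= max_retries and not done:
            attempts += 1
            done = validate(run_agent(task))   # usable result, not a benchmark score
        rows.append({"task": task["name"], "completed": done, "attempts": attempts})
    n = len(rows)
    return {
        "completion_rate": sum(r["completed"] for r in rows) / n,
        "first_try_rate": sum(r["completed"] and r["attempts"] == 1 for r in rows) / n,
        "rows": rows,
    }
```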
David Ondrej
David Ondrej@DavidOndrej1·
I'm spending ±$6,000 a month on the Anthropic API. You?
David Ondrej tweet media
48
0
66
6.3K
keysersoze
keysersoze@Surajdotdot7·
@kimmonismus Production reality: evals said better, pipeline disagreed. Rolled back model versions twice at 8M Studio because benchmark improvements didn't translate to our actual workload. More tokens in adaptive thinking means nothing if the outputs regress on your specific task.
0
0
0
400
Chubby♨️
Chubby♨️@kimmonismus·
Opus 4.7 does seem to have improved, and its adaptive thinking now uses more tokens. However, compared to Opus 4.6, it still performs significantly worse.
47
12
437
22.6K
keysersoze
keysersoze@Surajdotdot7·
@KyleHessling1 If that delta holds, the inference cost story changes completely. We run pipelines processing thousands of jobs/month — model size directly maps to server cost. A 27B that punches up is worth more than a 235B that barely edges it.
0
0
1
233
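Rough first-order math behind "model size directly maps to server cost", assuming inference cost scales roughly with the parameters you have to serve (it ignores MoE routing, quantization, and batching, and the dollar figure is a placeholder):

```python
jobs_per_month = 8_000
cost_per_job_235b = 0.40                          # hypothetical $/job on the large model
cost_per_job_27b = cost_per_job_235b * 27 / 235   # ≈ $0.046 under the scaling assumption
print(jobs_per_month * cost_per_job_235b,         # ≈ $3,200 / month
      round(jobs_per_month * cost_per_job_27b))   # ≈ $368 / month
```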
Kyle Hessling
Kyle Hessling@KyleHessling1·
Am I mistaken that if the delta holds as seen between the Qwen 3.6 35B MoE and the Qwen 3.5 35B MoE, the 3.6 dense 27B will unseat Kimi k2.5 at less than 3% of the model size? Remember when we were all considering buying 2 or 4 Mac Studios in Q1 just to run REAP prunes of that model? We could soon have similar capability on a 3090. Exciting acceleration, to say the least!
Kyle Hessling tweet media
19
10
192
18K
keysersoze
keysersoze@Surajdotdot7·
@bindureddy 80% on evals. Running 8k+ images/month through an agent pipeline, you learn fast that the bottom 20% is where all the failures live — consistency on edge cases, tool use reliability, following multi-step instructions. Benchmarks don't measure that. Will test it.
0
0
0
185
Bindu Reddy
Bindu Reddy@bindureddy·
The big story that everyone missed yesterday: Qwen 3.6 dropped with 3B active params, costs nothing to run, and delivers 80% of Opus 4.7’s performance 🤯 Open source is making giant leaps
76
63
777
43.1K
keysersoze
keysersoze@Surajdotdot7·
@TeksEdge Benchmark is one signal. We run 8k+ tasks/month through the pipeline. The real test is task completion without retry loops at scale. 3x faster inference means nothing if the production error rate doubles. Running it this week to check if the ts-bench holds outside controlled evals.
2
0
6
673
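One way to weigh "faster inference" against "higher error rate": fold retries and failure handling into the expected cost per completed task. The overhead figure here is an assumption, not a measurement:

```python
def expected_seconds_per_completion(latency_s, error_rate, failure_overhead_s=120):
    # each failure costs a retry plus handling (QC review, re-queue, etc.)
    return (latency_s + error_rate * failure_overhead_s) / (1 - error_rate)

print(expected_seconds_per_completion(30, 0.05))   # baseline                 -> ~37.9 s
print(expected_seconds_per_completion(10, 0.10))   # 3x faster, errors double -> ~24.4 s
print(expected_seconds_per_completion(10, 0.40))   # 3x faster, real degrade  -> ~96.7 s
```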
keysersoze
keysersoze@Surajdotdot7·
@garrytan @LouiseDSadeleer "Too chatty" kills pipeline costs before most people realize it. Every unnecessary word is a token — at 8k+ images/month through AI pipelines, verbose outputs compound fast. Good to see this taken seriously at the product level.
0
0
0
72
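Roughly how "every unnecessary word is a token" compounds across a multi-agent chain, since each later agent re-reads earlier outputs. Volumes and prices here are placeholder assumptions:

```python
outputs_per_month = 8_000
agents_in_chain = 5                      # each later agent re-reads earlier outputs
extra_tokens_per_output = 150            # filler/preamble per response
out_price, in_price = 15.0, 3.0          # assumed $ per 1M output / input tokens
out_cost = outputs_per_month * agents_in_chain * extra_tokens_per_output * out_price / 1e6
rereads = agents_in_chain * (agents_in_chain - 1) / 2    # (writer, later reader) pairs
in_cost = outputs_per_month * rereads * extra_tokens_per_output * in_price / 1e6
print(out_cost + in_cost)                # ≈ $126 / month of pure filler, before any rework
```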
Garry Tan
Garry Tan@garrytan·
GStack is now at v1.0 General Release. If you used it before and didn't like how chatty it was, we've fixed it. Thanks @LouiseDSadeleer for the incredible feedback. I love listening to real users because it is literally the way to make them happy with new product changes.
Garry Tan tweet media
15
4
77
10.6K
keysersoze
keysersoze@Surajdotdot7·
@aftermagics Curious how consistent the output is across different products. We run 8k+ images/month through a video pipeline — the hard part isn't making one good video, it's when product #4,000 still needs the same quality.
1
0
3
4.3K
keysersoze
keysersoze@Surajdotdot7·
@pierreeliottlal 12 prompts for a one-off is impressive. Getting it repeatable at scale is a different problem — our fashion video pipeline took months to lock down parameters that work consistently across 8k+ images/month without breaking every third batch.
0
0
0
66