keysersoze

754 posts

@Surajdotdot7

Building things with AI: Claude Code | agents | automation. Sharing what I learn — for people who want to build, not just watch

Chennai · Joined August 2019
808 Following · 105 Followers
keysersoze
keysersoze@Surajdotdot7·
@jerrod_lew The setup is the real work. Once the system is locked, you're just running output. Same thing in fashion video production — months building the pipeline, now $0.63/video at scale. The hard part is always infrastructure, not the creativity after.
0
0
0
12
Jerrod Lew
Jerrod Lew@jerrod_lew·
Creating social media carousels with Claude Design. Once you have that design system set up, it's just full creativity ahead!
13
11
184
9.1K
keysersoze
keysersoze@Surajdotdot7·
@RoundtableSpace Built video pipelines at 8k+ assets/month, fully automated. The clip/edit piece is real. The 24/7 part is where it gets hard — rate limits, quality gates, knowing when to kill a run because output degraded. Claude is good. The orchestration around it is the actual work.
0
0
0
9
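A minimal sketch of the "quality gates / kill a run when output degrades" pattern this reply describes — not the author's actual pipeline; generate and score are placeholder hooks for whatever the real system uses:

```python
from collections import deque

def run_batch(jobs, generate, score, window=50, min_quality=0.8):
    """Run jobs through `generate`; abort if rolling quality degrades."""
    recent = deque(maxlen=window)           # rolling pass/fail window
    results = []
    for job in jobs:
        output = generate(job)              # one generation call (model/video step)
        ok = score(output) >= min_quality   # project-specific QC check
        recent.append(ok)
        results.append((job, output, ok))
        # kill the run once the rolling pass rate drops below 80%
        if len(recent) == window and sum(recent) / window < 0.8:
            raise RuntimeError(f"quality gate tripped after {len(results)} jobs")
    return results
```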
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
CLAUDE CAN NOW CLIP, EDIT, SCHEDULE, AND POST CONTENT FOR YOU 24/7
18
14
200
59.8K
keysersoze
keysersoze@Surajdotdot7·
@0xSero Same pattern in production Claude pipelines. Extended thinking on structured agent steps — tool selection, JSON output, routing — adds latency with zero quality gain. Reserve it for genuine multi-step reasoning. Basic loops don't need it and it shows in the benchmarks.
0
0
0
16
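Roughly what "reserve extended thinking for genuine multi-step reasoning" looks like in practice: a small router that only requests a thinking budget on reasoning-heavy steps. The step names are made up, and the thinking-parameter shape follows Anthropic's extended-thinking option as an assumption, not a reference:

```python
# Hypothetical step names; verify the thinking config against current docs.
REASONING_STEPS = {"plan_shots", "resolve_conflicts", "debug_failed_run"}

def request_kwargs(step_name: str) -> dict:
    if step_name in REASONING_STEPS:
        # reasoning-heavy step: allow a thinking budget
        return {"model": "claude-sonnet-4-5", "max_tokens": 8192,
                "thinking": {"type": "enabled", "budget_tokens": 4096}}
    # structured step (tool selection, JSON output, routing): no thinking,
    # lower latency and cost for the same quality
    return {"model": "claude-sonnet-4-5", "max_tokens": 1024}
```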
0xSero
0xSero@0xSero·
Do you want to increase Qwen3.6-35B's performance significantly? Turn off thinking for basic agent and all coding tasks. You should try it if you have the VRAM.
0xSero tweet media
39
19
477
18.8K
keysersoze
keysersoze@Surajdotdot7·
@kimmonismus "No end to the rainbow" is the line builders should focus on. The China timeline is geopolitics. The fact that the capability ceiling keeps moving is what changes what you can ship — 6 months ago my current pipeline wasn't economically viable to build.
0
0
3
887
Chubby♨️
Chubby♨️@kimmonismus·
Dario Amodei: China will have a replica of Mythos capabilities within 12 months. He also says: “There’s no end to the rainbow. There’s just the rainbow. We don’t see anything slowing down." For anyone who doubted that China’s Mythos is lagging far behind: Dario believes the opposite!
Chubby♨️ tweet media
26
27
346
17.8K
keysersoze
keysersoze@Surajdotdot7·
@RoundtableSpace The cost math is real — we run $0.63/video vs $3k+ traditional shoots at 8k videos/month. What the tutorial skips: QC at scale, frame drift, client rejection cycles. That part doesn't fit in 16 minutes. It takes months of prod failures.
0
0
0
10
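For scale, the numbers in the reply work out roughly like this (back-of-the-envelope only, not the author's actual accounting):

```python
videos_per_month = 8_000
ai_cost_per_video = 0.63
traditional_cost_per_video = 3_000                              # low end of a traditional shoot
monthly_ai_cost = videos_per_month * ai_cost_per_video          # ≈ $5,040 / month
ratio = traditional_cost_per_video / ai_cost_per_video          # ≈ 4,760x cheaper per video
print(monthly_ai_cost, round(ratio))
```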
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
THIS GUY JUST DROPPED A 16 MIN TUTORIAL ON USING GEMINI 3.1 + SEEDANCE 2.0 TO BUILD CINEMATIC $10K WEBSITES
89
740
8.5K
639.3K
keysersoze
keysersoze@Surajdotdot7·
@birdabo 93% on simulated patients. Real patients bring incomplete histories, contradictory symptoms, weird edge cases. Every production pipeline I've built breaks exactly at that gap. That's the actual test — not MedQA.
0
0
0
36
sui ☄️
sui ☄️@birdabo·
🚨CHINA’S MEDICAL LLMs ARE NOW LIVE IN HOSPITALS. there are 42 LLM-powered doctors and nurses across 21 specialties in a hospital at Tsinghua. they ran around 10k+ simulated patients through it in just days and hit 93.06% accuracy on MedQA. this usually would take doctors years to process. and this isn’t just a research paper btw. Hainan Boao opened China’s first fully AI-native hospital recently, along with DeepSeek medical LLMs already running in 260+ real hospitals across China. while everyone else publishes benchmarks, China is treating actual patients with it. insane. China seems to be aggressively pushing medical AI in real hospitals faster than most countries.
39
121
712
41.6K
keysersoze
keysersoze@Surajdotdot7·
@RoundtableSpace 4.6 beating 4.7 on complex tasks doesn't surprise me. Running multi-step Claude Code pipelines at scale — newer version ≠ better on sustained long-context work. Benchmark categories rarely map to what actually breaks at step 40 of a 50-step agentic run.
0
0
1
27
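One way to make "breaks at step 40 of a 50-step run" visible instead of silent is to validate every step's output before it becomes the next step's input. A sketch, with all names hypothetical:

```python
def run_chain(steps, state):
    """steps: list of (name, fn, validate) tuples; all placeholders."""
    for i, (name, fn, validate) in enumerate(steps, start=1):
        state = fn(state)
        if not validate(state):
            # fail at the step that degraded, not ten steps later
            raise RuntimeError(f"step {i} ({name}) produced invalid output")
    return state
```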
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
ARENA AI LEADERBOARD: OPUS 4.7 VS 4.6
- Opus 4.7 ranks #1 or #2 in most categories (text, coding, expert, hard prompts, instruction following, creative writing)
- Opus 4.6 beats 4.7 on longer queries, complex tasks, and domain-specific areas (business, science, software)
- Community split: 4.7 stronger on short tasks but 50% more expensive and loses on hard stuff
Crazy improvements.
0xMarioNawfal tweet media
12
1
71
48.1K
keysersoze
keysersoze@Surajdotdot7·
@kimmonismus Running this math at micro scale already. Replaced what used to need a video shoot team with a $0.63/video pipeline. The payroll→infrastructure trade isn't a Meta story. It's every company running the numbers right now.
0
0
0
101
Chubby♨️
Chubby♨️@kimmonismus·
The Meta layoffs investors had been bracing for are coming, with roughly 8,000 jobs cut starting May 20, about 10% of its 79,000-person workforce. Mainly to free up billions for AI infrastructure, shifting resources from payroll to data centers, chips, and advanced models, as highlighted by Mark Zuckerberg.
Chubby♨️ tweet media
14
6
120
8.5K
keysersoze
keysersoze@Surajdotdot7·
@viktoroddy 18 mins to build it. Weeks to get it working inside an actual brand system with 5 years of design decisions already baked in. The demo is never the hard part.
1
0
2
1.1K
Viktor Oddy
Viktor Oddy@viktoroddy·
Claude Design is insane. ❤️‍🔥Just recorded an 18-min tutorial on how to build animated, award-winning websites with Claude Design + Opus 4.7!
98
480
6.1K
342.1K
keysersoze
keysersoze@Surajdotdot7·
@DavidOndrej1 Why can't they expand and get new compute from Nvidia or AWS, or is it more of a political problem?
3
0
6
1.4K
David Ondrej
David Ondrej@DavidOndrej1·
Dwarkesh was right. Anthropic is running out of compute.
31
9
526
35.1K
keysersoze
keysersoze@Surajdotdot7·
@ZypZapCommunity I’m merging the Creative Director and Video Planner into one agent to slash token usage and latency. Instead of chatting, one agent now sets the shot order (1st/last frames). This flows to JSON -> Prompt Director optimization -> Gen -> Editor for stitching. Much leaner.
1
0
0
17
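A guess at what the merged agent's JSON hand-off could look like (field names are illustrative, not the author's actual schema):

```python
shot_plan = {
    "product_id": "sku-1042",
    "shots": [
        {"order": 1, "first_frame": "hero_front.png", "last_frame": "hero_side.png",
         "duration_s": 3, "intent": "establish product"},
        {"order": 2, "first_frame": "detail_fabric.png", "last_frame": "detail_zip.png",
         "duration_s": 2, "intent": "texture close-up"},
    ],
}
# downstream: prompt director optimizes per-shot prompts -> generation -> editor stitches
```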
ZypZap
ZypZap@ZypZapCommunity·
@Surajdotdot7 That agent stack sounds wild for turning product shots into clips. Does the creative director actually decide transitions or just shot order?
1
0
1
8
keysersoze
keysersoze@Surajdotdot7·
I understand, but what I mean is I am building an agentic product that turns e-commerce catalogs (5 images) into clips and performs video editing. It has 5 specialized agents: creative director, video planner, prompt director (Kling), video editor, and QA analyst. It runs perfectly on Sonnet or Opus, but when I tried Qwen 3.5, Gemma 4, and other local models that look strong on benchmarks, they often fail to do what they should. I'm tired of seeing benchmarks that don't hold up in production-ready settings. I keep seeing high failure rates on agentic tasks, so I want to build a benchmark that tests real use cases, so we can see what is working and what isn't. The whole point of LLMs is to be useful to us, not to show a lot of percentage numbers we never use anyway. That's why I like a benchmark like @bridgemindai 's, where he tests real use cases. I'm going to build something similar, testing simple to complex agentic products so people can see where they can use a model and how it holds up.
POM@peterom·
@Surajdotdot7 There are benchmarks included for agentic tool use
1
0
2
85
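A minimal sketch of the "test with real use cases" benchmark idea: score a model on whether the agentic task actually completes, and how often it completes without retries. run_agent and validate are placeholders for the product under test:

```python
def benchmark(tasks, run_agent, validate, max_retries=2):
    rows = []
    for task in tasks:
        attempts, done = 0, False
        while attempts <= max_retries and not done:
            attempts += 1
            done = validate(run_agent(task))   # usable result, not a benchmark score
        rows.append({"task": task["name"], "completed": done, "attempts": attempts})
    n = len(rows)
    return {
        "completion_rate": sum(r["completed"] for r in rows) / n,
        "first_try_rate": sum(r["completed"] and r["attempts"] == 1 for r in rows) / n,
        "rows": rows,
    }
```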
David Ondrej
David Ondrej@DavidOndrej1·
I'm spending ±$6,000 a month on the Anthropic API. You?
David Ondrej tweet media
48
0
66
6.3K
keysersoze
keysersoze@Surajdotdot7·
@kimmonismus Production reality: evals said better, pipeline disagreed. Rolled back model versions twice at 8M Studio because benchmark improvements didn't translate to our actual workload. More tokens in adaptive thinking means nothing if the outputs regress on your specific task.
0
0
0
400
Chubby♨️
Chubby♨️@kimmonismus·
Opus 4.7 does seem to have improved, and its adaptive thinking now uses more tokens. However, compared to Opus 4.6, it still performs significantly worse.
47
12
437
22.6K
keysersoze
keysersoze@Surajdotdot7·
@KyleHessling1 If that delta holds, the inference cost story changes completely. We run pipelines processing thousands of jobs/month — model size directly maps to server cost. A 27B that punches up is worth more than a 235B that barely edges it.
0
0
1
233
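Rough first-order math behind "model size directly maps to server cost", assuming inference cost scales roughly with the parameters you have to serve (it ignores MoE routing, quantization, and batching, and the dollar figure is a placeholder):

```python
jobs_per_month = 8_000
cost_per_job_235b = 0.40                          # hypothetical $/job on the large model
cost_per_job_27b = cost_per_job_235b * 27 / 235   # ≈ $0.046 under the scaling assumption
print(jobs_per_month * cost_per_job_235b,         # ≈ $3,200 / month
      round(jobs_per_month * cost_per_job_27b))   # ≈ $368 / month
```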
Kyle Hessling
Kyle Hessling@KyleHessling1·
Am I mistaken that if the delta holds as seen between the Qwen 3.6 35B MoE and the Qwen 3.5 35B MoE, the 3.6 dense 27B will unseat Kimi k2.5 at less than 3% of the model size? Remember when we were all considering buying 2 or 4 Mac Studios in Q1 just to run REAP prunes of that model? We could soon have similar capability on a 3090. Exciting acceleration, to say the least!
Kyle Hessling tweet media
19
10
192
18K
keysersoze
keysersoze@Surajdotdot7·
@bindureddy 80% on evals. Running 8k+ images/month through an agent pipeline, you learn fast that the bottom 20% is where all the failures live — consistency on edge cases, tool use reliability, following multi-step instructions. Benchmarks don't measure that. Will test it.
0
0
0
185
Bindu Reddy
Bindu Reddy@bindureddy·
The big story that everyone missed yesterday: Qwen 3.6 dropped with 3B active params, costs nothing to run, and delivers 80% of Opus 4.7’s performance 🤯 Open source is making giant leaps
76
63
777
43.1K
keysersoze
keysersoze@Surajdotdot7·
@TeksEdge Benchmark is one signal. We run 8k+ tasks/month through the pipeline. The real test is task completion without retry loops at scale. 3x faster inference means nothing if the production error rate doubles. Running it this week to check if the ts-bench holds outside controlled evals.
2
0
6
673
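One way to weigh "faster inference" against "higher error rate": fold retries and failure handling into the expected cost per completed task. The overhead figure here is an assumption, not a measurement:

```python
def expected_seconds_per_completion(latency_s, error_rate, failure_overhead_s=120):
    # each failure costs a retry plus handling (QC review, re-queue, etc.)
    return (latency_s + error_rate * failure_overhead_s) / (1 - error_rate)

print(expected_seconds_per_completion(30, 0.05))   # baseline                 -> ~37.9 s
print(expected_seconds_per_completion(10, 0.10))   # 3x faster, errors double -> ~24.4 s
print(expected_seconds_per_completion(10, 0.40))   # 3x faster, real degrade  -> ~96.7 s
```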
keysersoze
keysersoze@Surajdotdot7·
@garrytan @LouiseDSadeleer "Too chatty" kills pipeline costs before most people realize it. Every unnecessary word is a token — at 8k+ images/month through AI pipelines, verbose outputs compound fast. Good to see this taken seriously at the product level.
0
0
0
72
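Roughly how "every unnecessary word is a token" compounds across a multi-agent chain, since each later agent re-reads earlier outputs. Volumes and prices here are placeholder assumptions:

```python
outputs_per_month = 8_000
agents_in_chain = 5                      # each later agent re-reads earlier outputs
extra_tokens_per_output = 150            # filler/preamble per response
out_price, in_price = 15.0, 3.0          # assumed $ per 1M output / input tokens
out_cost = outputs_per_month * agents_in_chain * extra_tokens_per_output * out_price / 1e6
rereads = agents_in_chain * (agents_in_chain - 1) / 2    # (writer, later reader) pairs
in_cost = outputs_per_month * rereads * extra_tokens_per_output * in_price / 1e6
print(out_cost + in_cost)                # ≈ $126 / month of pure filler, before any rework
```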
Garry Tan
Garry Tan@garrytan·
GStack is now at v1.0 General Release. If you used it before and didn't like how chatty it was, we've fixed it. Thanks @LouiseDSadeleer for the incredible feedback. I love listening to real users because it is literally the way to make them happy with new product changes.
Garry Tan tweet media
15
4
77
10.6K
keysersoze
keysersoze@Surajdotdot7·
@aftermagics Curious how consistent the output is across different products. We run 8k+ images/month through a video pipeline — the hard part isn't making one good video, it's when product #4,000 still needs the same quality.
1
0
3
4.3K
keysersoze
keysersoze@Surajdotdot7·
@pierreeliottlal 12 prompts for a one-off is impressive. Getting it repeatable at scale is a different problem — our fashion video pipeline took months to lock down parameters that work consistently across 8k+ images/month without breaking every third batch.
0
0
0
66