shuyang

69 posts

@_shuyang_

tech, fashion, art. opinions my own. tell me something anonymously: https://t.co/o3U7Em6L3r

New York, NY · Joined March 2024
314 Following · 36 Followers
shuyang (@_shuyang_)
gpt-image-2 is good and all but nano banana still holds its own. (prompt: generate an image of jorge luis borges’s library of babel in the style of mc escher’s print gallery; first one is gpt and second is nano banana; 🍌 still “gets” it more)
[2 images]
0
0
0
94
shuyang (@_shuyang_)
Chinese users are paying $200 for Claude subscriptions they’re not supposed to have and calling it a “supply chain risk.” I looked into the absurd techniques they use to keep their accounts alive: open.substack.com/pub/shuyangli/…
[image]
1
0
0
32
shuyang (@_shuyang_)
I've been making my Hermes agent do all my diet and exercise tracking, and I think our relationship to computers will change very soon. I wrote about it on Substack: open.substack.com/pub/shuyangli/…
0
0
1
65
shuyang (@_shuyang_)
what timeline is this
[image]
0
0
0
42
shuyang (@_shuyang_)
✅ never triggered the claude code curse word classifier (regex)
[image]
0
0
1
48
shuyang (@_shuyang_)
i think the discourse around anthropic’s emotions paper is missing the point: how we communicate naturally implies emotion, so it’s natural for llms to model emotions. anthropic is being reasonable in labeling them “functional emotions”
0
0
0
15
shuyang (@_shuyang_)
how well can LLMs play balatro? stay tuned
[image]
0
0
1
25
shuyang (@_shuyang_)
are labs running evals on the long-term consequences of agents’ actions today? like repeated game playing, maintenance cost, complexity, etc?
1
0
0
23
shuyang (@_shuyang_)
banger book. interesting off-topic takeaway for our age - important systems need sustainment, which requires interpreting the *intention behind the system* and allowing agents to improvise when conditions change - multiple wars were won and lost on agency
[image]
0
0
0
33
shuyang (@_shuyang_)
@barrald good take! however i might bet on linear? you can't get the planning and feedback context from code / PRs, and linear can better decide whether something is even worth shipping. that being said, all i wish for is for github to actually serve git ops with 99.99% reliability...
1
0
0
26
Barry McCardel (@barrald)
People gave me shit for this but the opportunity to be the new "hub" for all the coding agent spokes is massive. If Cursor shipped cloud repos & code review there would be an absolute stampede. Meanwhile, I was feeling masochistic this morning and tried to use Github's chat 💀
[2 images]
Barry McCardel (@barrald)

The Cursor vs. Claude/Codex framing feels very flawed and misses the bigger picture.

The labs have ~infinite money and specialized talent, and are going to win on coding models – that's a runaway train. Composer is impressive, but ultimately more for margin protection / defense than playing to win that game.

But to quote Stringer Bell, "there are games beyond the game" and I believe Cursor's destiny is different: becoming the new Github – the place where the whole engineering process lives. Their real competition is with Github and Linear, not the labs.

Bugbot is a great start. We find it super valuable, no matter what coding agent is used, and it is a nice wedge into Cursor getting beyond the coding itself. And of course acquiring @graphite. PR review is the single most essential workflow in Github and very ripe for disruption – Cursor is in an amazing position for this.

More on the horizon. The cloud sandbox thing is going to be huge. The new Automations thing aims at GH Actions. And it wouldn't surprise me to see them start getting more into security, observability, etc.

Could Anthropic/OpenAI try to compete here? Sure. But I don't think customers want them to. I want my coding agent to be a coding agent and would be happy to pay for another model-agnostic system that sits across the whole menagerie and helps me manage it.

My prediction is that in a year we'll look back at the "Claude Code is great, therefore Cursor is cooked" discourse as misguided, and understand Cursor as playing a different game entirely.
4
0
10
3.4K
shuyang (@_shuyang_)
@trq212 @glawsontweets how is it not a subagent if it's backed by sonnet 4.6? (do you put a different classifier head instead of the lm head on the output of some intermediate layers?) if true, interpretability paying off!
[image]
0
0
0
35
Thariq (@trq212)
@glawsontweets that's not what this is, subagents are too expensive and take too long to run on permissions, we use a special classifier
2
0
2
487
shuyang (@_shuyang_)
@thebigmehtaphor llm engineering is a systems problem of how to manage non-deterministic components
0
0
1
23
Viraj Mehta (@thebigmehtaphor)
going through a bunch of Autopilot traces: Autopilot's adaptive routing correctly de-prioritized its own bad predictions. Even when offline eval ranked the initial variant first, the live A/B test routed traffic to the actual winner. The system corrected itself.
1
0
2
113
shuyang reposted
TensorZero (@TensorZero)
Can an automated AI engineer autonomously debug and optimize an LLM pipeline in 5 minutes? Last night, ours did: it cut errors in ~half during its first live demo. TensorZero Autopilot (our automated AI engineer) analyzed hundreds of historical LLM traces to identify failure modes, tuned the prompt, and verified improvements with an LLM judge — autonomously, in <5 minutes. With more time, it can do much more: from model selection to fine-tuning to adaptive experimentation, TensorZero Autopilot dramatically improves the performance of LLM agents across diverse tasks. Learn more below ↓
[2 images]
1
4
7
3.2K
shuyang (@_shuyang_)
people on the subway are watching ai-generated animal videos now. what have we done?
0
0
0
19
shuyang (@_shuyang_)
@camsoft2000 fighting the same battle here. how much have you tried giving design docs to agents, and has that helped? i find that forcing agents to read/write docs first helps them focus on the right codebase areas, but the code still spins out of control eventually. maybe it's a skill issue
0
0
0
125
camsoft2000 (@camsoft2000)
I’m getting to the point with one of the projects I work on where the complexity of AI slop is becoming a real issue. While I can still happily prompt the agent to add x feature and it will do so and it will likely work perfectly, the code is just getting too complex and fragmented.

Agents love to copy and paste, and keeping patterns DRY is a real challenge. The agent will start diverging all those copy and pastes until you’ve got loads of similar but slightly different blocks of logic. Again, it all still works and solves the problem I’m after. But I just can’t get any kind of consistency anymore; the code is a mess and I just don’t have a handle on it.

I want a clean unified architecture but agents just code with tunnel vision. The project is now too big and complex for an agent to fully reason with and too big and complex for me to reason with. The only real solution is a complete rewrite. Maybe this is the way things will go. Code will just become disposable.

I don’t really want to care about the code, and to be honest I don’t, but I do care about consistency and maintainability, and the AI slop is hurting those very things I do care about.

I know some will say “I’m holding it wrong”, use x, y, z skill, tool, whatever, and I already use tools and anti-slop skills, plans, docs, etc., but the outcome is the same.

Vibe coding something into existence is truly magical. But turning it into a mature product with months of iterations is painful. I can’t even hand code this thing because I don’t understand the code anymore and I’m too lazy to try and code myself because I’m addicted to AI.

So what’s the solution? Either start again and accept that’s just the way we have to roll, or just carry on fighting the slop and accept each new feature will take longer to implement than the last.

I’m tired. I’m addicted.
165
37
602
84.4K