shuyang

69 posts


@_shuyang_

tech, fashion, art. opinions my own. tell me something anonymously: https://t.co/o3U7Em6L3r

New York, NY · Joined March 2024

314 Following · 36 Followers

shuyang@_shuyang_·1d

gpt-image-2 is good and all but nano banana still holds its own. (prompt: generate an image of jorge luis borges’s library of babel in the style of mc escher’s print gallery; first one is gpt and second is nano banana; 🍌 still “gets” it more)


shuyang@_shuyang_·2d

@mhui1109 now that i have free time…


Michelle@mhui1109·2d

@_shuyang_ on a creative substack spree


shuyang@_shuyang_·2d

Chinese users are paying $200 for Claude subscriptions they’re not supposed to have and calling it a “supply chain risk.” I looked into the absurd techniques they use to keep their accounts alive: open.substack.com/pub/shuyangli/…


shuyang@_shuyang_·5d

I had been making my Hermes agent do all my diet and exercise tracking, and I think our relationship to computers will change very soon. I wrote about it on Substack: open.substack.com/pub/shuyangli/…


shuyang@_shuyang_·Apr 16

what timeline is this


shuyang@_shuyang_·Apr 10

are we making art or are our uploaded brains making art

Xiao Ma@infoxiao

who wants to do an art project with me


shuyang reposted

Mitchell Hashimoto@mitchellh·Apr 7

x.com/i/article/2041…


shuyang@_shuyang_·Apr 5

✅ never triggered the claude code curse word classifier (regex)


shuyang@_shuyang_·Apr 5

i think the discourse of anthropic’s emotions paper is missing the point - how we communicate naturally implies emotion, so it’s natural for llms to model emotions. anthropic is being reasonable labeling them as “functional emotions”


shuyang@_shuyang_·31 Mar

how well can LLMs play balatro? stay tuned


shuyang@_shuyang_·30 Mar

ok this just in scbench.ai


shuyang@_shuyang_·26 Mar

are labs running evals on the long-term consequence of agents’ actions today? like repeated game playing, maintenance cost, complexity, etc?


shuyang@_shuyang_·30 Mar

banger book. interesting off-topic takeaway for our age - important systems need sustainment, which requires interpreting the *intention behind the system* and allowing agents to improvise when conditions change - multiple wars were won and lost on agency


shuyang@_shuyang_·27 Mar

@barrald good take! however i might bet on linear? you can't get the planning and feedback context from code / PRs, and linear can better decide whether something is even worth shipping. that being said, all i wish for is for github to actually serve git ops with 99.99% reliability...


Barry McCardel@barrald·26 Mar

People gave me shit for this but the opportunity to be the new "hub" for all the coding agent spokes is massive. If Cursor shipped cloud repos & code review there would be an absolute stampede. Meanwhile, I was feeling masochistic this morning and tried to use Github's chat 💀

Barry McCardel@barrald

The Cursor vs. Claude/Codex framing feels very flawed and misses the bigger picture.

The labs have ~infinite money and specialized talent, and are going to win on coding models – that's a runaway train. Composer is impressive, but ultimately more for margin protection / defense than playing to win that game.

But to quote Stringer Bell, "there are games beyond the game" and I believe Cursor's destiny is different: becoming the new Github – the place where the whole engineering process lives. Their real competition is with Github and Linear, not the labs.

Bugbot is a great start. We find it super valuable, no matter what coding agent is used, and it is a nice wedge into Cursor getting beyond the coding itself. And of course acquiring @graphite. PR review is the single most essential workflow in Github and very ripe for disruption – Cursor is in an amazing position for this.

More on the horizon. The cloud sandbox thing is going to be huge. The new Automations thing aims at GH Actions. And it wouldn't surprise me to see them start getting more into security, observability, etc.

Could Anthropic/OpenAI try to compete here? Sure. But I don't think customers want them to. I want my coding agent to be a coding agent and would be happy to pay for another model-agnostic system that sits across the whole menagerie and helps me manage it.

My prediction is that in a year we'll look back at the "Claude Code is great, therefore Cursor is cooked" discourse as misguided, and understand Cursor as playing a different game entirely.


shuyang@_shuyang_·26 Mar

@trq212 @glawsontweets how is it not a subagent if it's backed by sonnet 4.6? (do you put a different classifier head instead of the lm head on the output of some intermediate layers?) if true, interpretability paying off!


Thariq@trq212·25 Mar

@glawsontweets that's not what this is, subagents are too expensive and take too long to run on permissions, we use a special classifier


shuyang@_shuyang_·25 Mar

@thebigmehtaphor llm engineering is a systems problem of how to manage non-deterministic components


Viraj Mehta@thebigmehtaphor·25 Mar

going through a bunch of Autopilot traces: Autopilot's adaptive routing correctly de-prioritized its own bad predictions. Even when offline eval ranked the initial variant first, the live A/B test routed traffic to the actual winner. The system corrected itself.


shuyang reposted

TensorZero@TensorZero·25 Mar

Can an automated AI engineer autonomously debug and optimize an LLM pipeline in 5 minutes? Last night, ours did: it cut errors in ~half during its first live demo. TensorZero Autopilot (our automated AI engineer) analyzed hundreds of historical LLM traces to identify failure modes, tuned the prompt, and verified improvements with an LLM judge — autonomously, in <5 minutes. With more time, it can do much more: from model selection to fine-tuning to adaptive experimentation, TensorZero Autopilot dramatically improves the performance of LLM agents across diverse tasks. Learn more below ↓


shuyang@_shuyang_·23 Mar

recursive self-improvement wen?

TensorZero@TensorZero

We’re building TensorZero Autopilot, an automated AI engineer that analyzes LLM observability data, optimizes prompts and models, sets up evals, and runs A/B tests. It dramatically improves the performance of LLM agents on every single benchmark we’ve tried. Read more below.


shuyang@_shuyang_·23 Mar

people on the subway are watching ai-generated animal videos now. what have we done?


shuyang@_shuyang_·23 Mar

@camsoft2000 fighting the same battle here. how much have you tried giving design docs to agents, and has that helped? i find that forcing agents to read/write docs first helps them focus on the right codebase areas, but the code still spins out of control eventually. maybe skill issue


camsoft2000@camsoft2000·23 Mar

I’m getting to the point with one of the projects I work on where the complexity of AI slop is becoming a real issue. While I can still happily prompt the agent to add x feature and it will do so and it will likely work perfectly, the code is just getting too complex and fragmented.

Agents love to copy and paste and keeping patterns DRY is a real challenge. The agent will start diverging all those copy and pastes until you’ve got loads of similar but slightly different blocks of logic. Again it all still works and solves the problem I’m after. But I just can’t get any kind of consistency anymore, the code is a mess and I just don’t have a handle on it. I want a clean unified architecture but agents just code with tunnel vision.

The project is now too big and complex for an agent to fully reason with and too big and complex for me to reason with. The only real solution is a complete rewrite. Maybe this is the way things will go. Code will just become disposable. I don’t really want to care about the code and to be honest I don’t but I do care about consistency and maintainability and the AI slop is hurting those very things I do care about.

I know some will say “I’m holding it wrong”, use x,y,z skill, tool whatever and already use tools and anti slop skills, plans, docs, etc but the outcome is the same.

Vibe coding something into existence is truly magical. But turning it into a mature product with months of iterations is painful. I can’t even hand code this thing because I don’t understand the code anymore and I’m too lazy to try and code myself because I’m addicted to AI.

So what’s the solution, either start again and accept that’s just the way we have to roll, or just carry on fighting the slop and accept each new feature will take longer to implement than the last. I’m tired. I’m addicted.

