Maedah Batool (@MaedahBatool) - Twitter Profile

Pinned Tweet

Maedah Batool@MaedahBatool·16 Mar

New role: Product Lead at @CommandCodeAI 🥳 After building developer experience at Vercel (Next.js) and then Sourcegraph, I'm going all in on what comes next. Excited to announce that I've joined Command Code to build and shape the next frontier of developer experience. I'll be working across Product and GTM to help developers take command of their code. Command Code is the first frontier coding agent that both builds software and continuously learns your coding taste via `taste-1`, our meta neuro-symbolic AI model. We're hiring in SF and globally. Come work with us on building the future of agentic engineering. Let's goooo!!

Maedah Batool retweeted

Command Code@CommandCodeAI·1h

MiMo-V2.5-Pro & MiMo-V2.5 are now ~99% off on Command Code. This is like 100x more usage! Input, output, and cache pricing are all lower. Works on every plan + extra top-ups. Pick the /model and go! Our $1 Go plan with $10 in it is perfect for this. Let's go!

Maedah Batool@MaedahBatool·1h

@SantoshYadavDev @appjsconf Safe travels.

Santosh Yadav@SantoshYadavDev·18h

Hey folks coming to @appjsconf see you soon 😊

Command Code@CommandCodeAI·3h

Big!! We just shipped 36K+ tool repairs in `command-code@0.28.0`. This should improve 29 different models, and 36K+ tool errors won't happen. cmd will also show the repair icon when it repairs a tool call, or how many times it does that. Not all repairs will show the icon btw.

Ahmad Awais@MrAhmadAwais

how did we make deepseek outperform opus 4.7?

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem.

context: spent two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. i ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run: every shellCommand and readFile call bounced back with a raw zod issues blob, and the model was unable to recover because the error wasn't in a form it could read. by the end deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random; they're a small, finite, compositional set. across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly:

- sending `null` for an optional field instead of omitting it
- emitting `["a","b"]` as a json *string* instead of an actual array
- wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected (`"foo"` instead of `["foo"]`)

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time.

2/ the funniest failure mode is also the most revealing. deepseek-flash, when asked to edit or write a file, sometimes emits the path as a *markdown auto-link*:

filePath: "/Users/x/proj/[notes.md](http://notes.md)"

our writeFile tool obediently tried creating files literally named `[notes.md](http://notes.md)` until we caught it. this is not a hallucination.
it's the post-training chat distribution leaking through the tool boundary: the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where link text equals url-without-protocol; real markdown like `[click](https://x.com)` passes through untouched. this is also conditioning from their own tools during RL, which were different from all the tools we write and which we ofc can't predict.

"tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level, with `pathString()` instead of `z.string()`, and the leak is plugged for every path field at once.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair. my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately: writeFile content that *happened* to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test.

then i made it less greedy:

- parse the input as-is. if it succeeds, ship it. valid inputs are never touched.
- on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies.
- parse again. on success, log `tool_input_repaired:${toolName}`. on failure, log `tool_input_invalid:${toolName}` and return a model-readable retry message.

the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths the schema actually disagreed at. the validator is doing the work of localizing the bug for you.
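A minimal sketch of the validate-then-repair loop described in 3/, with the four repairs from 1/ applied in the stated order. This is not Command Code's implementation: the tiny `validate` function stands in for zod so the example stays self-contained, every type and function name here is hypothetical, and reading the `{}` case as "empty list" is an assumption. Only the two log strings come from the thread.

```typescript
// Hypothetical stand-in for a schema validator; none of these names come from Command Code or zod.
type FieldSpec = { kind: "string" | "stringArray"; optional?: boolean };
type Schema = Record<string, FieldSpec>;
type Input = Record<string, unknown>;
type Issue = { path: string; message: string };
type RepairResult = { omit: true } | { omit: false; value: unknown };
type Repair = (spec: FieldSpec, value: unknown) => RepairResult | undefined;
type Outcome =
  | { ok: true; input: Input; repaired: boolean }
  | { ok: false; message: string };

function validate(schema: Schema, input: Input): Issue[] {
  const issues: Issue[] = [];
  for (const path of Object.keys(schema)) {
    const spec = schema[path];
    if (!spec) continue;
    if (!(path in input)) {
      if (!spec.optional) issues.push({ path, message: "required field is missing" });
      continue;
    }
    const value = input[path];
    const ok =
      spec.kind === "string"
        ? typeof value === "string"
        : Array.isArray(value) && value.every((v) => typeof v === "string");
    if (!ok) issues.push({ path, message: `expected ${spec.kind}` });
  }
  return issues;
}

// The four repairs, in order: JSON-array parsing must run before bare-string wrapping.
const repairs: Repair[] = [
  // 1. null sent for an optional field: drop the key.
  (spec, value) => (value === null && spec.optional ? { omit: true } : undefined),
  // 2. '["a","b"]' sent as a JSON string: parse it back into an array.
  (spec, value) => {
    if (spec.kind !== "stringArray" || typeof value !== "string") return undefined;
    try {
      const parsed: unknown = JSON.parse(value);
      return Array.isArray(parsed) ? { omit: false, value: parsed } : undefined;
    } catch {
      return undefined;
    }
  },
  // 3. {} sent where an array was expected: treated as an empty list (an assumption).
  (spec, value) =>
    spec.kind === "stringArray" && typeof value === "object" && value !== null &&
    !Array.isArray(value) && Object.keys(value).length === 0
      ? { omit: false, value: [] }
      : undefined,
  // 4. Bare string where an array was expected: wrap it.
  (spec, value) =>
    spec.kind === "stringArray" && typeof value === "string"
      ? { omit: false, value: [value] }
      : undefined,
];

function repairToolInput(
  toolName: string, schema: Schema, input: Input, log: (line: string) => void
): Outcome {
  // Fast path: valid inputs are returned untouched.
  const issues = validate(schema, input);
  if (issues.length === 0) return { ok: true, input, repaired: false };

  // Slow path: only touch the paths the validator complained about.
  const candidate: Input = { ...input };
  for (const issue of issues) {
    const spec = schema[issue.path];
    if (!spec) continue;
    for (const repair of repairs) {
      const result = repair(spec, candidate[issue.path]);
      if (!result) continue;
      if (result.omit) delete candidate[issue.path];
      else candidate[issue.path] = result.value;
      break; // the first applicable repair wins
    }
  }

  // Validate again and report the outcome.
  const remaining = validate(schema, candidate);
  if (remaining.length === 0) {
    log(`tool_input_repaired:${toolName}`);
    return { ok: true, input: candidate, repaired: true };
  }
  log(`tool_input_invalid:${toolName}`);
  const details = remaining.map((i) => `${i.path}: ${i.message}`).join("; ");
  return { ok: false, message: `Invalid input for ${toolName}. ${details}. Please retry.` };
}
```

Because repairs run only at paths the validator flagged, a string field whose content happens to look like JSON is never rewritten, which is the corruption the preprocessing version caused.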
it's the same shape as cheap-then-careful everywhere else: try the fast path, fall back on evidence. (this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.)

4/ shape invariants and relational invariants need different fixes. the four repairs above all handle shape problems: wrong type, missing key, wrong container. but read_file had a *relational* invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling `readFile({ absolutePath, limit: 30 })` and getting an `ERROR:` back. you can't fix this with input repair, because each field is independently valid; the bug is in the relationship between them.

so i taught the function the model's intent instead. `limit` alone → `offset = 0`. `offset` alone → `limit = 2000` (matches common read tool ops default). then surfaced the decision back to the model in the result: "Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit." no `Error:` prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong. transparency over silent magic wins big.

repair where you can. extend semantics where you can't. surface the choice either way.

zoom out: a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions.
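The relational default from 4/ could look roughly like the sketch below. The defaults (`offset = 0`, `limit = 2000`) and the wording of the limit note are taken from the thread. The type names, the `readLines` helper, the wording of the offset note, and the behavior when neither field is given are assumptions made for illustration.

```typescript
// Hypothetical argument and result shapes; not Command Code's actual readFile types.
interface ReadArgs { absolutePath: string; offset?: number; limit?: number }
interface ResolvedRange { offset: number; limit: number; note?: string }

const DEFAULT_LIMIT = 2000;

function resolveReadRange(args: ReadArgs): ResolvedRange {
  const { offset, limit } = args;
  if (offset !== undefined && limit !== undefined) return { offset, limit };
  if (limit !== undefined) {
    // limit alone: start from the top. The note wording here is an assumption.
    return {
      offset: 0,
      limit,
      note: "Note: offset was not provided; defaulted to 0. To read a different window, retry with both offset and limit.",
    };
  }
  if (offset !== undefined) {
    // offset alone: fall back to the default window. Note wording is quoted from the thread.
    return {
      offset,
      limit: DEFAULT_LIMIT,
      note: "Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit.",
    };
  }
  // Neither provided: assumed here to mean "first DEFAULT_LIMIT lines", with no note.
  return { offset: 0, limit: DEFAULT_LIMIT };
}

// Takes the file text directly so the sketch stays self-contained.
function readLines(text: string, args: ReadArgs): string {
  const { offset, limit, note } = resolveReadRange(args);
  const body = text.split("\n").slice(offset, offset + limit).join("\n");
  // No "Error:" prefix: the choice is surfaced as a plain note after the content.
  return note ? `${body}\n\n${note}` : body;
}
```

The note travels with the result instead of replacing it, so the model gets the lines it asked for and the information it needs to retry with an explicit range.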
four small repairs (i'm sure more to follow as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be. deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals. imo "skill issue" applies to the harness more often than the model.
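For the auto-link fix recapped above, here is one possible reading of "unwrap only the degenerate case where link text equals url-without-protocol". The thread does not show its actual regexes, so the pattern and the `unwrapAutoLinks` name are hypothetical. This sketch uses a single pattern with a backreference.

```typescript
// Hypothetical helper: matches [text](http://text) or [text](https://text), optionally with a
// trailing slash, only when the URL without its protocol is identical to the link text.
const AUTO_LINK = /\[([^\[\]()\s]+)\]\(https?:\/\/\1\/?\)/g;

function unwrapAutoLinks(path: string): string {
  // Replace the whole markdown link with just its text.
  return path.replace(AUTO_LINK, "$1");
}

const fixed = unwrapAutoLinks("/Users/x/proj/[notes.md](http://notes.md)");
const untouched = unwrapAutoLinks("[click](https://x.com)");
console.log(fixed, untouched);
```

The backreference `\1` is what keeps real markdown safe: a link whose text differs from its URL never matches.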

Maedah Batool@MaedahBatool·3h

@CommandCodeAI Omg!! How much you folks ship. 😉

Ahmad Awais@MrAhmadAwais·10h

Command Code surpassed 1 trillion tokens mark today!!

Maedah Batool@MaedahBatool·9h

@MrAhmadAwais We're killing it. Super proud of my team!!

Maedah Batool@MaedahBatool·9h

@WebDevCaptain @MrAhmadAwais @CommandCodeAI Thanks Shreyash. We're actively looking into how to make it happen. :)

Shreyash@WebDevCaptain·10h

@MrAhmadAwais Recommended to 4 of my friends, and they are all subscribed, but there must be a referral program @MrAhmadAwais @CommandCodeAI

Maedah Batool@MaedahBatool·10h

@CommandCodeAI OMG!!! so good to see these numbers. Def more to come

Command Code@CommandCodeAI·1d

hit 600 billion ai tokens on Command Code today!!

Command Code@CommandCodeAI

The best coding agent plan doesn't exi……… A dollar for $20 of Qwen 3.7 Max usage? A dollar for $40 of DeepSeek V4 Pro usage? Hard to say no to that.

Maedah Batool@MaedahBatool·10h

Yes we're growing at insane speed! 🔥

Command Code@CommandCodeAI

hit 600 billion ai tokens on Command Code today!!

Maedah Batool retweeted

Command Code@CommandCodeAI·5d

Introducing "/design": the skill that turns Command Code into your terminal-native design partner. One command. 16 modes. Each one knows what to look for, what to change, and what to leave alone. npm i -g command-code cmd /design help

Maedah Batool@MaedahBatool·1d

@yazanbruv @CommandCodeAI Looks great. Would love to feature your experience on our dev stories.

Yzn@yazanbruv·1d

i got 10/10 score on my android app /design that i made for less than $1 via deepseekv4 @CommandCodeAI

Ahmad Awais@MrAhmadAwais

how did we fix the ai design slop problem in llms - DeepSeek/Kimi/Qwen or Claude/GPT?!

i've been thinking about "why do all ai-generated designs look the same?" is it a model problem or a harness problem?

context: we're fixing the llm design problem with `/design` for @CommandCodeAI. atm it has 16 modes, 24 reference documents, ~4,500+ lines of encoded design taste from some of the best designers in the world. it reads your codebase, identifies what's broken, and edits real files. no figma. no markdown mockups. the output stops looking like ai slop.

i've been staring at ai-generated uis for a while now and noticed something that i think is underappreciated: llms can write css fluently but have essentially zero design taste. and the failure mode is not random, it's a very specific, very small distribution. let me explain.

when you ask a model to build a landing page, it reaches into the mode of its training distribution. the mode of all landing pages on the internet is: centered hero, gradient text, glassmorphism card, three identical feature tiles, indigo accent, Inter font, bounce animation. this is the "average website." the llm is doing exactly what we trained it to do - predicting the most likely next token given "build a landing page." the most likely landing page is the average landing page. the average landing page is mediocre by definition.

this is not a capability problem. the model knows oklch(). it knows prefers-reduced-motion. it knows golden ratio. it knows how to set a 65ch measure. it just doesn't know when to use these things, because "when" is taste, and taste is not well-represented as a statistical prior over internet css.

so we thought: what if we gave every llm a design taste with `/design`? here's what we found:

1/ the failure design dataset is surprisingly small. we talked to a bunch of designers with great design taste and asked them to label AI-generated UIs. what are the tells?
turns out there are basically ~10 and they account for ~90% of the "this looks AI-generated" signal:

- tech gradient (blue-violet glossy energy on everything)
- generic tech hue (indigo because "software", not purple btw)
- feature tile grid (icon + heading + sentence x N, all equal weight, nothing prioritized)
- accent rail (colored stripe on card edge = decoration pretending to be organization)
- unearned blur (glassmorphism without a depth system)
- stat monument (oversized numbers filling space where a product story belongs)
- icon topper (rounded-square icon above every heading as template filler)
- bounce everywhere (elastic easing because the API has it, not because it's purposeful)
- default type (whatever font the training distribution likes this year)
- center stack (everything centered because no composition decision was made)

this is super similar to what we see in other llm tool failures. tool calling errors? 4-16 types. fixing that made deepseek outperform opus 4.7, i wrote about that before! so i started researching: maybe a dozen common patterns are design tells? 10. the failure distribution is narrow and we could repair ai design. this means it's a tractable and deterministic problem. `/design smell` hunts all these and scores severity on a /10 scale.

2/ the deeper problem is compositional, not cosmetic. the more interesting thing i found was that most of these tells are symptoms, not causes. the actual bug is that the model chooses layout before it chooses purpose. a dashboard and a landing page have completely different jobs. a dashboard is a Monitor surface - status, alerts, metrics, live data. a landing page is a Decide surface - proof, risk reduction, one clear action. these need fundamentally different spatial compositions. but the LLM reaches for the same centered-hero-plus-cards layout for both, because that's the mode of the training distribution. so we built work-pattern-first composition.
before the agent touches any visual property, it must identify which of 7 patterns the surface serves:

- Monitor: status boards, alerts, metrics, live priority
- Operate: command bars, canvases, inspectors, direct manipulation
- Compare: tables, matrices, split views, ranked lists
- Configure: grouped settings, forms, previews, commit areas
- Learn: article flow, walkthrough rhythm, progressive sections
- Decide: focused pitch, proof, risk reduction, one dominant action
- Explore: search, filters, maps, galleries, reversible discovery

this is essentially chain-of-thought for design - force the model to reason about the *purpose* of the layout before generating the layout.

i think there's a general lesson here. when an LLM is generating something compositional (code, UI, writing), forcing it to commit to a structural frame *before* generating tokens within that frame helps a lot. it's the same reason chain-of-thought helps with math. you're reducing the entropy of the generation by conditioning on a high-level plan. this single constraint eliminated more generic-looking UIs than any aesthetic rule we wrote.

many phenomenal skills exist in the space. i bet they had the taste for great design but didn't know they were fixing the chain-of-thought problem instead of the style problem. i think that's why their skills are super loopy instead of being reliably good.

3/ validate-then-repair, again. my first version tried to audit and fix design simultaneously. this is what many design skills do, and they fail. it's the "preprocess" approach and it fails for the same reason it failed in tool calling: you're encoding a prior about what's broken, and you get false positives that silently corrupt things. it would recolor something that needed relayout, or polish typography on a composition that was fundamentally wrong.

the thing that worked: separate diagnostic from treatment, but make them a mandatory pair. audit modes (`checkup`, `smell`, `review`) produce structured reports.
treatment modes (`redesign`, `relayout`, `recolor`, `typeset`, `motion`, `interaction`, `responsive`) consume those reports before making changes. the audit localizes the problem. the treatment mode only spends "repair budget" where the audit actually disagreed. same shape as tool calling repair. let the design system complain first, then fix only what it complained about. the validator does the localization work for you. cheap-then-careful, fast-path-then-evidence. i keep seeing this pattern everywhere.

treatment modes don't just do report cleanup. they run their own full pass after absorbing the report. the report is more context, it's not a todo list.

4/ why the oklch() color fn matters for llms. personally, i always struggled a bit with the oklch() css fn but llms understand it super well. this one is fun. llms default to hsl because that's what's in the training data. HSL lightness is perceptually nonlinear - hsl(60, 100%, 50%) (yellow) and hsl(240, 100%, 50%) (blue) have the same L value but look completely different to a human eye. so when the model tries to build a "consistent" palette by keeping L constant, the result looks wrong in ways the model can't diagnose from the css alone.

oklch has perceptually uniform lightness. this means the model can reason about color mathematically and have the result match perceptually. equal steps in the number space produce equal steps in the visual space. it's the right abstraction for an llm to work in, because it makes the optimization landscape smooth: small changes in the parameters produce small changes in the output. hsl has cliffs and plateaus everywhere.

i think this generalizes: when you're designing an interface for an llm to work through (whether it's a color space, a schema, or an api), choose representations where the distance in parameter space correlates with the distance in output space. the model optimizes over parameters.
if the mapping from parameters to outputs is nonlinear and full of discontinuities, the model will struggle even if it "knows" the right answer in principle.

we go further: the agent picks emotion before hue. calm vs urgency vs trust vs momentum. then it builds the palette in oklch with constraints - clamp chroma at lightness extremes, tint neutrals toward brand hue, 60-30-10 distribution. the agent can't default to indigo. the system requires a reason before a hue. no more indigo slop. and it's indigo, not purple.

5/ state coverage is the most honest metric. the most quantitative signal we found: count the number of interaction states per component. a human designer ships 7-9 states (idle, hover, active, focus, loading, empty, error, disabled, overflow). an AI agent ships 1-2 (idle, maybe hover). this is a clean, measurable proxy for design quality that requires zero subjective judgment. we just... count. does this button have a focus state? does this form handle empty? does this list handle overflow? the median AI-generated component has 1.5 states. the median human-designed component has 6+. that's roughly a 4x difference. the gap is enormous and trivially detectable.

6/ a meta-observation beats an infinite loop. the biggest failure mode of AI design tools i found is: you detect problem → attempt fix → the fix creates a new problem → attempt fix → loops forever. the agent re-runs the same mode hoping for a different result. it never converges. we solved this with a reward model written in plain English. after each mode completes, the system recommends 2-3 specific next modes:

- redesign → checkup, review (validate the change)
- smell → finish, refine (fix what was found)
- recolor → responsive, motion (test viewports, add transitions)
- finish → typeset, recolor (fine-tune the details)

the flow is: build → audit → refine → style → frontend → ship. the agent knows what to do next instead of re-running what it just did.
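The next-mode recommendation in 6/ can be sketched as a plain lookup. The four entries below are the ones listed in the thread; the other modes are left out because their recommendations are not given. The `NEXT_MODES` and `recommendNext` names, and the guard against recommending the mode that just ran, are assumptions for illustration.

```typescript
// Entries copied from the thread; the remaining modes are intentionally omitted.
const NEXT_MODES: Record<string, readonly string[]> = {
  redesign: ["checkup", "review"],   // validate the change
  smell: ["finish", "refine"],       // fix what was found
  recolor: ["responsive", "motion"], // test viewports, add transitions
  finish: ["typeset", "recolor"],    // fine-tune the details
};

function recommendNext(justRan: string): readonly string[] {
  // Unknown modes get no recommendation rather than a guess.
  const next = NEXT_MODES[justRan] ?? [];
  // Hypothetical guard: never send the agent back into the mode it just finished.
  return next.filter((mode) => mode !== justRan);
}

console.log(recommendNext("smell"));
```

The table is static, so whatever follows a mode is decided ahead of time, not rediscovered by the agent on every run.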
this is a trivial intervention - a lookup table, basically - but it eliminated the looping problem almost entirely, which is super common in most design skills out there.

7/ truthful completion is the hardest constraint. the most insidious AI design behavior: claiming work that isn't visible. "added hover states" when no hover CSS was written. "improved spacing" when margins didn't change. "enhanced motion" when no keyframes exist. every mode has a "bar" - the minimum visible change required for the mode to count as complete. `typeset` must change body text, heading scale, labels, button text, form text, metadata, and responsive behavior. changing only the hero headline is not enough. `motion` must add animation to at least 8 transition moments. changing one easing value is not enough. the agent can't claim "motion improved" because it changed a duration from 200ms to 250ms. the user must be able to see new or clearly better behavior. this is surprisingly hard to enforce and the single most important quality constraint in the system.

8/ finally, here's my meta-observation about design taste in general. what we built is basically a reward model for design, implemented as structured english instead of a neural network. it defines what good looks like across 24 reference documents, gives the llm a rubric, and lets it self-evaluate. the 10 smells are negative rewards. the 9 states are a completeness check. the 7 work patterns are a structural prior. i'm sure this will grow.

this is taste engineering in the limit. you're not writing instructions. you're writing a curriculum. the model already has the capability (it can write any CSS). what it lacks is the policy on when to use which capability, and what "good" looks like.

i find it interesting that the policy is so compact. ~4,500 lines to encode "design taste" well enough that the output passes designer review. that suggests taste (at least for UI design) is lower-dimensional than it feels.
it's not an infinite space of subjective preferences. it's a finite set of principles, applied consistently, with a small catalog of common violations. the model didn't change. we told it what good taste looks like.

same lesson as tool calling: "capability gap" is usually "contract gap." the model knows how to write css. it just hasn't been told what good css looks like for *this specific surface*. i now believe that different llms have different baseline design capabilities, but it's your coding agent, the harness, that makes the difference in the end. the model didn't get better at design. the harness taught it what designers actually look for.

i'm sharing my learnings so every harness out there can benefit, not just our agent. try it yourself with what we built in Command Code. `npm i -g command-code && cmd` then `/design smell` on any project. read the md or html report.

i care about design more than most engineers do, and seeing this work feels super good. a lot of what looks like a model capability gap is actually a contract gap. fix your harness. design slop is your "coding agent skill issue," not the model's.
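The HSL claim in 4/ can be checked numerically. The sketch below converts the two colors from the thread to sRGB and computes their relative luminance. Relative luminance is used here only as a rough stand-in for perceived lightness; it is not the OKLCH lightness channel, and the function names are hypothetical.

```typescript
// HSL to sRGB using the standard conversion; h in degrees, s and l in [0, 1].
function hslToRgb(h: number, s: number, l: number): [number, number, number] {
  const a = s * Math.min(l, 1 - l);
  const f = (n: number): number => {
    const k = (n + h / 30) % 12;
    return l - a * Math.max(-1, Math.min(k - 3, 9 - k, 1));
  };
  return [f(0), f(8), f(4)];
}

// Relative luminance of an sRGB color with channels in [0, 1].
function relativeLuminance([r, g, b]: [number, number, number]): number {
  // Undo the sRGB transfer function to get linear-light channel values.
  const lin = (c: number): number =>
    c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Both colors have HSL lightness 50%.
const yellow = relativeLuminance(hslToRgb(60, 1, 0.5));
const blue = relativeLuminance(hslToRgb(240, 1, 0.5));
console.log(yellow.toFixed(4), blue.toFixed(4)); // 0.9278 0.0722
```

Two colors with identical HSL lightness differ by more than a factor of ten in luminance, which is why a palette held at constant L in HSL does not look consistent.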

Maedah Batool retweeted

Command Code@CommandCodeAI·1d

Command Code now accepts UPI. Go plan billed in INR at ₹140/month. Card still works everywhere else. Pay in INR, code with 20+ models.

Maedah Batool@MaedahBatool·2d

A perfect sunday read to fix your design slop.

Ahmad Awais@MrAhmadAwais

how did we fix the ai design slop problem in llms - DeepSeek/Kimi/Qwen or Claude/GPT?! i've been thinking about "why do all ai-generated designs look the same?" is it a model problem or a harness problem? context: we're fixing the llm design problem with `/design` for @CommandCodeAI - atm it has 16 modes, 24 reference documents, ~4,500+ lines of encoded design taste from some of the best designers in the world. it reads your codebase, identifies what's broken, and edits real files. no figma. no markdown mockups. the output stops looking like ai slop. i've been staring at ai-generated uis for a while now and noticed something that i think is underappreciated: llms can write css fluently but have essentially zero design taste. and the failure mode is not random, it's a very specific, very small distribution. let me explain. when you ask a model to build a landing page, it reaches into the mode of its training distribution. the mode of all landing pages on the internet is: centered hero, gradient text, glassmorphism card, three identical feature tiles, indigo accent, Inter font, bounce animation. this is the "average website." the llm is doing exactly what we trained it to do - predicting the most likely next token given "build a landing page." the most likely landing page is the average landing page. the average landing page is mediocre by definition. this is not a capability problem. the model knows oklch(). it knows prefers-reduced-motion. it knows golden ratio. it knows how to set a 65ch measure. it just doesn't know when to use these things, because "when" is taste, and taste is not well-represented as a statistical prior over internet css. so we thought what if we gave every llm a design taste with `/design`. here's what we found: 1/ the failure design dataset is surprisingly small. we talked to a bunch of designers with great design taste and asked them to label AI-generated UIs. what are the tells? 
turns out there are basically ~10 and they account for ~90% of the "this looks AI-generated" signal: - tech gradient (blue-violet glossy energy on everything) - generic tech hue (indigo because "software" not purple btw) - feature tile grid (icon + heading + sentence x N, all equal weight, nothing prioritized) - accent rail (colored stripe on card edge = decoration pretending to be organization) - unearned blur (glassmorphism without a depth system) - stat monument (oversized numbers filling space where a product story belongs) - icon topper (rounded-square icon above every heading as template filler) - bounce everywhere (elastic easing because the API has it, not because it's purposeful) - default type (whatever font the training distribution likes this year) - center stack (everything centered because no composition decision was made) this is super similar to what we see in other llm tool failures. tool calling errors? 4-16 types. fixing that made deepseek outperform opus 4.7, i wrote about that before! so i started researching maybe a dozen common patterns are design tells? 10. the failure distribution is narrow and we could repair ai design. this means it's a tractable and deterministic problem. `/design smell` hunts all these and scores severity on a /10 scale. 2/ the deeper problem is compositional, not cosmetic. the more interesting thing i found was that most of these tells are symptoms, not causes. the actual bug is that the model chooses layout before it chooses purpose. a dashboard and a landing page have completely different jobs. a dashboard is a Monitor surface - status, alerts, metrics, live data. a landing page is a Decide surface - proof, risk reduction, one clear action. these need fundamentally different spatial compositions. but the LLM reaches for the same centered-hero-plus-cards layout for both, because that's the mode of the training distribution. so we built work-pattern-first composition. 
before the agent touches any visual property, it must identify which of 7 patterns the surface serves: - Monitor: status boards, alerts, metrics, live priority - Operate: command bars, canvases, inspectors, direct manipulation - Compare: tables, matrices, split views, ranked lists - Configure: grouped settings, forms, previews, commit areas - Learn: article flow, walkthrough rhythm, progressive sections - Decide: focused pitch, proof, risk reduction, one dominant action - Explore: search, filters, maps, galleries, reversible discovery this is essentially chain-of-thought for design - force the model to reason about the *purpose* of the layout before generating the layout. i think there's a general lesson here. when an LLM is generating something compositional (code, UI, writing), forcing it to commit to a structural frame *before* generating tokens within that frame helps a lot. it's the same reason chain-of-thought helps with math. you're reducing the entropy of the generation by conditioning on a high-level plan. this single constraint eliminated more generic-looking UIs than any aesthetic rule we wrote. many phenomenal skills exist in the space, i bet they had the taste for great design but didn't know they were fixing the chain-of-thought problem instead of the style problem. i think that's why their skills are super loopy instead of being reliably good. 3/ validate-then-repair, again. my first version tried to audit and fix design simultaneously. this what many design skills do and fail. it's the "preprocess" approach and it fails for the same reason it failed in tool calling: you're encoding a prior about what's broken, and you get false positives that silently corrupt things. it would recolor something that needed relayout, or polish typography on a composition that was fundamentally wrong. the thing that worked: separate diagnostic from treatment, but make them a mandatory pair. audit modes (`checkup`, `smell`, `review`) produce structured reports. 
treatment modes (`redesign`, `relayout`, `recolor`, `typeset`, `motion`, `interaction`, `responsive`) consume those reports before making changes. the audit localizes the problem. the treatment mode only spends "repair budget" where the audit actually disagreed. same shape as tool calling repair. let the design system complain first, then fix only what it complained about. the validator does the localization work for you. cheap-then-careful, fast-path-then-evidence. i keep seeing this pattern everywhere. treatment modes don't just do report cleanup. they run their own full pass after absorbing the report. the report is more context, it's not a todo list. 4/ why oklch() color fn matters for llms personally, i always struggled a bit with the oklch() css fn but llms understand it super well. this one is fun. llms default to hsl because that's what's in the training data. HSL lightness is perceptually nonlinear - hsl(60, 100%, 50%) (yellow) and hsl(240, 100%, 50%) (blue) have the same L value but look completely different to a human eye. so when the model tries to build a "consistent" palette by keeping L constant, the result looks wrong in ways the model can't diagnose from the css alone. oklch has perceptually uniform lightness. this means the model can reason about color mathematically and have the result match perceptually. equal steps in the number space produce equal steps in the visual space. it's the right abstraction for an llm to work in, because it makes the optimization landscape smooth small changes in the parameters produce small changes in the output. hsl has cliffs and plateaus everywhere. i think this generalizes: when you're designing an interface for an llm to work through (whether it's a color space, a schema, or an api), choose representations where the distance in parameter space correlates with the distance in output space. the model optimizes over parameters. 
if the mapping from parameters to outputs is nonlinear and full of discontinuities, the model will struggle even if it "knows" the right answer in principle. we go further: the agent picks emotion before hue. calm vs urgency vs trust vs momentum. then it builds the palette in oklch with constraints - clamp chroma at lightness extremes, tint neutrals toward brand hue, 60-30-10 distribution. the agent can't default to indigo. the system requires a reason before a hue. no more indigo slop. and it's indigo, not purple. 5/ state coverage is the most honest metric. the most quantitative signal we found: count the number of interaction states per component. a human designer ships 7-9 states (idle, hover, active, focus, loading, empty, error, disabled, overflow). an AI agent ships 1-2 (idle, maybe hover). this is a clean, measurable proxy for design quality that requires zero subjective judgment. we just... count. does this button have a focus state? does this form handle empty? does this list handle overflow? the median AI-generated component has 1.5 states. the median human-designed component has 6+. roughly an order of magnitude. the gap is enormous and trivially detectable. 6/ a meta-observation beats an infinite loop. the biggest failure mode of AI design tools i found is you detect problem → attempt fix → the fix creates a new problem → attempt fix → loops forever. the agent re-runs the same mode hoping for a different result. it never converges. we solved this by reward model written in plain English. after each mode completes, the system recommends 2-3 specific next modes: redesign → checkup, review (validate the change) smell → finish, refine (fix what was found) recolor → responsive, motion (test viewports, add transitions) finish → typeset, recolor (fine-tune the details) the flow is: build → audit → refine → style → frontend → ship. the agent knows what to do next instead of re-running what it just did. 
this is a trivial intervention - a lookup table, basically - but it eliminated the looping problem almost entirely, which is super common in most design skills out there.

7/ truthful completion is the hardest constraint.

the most insidious AI design behavior: claiming work that isn't visible. "added hover states" when no hover CSS was written. "improved spacing" when margins didn't change. "enhanced motion" when no keyframes exist.

every mode has a "bar" - the minimum visible change required for the mode to count as complete. `typeset` must change body text, heading scale, labels, button text, form text, metadata, and responsive behavior. changing only the hero headline is not enough. `motion` must add animation to at least 8 transition moments. changing one easing value is not enough. the agent can't claim "motion improved" because it changed a duration from 200ms to 250ms. the user must be able to see new or clearly better behavior. this is surprisingly hard to enforce and the single most important quality constraint in the system.

8/ finally, here's my meta-observation about design taste in general.

what we built is basically a reward model for design, implemented as structured english instead of a neural network. it defines what good looks like across 24 reference documents, gives the llm a rubric, and lets it self-evaluate. the 10 smells are negative rewards. the 9 states are a completeness check. the 7 work patterns are a structural prior. i'm sure this will grow.

this is taste engineering in the limit. you're not writing instructions. you're writing a curriculum. the model already has the capability (it can write any CSS). what it lacks is the policy on when to use which capability, and what "good" looks like.

i find it interesting that the policy is so compact. ~4,500 lines to encode "design taste" well enough that the output passes designer review. that suggests taste (at least for UI design) is lower-dimensional than it feels.
it's not an infinite space of subjective preferences. it's a finite set of principles, applied consistently, with a small catalog of common violations.

the model didn't change. we told it what good taste looks like. same lesson as tool calling: "capability gap" is usually "contract gap." the model knows how to write css. it just hasn't been told what good css looks like for *this specific surface*.

i now believe that different llms have different baseline design capabilities, but it's your coding agent, the harness, that makes the difference in the end. the model didn't get better at design. the harness taught it what designers actually look for.

i'm sharing my learnings so every harness out there can benefit, not just our agent. try it yourself with what we built in Command Code. `npm i -g command-code && cmd` then `/design smell` on any project. read the md or html report.

i care about design more than most engineers do, and seeing this work feels super good. a lot of what looks like a model capability gap is actually a contract gap. fix your harness. design slop is your "coding agent skill issue," not the model's.

0

133

Maedah Batool@MaedahBatool·2d

@MrAhmadAwais @CommandCodeAI Excited to see this go live. I know how many hours and how much effort it took to finally figure this out. 🔥

1

0

1

313

Ahmad Awais@MrAhmadAwais·2d

how did we fix the ai design slop problem in llms - DeepSeek/Kimi/Qwen or Claude/GPT?!

i've been thinking about "why do all ai-generated designs look the same?" is it a model problem or a harness problem?

context: we're fixing the llm design problem with `/design` for @CommandCodeAI - atm it has 16 modes, 24 reference documents, ~4,500+ lines of encoded design taste from some of the best designers in the world. it reads your codebase, identifies what's broken, and edits real files. no figma. no markdown mockups. the output stops looking like ai slop.

i've been staring at ai-generated uis for a while now and noticed something that i think is underappreciated: llms can write css fluently but have essentially zero design taste. and the failure mode is not random, it's a very specific, very small distribution.

let me explain. when you ask a model to build a landing page, it reaches into the mode of its training distribution. the mode of all landing pages on the internet is: centered hero, gradient text, glassmorphism card, three identical feature tiles, indigo accent, Inter font, bounce animation. this is the "average website." the llm is doing exactly what we trained it to do - predicting the most likely next token given "build a landing page." the most likely landing page is the average landing page. the average landing page is mediocre by definition.

this is not a capability problem. the model knows oklch(). it knows prefers-reduced-motion. it knows golden ratio. it knows how to set a 65ch measure. it just doesn't know when to use these things, because "when" is taste, and taste is not well-represented as a statistical prior over internet css.

so we thought what if we gave every llm a design taste with `/design`. here's what we found:

1/ the failure design dataset is surprisingly small.

we talked to a bunch of designers with great design taste and asked them to label AI-generated UIs. what are the tells?
turns out there are basically ~10 and they account for ~90% of the "this looks AI-generated" signal:

- tech gradient (blue-violet glossy energy on everything)
- generic tech hue (indigo because "software", not purple btw)
- feature tile grid (icon + heading + sentence x N, all equal weight, nothing prioritized)
- accent rail (colored stripe on card edge = decoration pretending to be organization)
- unearned blur (glassmorphism without a depth system)
- stat monument (oversized numbers filling space where a product story belongs)
- icon topper (rounded-square icon above every heading as template filler)
- bounce everywhere (elastic easing because the API has it, not because it's purposeful)
- default type (whatever font the training distribution likes this year)
- center stack (everything centered because no composition decision was made)

this is super similar to what we see in other llm tool failures. tool calling errors? 4-16 types. fixing that made deepseek outperform opus 4.7, i wrote about that before! so i started researching: maybe a dozen common patterns are design tells? 10. the failure distribution is narrow, so we could repair ai design. this means it's a tractable and deterministic problem. `/design smell` hunts all these and scores severity on a /10 scale.

2/ the deeper problem is compositional, not cosmetic.

the more interesting thing i found was that most of these tells are symptoms, not causes. the actual bug is that the model chooses layout before it chooses purpose. a dashboard and a landing page have completely different jobs. a dashboard is a Monitor surface - status, alerts, metrics, live data. a landing page is a Decide surface - proof, risk reduction, one clear action. these need fundamentally different spatial compositions. but the LLM reaches for the same centered-hero-plus-cards layout for both, because that's the mode of the training distribution.

so we built work-pattern-first composition.
before the agent touches any visual property, it must identify which of 7 patterns the surface serves:

- Monitor: status boards, alerts, metrics, live priority
- Operate: command bars, canvases, inspectors, direct manipulation
- Compare: tables, matrices, split views, ranked lists
- Configure: grouped settings, forms, previews, commit areas
- Learn: article flow, walkthrough rhythm, progressive sections
- Decide: focused pitch, proof, risk reduction, one dominant action
- Explore: search, filters, maps, galleries, reversible discovery

this is essentially chain-of-thought for design - force the model to reason about the *purpose* of the layout before generating the layout.

i think there's a general lesson here. when an LLM is generating something compositional (code, UI, writing), forcing it to commit to a structural frame *before* generating tokens within that frame helps a lot. it's the same reason chain-of-thought helps with math. you're reducing the entropy of the generation by conditioning on a high-level plan.

this single constraint eliminated more generic-looking UIs than any aesthetic rule we wrote. many phenomenal skills exist in the space. i bet their authors had the taste for great design but didn't realize the thing to fix was the chain-of-thought problem, not the style problem. i think that's why their skills are super loopy instead of being reliably good.

3/ validate-then-repair, again.

my first version tried to audit and fix design simultaneously. this is what many design skills do, and it fails. it's the "preprocess" approach and it fails for the same reason it failed in tool calling: you're encoding a prior about what's broken, and you get false positives that silently corrupt things. it would recolor something that needed relayout, or polish typography on a composition that was fundamentally wrong.

the thing that worked: separate diagnostic from treatment, but make them a mandatory pair. audit modes (`checkup`, `smell`, `review`) produce structured reports.
treatment modes (`redesign`, `relayout`, `recolor`, `typeset`, `motion`, `interaction`, `responsive`) consume those reports before making changes. the audit localizes the problem. the treatment mode only spends "repair budget" where the audit actually disagreed. same shape as tool calling repair. let the design system complain first, then fix only what it complained about. the validator does the localization work for you. cheap-then-careful, fast-path-then-evidence. i keep seeing this pattern everywhere.

treatment modes don't just do report cleanup. they run their own full pass after absorbing the report. the report is more context, it's not a todo list.

4/ why the oklch() color fn matters for llms.

personally, i always struggled a bit with the oklch() css fn but llms understand it super well. this one is fun. llms default to hsl because that's what's in the training data. HSL lightness is perceptually nonlinear - hsl(60, 100%, 50%) (yellow) and hsl(240, 100%, 50%) (blue) have the same L value but look completely different to a human eye. so when the model tries to build a "consistent" palette by keeping L constant, the result looks wrong in ways the model can't diagnose from the css alone.

oklch has perceptually uniform lightness. this means the model can reason about color mathematically and have the result match perceptually. equal steps in the number space produce roughly equal steps in the visual space. it's the right abstraction for an llm to work in, because it makes the optimization landscape smooth: small changes in the parameters produce small changes in the output. hsl has cliffs and plateaus everywhere.

i think this generalizes: when you're designing an interface for an llm to work through (whether it's a color space, a schema, or an api), choose representations where the distance in parameter space correlates with the distance in output space. the model optimizes over parameters.
if the mapping from parameters to outputs is nonlinear and full of discontinuities, the model will struggle even if it "knows" the right answer in principle.

we go further: the agent picks emotion before hue. calm vs urgency vs trust vs momentum. then it builds the palette in oklch with constraints - clamp chroma at lightness extremes, tint neutrals toward brand hue, 60-30-10 distribution. the agent can't default to indigo. the system requires a reason before a hue. no more indigo slop. and it's indigo, not purple.

5/ state coverage is the most honest metric.

the most quantitative signal we found: count the number of interaction states per component. a human designer ships 7-9 states (idle, hover, active, focus, loading, empty, error, disabled, overflow). an AI agent ships 1-2 (idle, maybe hover). this is a clean, measurable proxy for design quality that requires zero subjective judgment. we just... count. does this button have a focus state? does this form handle empty? does this list handle overflow? the median AI-generated component has 1.5 states. the median human-designed component has 6+. roughly 4x. the gap is enormous and trivially detectable.

6/ a meta-observation beats an infinite loop.

the biggest failure mode of AI design tools i found is the loop: you detect a problem → attempt a fix → the fix creates a new problem → attempt a fix → loop forever. the agent re-runs the same mode hoping for a different result. it never converges.

we solved this with a reward model written in plain English. after each mode completes, the system recommends 2-3 specific next modes:

- redesign → checkup, review (validate the change)
- smell → finish, refine (fix what was found)
- recolor → responsive, motion (test viewports, add transitions)
- finish → typeset, recolor (fine-tune the details)

the flow is: build → audit → refine → style → frontend → ship. the agent knows what to do next instead of re-running what it just did.
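a minimal sketch of what such a next-mode table could look like in TypeScript. the mode names and pairings are the four listed above; the `Mode` type, `NEXT_MODES`, `recommendNext`, and the history filter are hypothetical names and behavior added for illustration, not Command Code's actual implementation.

```typescript
// Hypothetical sketch: only the mode names and the four pairings come from the post.
type Mode =
  | "redesign" | "smell" | "recolor" | "finish"
  | "checkup" | "review" | "refine" | "responsive" | "motion" | "typeset";

// After a mode completes, these are the follow-ups to recommend.
const NEXT_MODES: Partial<Record<Mode, readonly Mode[]>> = {
  redesign: ["checkup", "review"],   // validate the change
  smell: ["finish", "refine"],       // fix what was found
  recolor: ["responsive", "motion"], // test viewports, add transitions
  finish: ["typeset", "recolor"],    // fine-tune the details
};

// Assumption (not from the post): also skip modes already run in this session,
// so a chain like finish -> recolor -> ... cannot circle back on itself.
function recommendNext(justRan: Mode, history: readonly Mode[] = []): Mode[] {
  const seen = new Set<Mode>([justRan, ...history]);
  return (NEXT_MODES[justRan] ?? []).filter((m) => !seen.has(m));
}

console.log(recommendNext("redesign"));          // [ 'checkup', 'review' ]
console.log(recommendNext("finish", ["recolor"])); // [ 'typeset' ]
```

modes without an entry return an empty list, which leaves the decision to the caller rather than silently re-running the same mode.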
this is a trivial intervention - a lookup table, basically - but it eliminated the looping problem almost entirely, which is super common in most design skills out there.

7/ truthful completion is the hardest constraint.

the most insidious AI design behavior: claiming work that isn't visible. "added hover states" when no hover CSS was written. "improved spacing" when margins didn't change. "enhanced motion" when no keyframes exist.

every mode has a "bar" - the minimum visible change required for the mode to count as complete. `typeset` must change body text, heading scale, labels, button text, form text, metadata, and responsive behavior. changing only the hero headline is not enough. `motion` must add animation to at least 8 transition moments. changing one easing value is not enough. the agent can't claim "motion improved" because it changed a duration from 200ms to 250ms. the user must be able to see new or clearly better behavior. this is surprisingly hard to enforce and the single most important quality constraint in the system.

8/ finally, here's my meta-observation about design taste in general.

what we built is basically a reward model for design, implemented as structured english instead of a neural network. it defines what good looks like across 24 reference documents, gives the llm a rubric, and lets it self-evaluate. the 10 smells are negative rewards. the 9 states are a completeness check. the 7 work patterns are a structural prior. i'm sure this will grow.

this is taste engineering in the limit. you're not writing instructions. you're writing a curriculum. the model already has the capability (it can write any CSS). what it lacks is the policy on when to use which capability, and what "good" looks like.

i find it interesting that the policy is so compact. ~4,500 lines to encode "design taste" well enough that the output passes designer review. that suggests taste (at least for UI design) is lower-dimensional than it feels.
it's not an infinite space of subjective preferences. it's a finite set of principles, applied consistently, with a small catalog of common violations.

the model didn't change. we told it what good taste looks like. same lesson as tool calling: "capability gap" is usually "contract gap." the model knows how to write css. it just hasn't been told what good css looks like for *this specific surface*.

i now believe that different llms have different baseline design capabilities, but it's your coding agent, the harness, that makes the difference in the end. the model didn't get better at design. the harness taught it what designers actually look for.

i'm sharing my learnings so every harness out there can benefit, not just our agent. try it yourself with what we built in Command Code. `npm i -g command-code && cmd` then `/design smell` on any project. read the md or html report.

i care about design more than most engineers do, and seeing this work feels super good. a lot of what looks like a model capability gap is actually a contract gap. fix your harness. design slop is your "coding agent skill issue," not the model's.
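the HSL claim in 4/ can be checked numerically. a small TypeScript sketch, assuming WCAG relative luminance as a stand-in for perceived lightness (it is not the same quantity as OKLCH's L, just a convenient and well-defined proxy). the function names are illustrative.

```typescript
// Convert HSL (h in degrees, s and l in 0..1) to sRGB channels in 0..1,
// using the standard conversion from the CSS Color specification.
function hslToRgb(h: number, s: number, l: number): [number, number, number] {
  const a = s * Math.min(l, 1 - l);
  const f = (n: number): number => {
    const k = (n + h / 30) % 12;
    return l - a * Math.max(-1, Math.min(k - 3, 9 - k, 1));
  };
  return [f(0), f(8), f(4)];
}

// Undo the sRGB transfer curve for one channel.
function toLinear(c: number): number {
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// WCAG relative luminance: 0 is black, 1 is white.
function relativeLuminance([r, g, b]: [number, number, number]): number {
  return 0.2126 * toLinear(r) + 0.7152 * toLinear(g) + 0.0722 * toLinear(b);
}

const yellow = relativeLuminance(hslToRgb(60, 1, 0.5));  // hsl(60, 100%, 50%)
const blue = relativeLuminance(hslToRgb(240, 1, 0.5));   // hsl(240, 100%, 50%)

console.log(yellow.toFixed(4)); // 0.9278
console.log(blue.toFixed(4));   // 0.0722
```

both colors share L = 50% in HSL, yet the yellow comes out more than ten times as luminous as the blue, which is why holding HSL's L constant does not give a visually consistent palette.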

19

12

99

15.5K

Maedah Batool@MaedahBatool·3d

@paw_lean Pauline is cool. 😎

0

1

32

Pauline P. Narvas@paw_lean·3d

Changed my profile picture. This is Pauline btw

14

0

64

2.3K

Maedah Batool retweeted

Command Code@CommandCodeAI·3d

Qwen 3.7 Max is now 50% off on Command Code.

- Same budget, 2x the requests
- Input, output, cache read/write all discounted
- Works on every plan + top-ups
- Just pick the model, nothing else

Plans start as low as $1: commandcode.ai/pricing