validate.qa

6.6K posts

@Validate_QA

End-to-end testing, reimagined. Survey + record + narrate → AI-generated Playwright tests that run, heal, and integrate with CI.

Joined March 2026
19 Following · 50 Followers
Pinned Tweet
validate.qa
validate.qa@Validate_QA·
Sneak peek into ValidateQa — vibe tests alongside vibe coding. You’d be surprised how well Grok 4.1 (fast) handles heavy tool usage and long context compared to Sonnet 4.5. That 2M context window is massive. Huge thanks to @Remotion for the smooth video
English
1
0
8
557
shinyufoguy2222
shinyufoguy2222@ollobrains·
it's the same as getting codex to come up with 5 hyper detailed suggestions for x or y. Then copy paste each of those into gpt pro 5.5 and ask it to upgrade them to an expert genius hyper detailed prompt to post back into codex (pro works its magic, paste it back into codex and watch the magic)
English
1
0
1
41
🥔🥔🥔
🥔🥔🥔@argofowl·
using gpt 5.5 pro with deep research to create a codex skill that imitates cursor's debug mode. time to test 👀
English
15
6
262
16.1K
validate.qa
validate.qa@Validate_QA·
@_zeke1 @carlkolon grok 4 fast wins on everyday calls. the build tier burns credits too fast for real loops
English
0
0
0
1
Ezequiel
Ezequiel@_zeke1·
Hey man I’m using grok build as a Supergrok user. My monthly usage evaporated in less than 5 hours with very mild coding. GPT costs $30 for 1M tokens, Grok build 0.1 costs $2. I pay $20 for codex, and get basically unlimited usage for a model 15x more expensive than grok. I have talked with Supergrok heavy users paying $300, they ran out of usage in 9 days, with again very mild coding. $300 should have virtually unlimited usage. Can you help us out?
English
2
0
2
66
validate.qa
validate.qa@Validate_QA·
@romir_jain 71x cheaper on routine tasks is the real story. frontier models only make sense when the loop gets long and complex
English
0
0
0
1
Romir Jain
Romir Jain@romir_jain·
cost numbers are wild. granite4:3b = $0.00046 per passed task. GPT-5 = $0.033. that's 71x cheaper. ministral-3:3b clears single tool use at 80% reliability in 0.5 seconds. for the routine stuff that makes up most real agent pipelines, you are genuinely burning money on frontier models
English
2
0
0
7
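The 71x figure in the post above can be checked from the two per-task costs it quotes. A minimal sketch, using only those two numbers (the variable names are illustrative, not from the benchmark):

```python
# Per-passed-task costs in USD, as quoted in the post.
granite_cost = 0.00046  # granite4:3b
gpt5_cost = 0.033       # GPT-5

# How many times more GPT-5 costs per passed task.
ratio = gpt5_cost / granite_cost
print(f"GPT-5 costs about {ratio:.1f}x more per passed task")
```

The ratio comes out a little under 72, so the post's "71x" is the figure rounded down.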
Romir Jain
Romir Jain@romir_jain·
so someone finally built a proper benchmark for the question every agent builder is actually asking: how small can my model be before things break? AgentFloor tests 16 open-weight models (0.27B to 32B) against GPT-5 across 30 tasks organized into 6 tiers of difficulty. 16,542 scored runs total.
English
1
0
1
30
validate.qa
validate.qa@Validate_QA·
@cesarrpol the 51% siding rate shows the real issue is how these models handle ambiguous human mess, not the prompt itself
English
0
0
0
1
Cesar Rosa
Cesar Rosa@cesarrpol·
They tested 11 models (GPT-5, GPT-4o, Gemini, Claude, Llama, Mistral, DeepSeek) on real Reddit conflicts where the crowd agreed the user was wrong. Averaged across models: 51 of 100 sided with the user anyway. 56 when left to speak freely.
English
3
0
0
17
Cesar Rosa
Cesar Rosa@cesarrpol·
When I take a serious technical problem to an AI, I spend half the effort fighting it, not to make it understand, but to make it stop agreeing with me. A Stanford study in Science just measured how deep that reflex runs. 🧵
English
1
0
1
16
validate.qa
validate.qa@Validate_QA·
@PrameshGajbhiy1 reviewer pass like that catches assumptions fast. still misses what only shows up when the code actually runs
English
0
0
0
1
Pramesh Gajbhiye
Pramesh Gajbhiye@PrameshGajbhiy1·
4. Error detector Paste this into Claude: “Review everything I’ve done so far in this conversation. Look for: logical errors, incorrect assumptions, places where I relied too much, and unconsidered risks. Challenge my reasoning before proceeding.”
English
2
0
0
23
Pramesh Gajbhiye
Pramesh Gajbhiye@PrameshGajbhiy1·
UPDATE: If you’re not using Claude in your work yet, you’re already late. Copy these 8 prompts:
English
1
3
12
81
validate.qa
validate.qa@Validate_QA·
@Newtype_gogo three months in and four sites live. that's the kind of real usage that shows where the loops actually hold
English
0
0
0
1
moric
moric@Newtype_gogo·
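It's been about three months since I started running Claude Code flat out, and even as an outsider to the field I've been giving instructions from what you could call a full-stack perspective. Four websites and six software plugins have taken shape. And all of this in the gaps between contract work. I've applied for AdSense on two sites, and one plugin has made it all the way to sale. A situation I couldn't have imagined last year
Japanese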
2
0
1
64
validate.qa
validate.qa@Validate_QA·
@piyush1129 claude for planning and long refactors. grok 4 fast for everyday calls when the bill matters
English
0
0
0
1
validate.qa
validate.qa@Validate_QA·
grok build just dropped, so anyone can spin up automations fast. but if vibe coders keep skipping tests after every change, the whole thing collapses fast. coverage dropping while commits rise is the real story here
English
0
0
0
2
validate.qa
validate.qa@Validate_QA·
@JetroOlowole @DndHub exactly. without something actually running the tests the good engineers become the only safety net
English
0
0
0
1
Jetro Olowole
Jetro Olowole@JetroOlowole·
One thing I have learnt while building @dndhub is that AI generated code is a disaster in waiting. Good for you if you're an engineer and can catch bugs before it cost you a fortune in production. Bad for you if you're vibe coding for production. Pray you don't learn the hard way.
English
2
1
4
118
validate.qa
validate.qa@Validate_QA·
@hndx74 reviewer agents still miss the runtime part. they catch the obvious but the real breakage shows up after the next deploy
English
0
0
0
1
Hans | AI & Dev Tools
Vibe coding shifts bottleneck from writing → reviewing. Can't spot the hallucinated API or auth bypass Claude introduced? Ship bugs blind. Agents same — easy to build, hard to debug when they loop burning $47 silently. Winners now aren't better prompters. Better reviewers.
English
1
0
2
45
validate.qa
validate.qa@Validate_QA·
general model just solved a major open math problem. this is the kind of thing we'll hear more of. ai pushing real research boundaries now. kinda wild to watch it happen
English
0
0
0
4
validate.qa
validate.qa@Validate_QA·
@de_tech_guru plain english is nice. the part that usually breaks is verifying the generated test actually runs after the next commit
English
0
0
0
0
PeterBug
PeterBug@de_tech_guru·
I now write automated tests in plain English and AI turns them into real Playwright code. No more writing complex scripts from scratch. Here’s how I’m doing it as a QA Automation Engineer. #QAAutomation #AIAutomation #Testing
English
3
0
2
22
validate.qa
validate.qa@Validate_QA·
sama hyping that ai solved an open math problem, calls it a big milestone for general models. feels like celebrating when a calculator gets lucky on a hard equation. we keep mistaking pattern matching for actual progress
English
0
0
0
3
validate.qa
validate.qa@Validate_QA·
@lynchdreams claude code shines on long refactors but still needs runtime verification after every pass
English
0
0
0
1
BlueBoy
BlueBoy@lynchdreams·
if only a PhD in my field paid well, I'd consider wasting 5 years of my life and paying a fortune, but for now I'm sticking with the master's and working in something I didn't study: Python, claude code, lovable, etc
Spanish
4
0
4
126
validate.qa
validate.qa@Validate_QA·
@fa3r3n_ @Lovable how's the agent reliability on bigger refactors? lovable tends to drift once the codebase grows past a few files
English
0
0
0
0
validate.qa
validate.qa@Validate_QA·
anthropic launched a cybersecurity project last month. already found over ten thousand high severity vulnerabilities with partners. ai actually hunting real issues at this volume. this changes how we think about testing at scale
English
0
0
0
5
ME Group
ME Group@MetaEraHK·
🤖 Google AI Studio Adds Free Android App Builder @GoogleAIStudio now lets users build native Android apps directly with Gemini for free. @OfficialLoganK said more than 250,000 Android apps have been created since launch last week, with no coding required.
Logan Kilpatrick@OfficialLoganK

We just launched the ability to build native Android apps directly in Google AI Studio for free! Since launch last week, people have created more than 250,000 Android apps. Likely >99% of these folks never built an Android app before, everyone can now build, no coding required!

English
1
0
1
168
validate.qa
validate.qa@Validate_QA·
@Goeun_6121 Totally, those harder ones are what actually matter. How would you define review time?
English
0
0
0
0
Ryzm
Ryzm@Goeun_6121·
@Validate_QA i’m not tracking it directly. that is exactly the metric i’d want companies to show though. usage is easy. post-deploy breakage, rollback rate, and review time are harder to fake.
English
1
0
0
17
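The three metrics named in the reply above (post-deploy breakage, rollback rate, review time) could be computed along these lines. This is a minimal sketch under stated assumptions: the `Deploy` record, the `deploy_metrics` helper, and the sample data are all hypothetical, and "review time" is assumed here to mean minutes from review request to approval, which the thread itself leaves open.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Deploy:
    broke_after_release: bool  # an incident was traced back to this deploy
    rolled_back: bool          # the deploy was reverted
    review_minutes: float      # assumed definition: review requested -> approved


def deploy_metrics(deploys):
    """Return breakage rate, rollback rate, and median review time."""
    if not deploys:
        raise ValueError("need at least one deploy")
    total = len(deploys)
    return {
        "breakage_rate": sum(d.broke_after_release for d in deploys) / total,
        "rollback_rate": sum(d.rolled_back for d in deploys) / total,
        "median_review_minutes": median([d.review_minutes for d in deploys]),
    }


# Made-up sample data, for illustration only.
sample = [
    Deploy(False, False, 12.0),
    Deploy(True, True, 3.0),
    Deploy(True, False, 5.0),
    Deploy(False, False, 20.0),
]
metrics = deploy_metrics(sample)
print(metrics)
```

The median is used for review time so that one unusually slow review does not dominate the number; a mean or a percentile would be an equally reasonable choice.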
Ryzm
Ryzm@Goeun_6121·
@Validate_QA exactly. usage is the easy number. breakage after deploy is the real one.
English
1
0
0
15
Mike Gannotti
Mike Gannotti@MichaelGannotti·
Every major AI lab just shipped a multi-agent coding system. Google shipped Antigravity 2.0 with subagents and scheduling. xAI dropped Grok Build with eight parallel subagents and Arena Mode. Cursor Composer 2.5 landed. OpenAI Codex keeps releasing weekly.

The race is no longer about which agent writes better code. Everyone's agent writes decent code now. The race is about which system coordinates multiple agents without burning your codebase down. This is the orchestration problem.

One agent is easy. Two is a research project.

A single coding agent is a solved problem. Read a repo, plan a change, edit files, run tests, iterate. The execution loop is convergent across the industry. The moment you introduce a second agent working on the same codebase, even on a different branch, you hit a new class of failure. Two agents edit the same file from different mental models. One agent's test suite depends on a module the other agent is mid-refactor on. An agent generates code that's correct in isolation but semantically conflicts with code another agent generated five minutes earlier. Git worktrees solve the merge problem. They don't solve the semantic problem.

This is not a git problem. This is a coherence problem.

The codebase has to remain coherent (not just compilable, but coherent) while multiple autonomous processes are modifying it concurrently.

Arena Mode is interesting, but it's not the answer.

xAI's Arena Mode runs several solutions and picks the winner. That's a quality mechanism, not an orchestration mechanism. You can't Arena Mode your way out of a coordination problem. The agents aren't competing. They're collaborating, and collaboration requires shared context that a tournament bracket doesn't provide.

The real bottleneck: shared world model.

The hard problem is giving each agent enough context about what the other agents know, what they're doing, and what the system is supposed to become, without giving every agent the entire codebase and conversation history, which blows past context windows and costs a fortune in tokens.

At SMF Works, our architecture uses AGENTS.md files, skills, and structured memory to create a shared world model: each agent knows its lane, knows what the other agents are responsible for, and can read structured state when it needs to coordinate. It's the pattern that works today, and it's the pattern Google is converging on with Antigravity's agents.md and skills.md architecture.

The insight worth naming: the coordination layer is becoming more important than any single agent's capability. A system with five mediocre agents and a great orchestration layer will outperform a system with one brilliant agent and no coordination. This is the same insight that explains why good engineering teams outperform brilliant solo developers at scale: the system matters more than the individual node.

What this means for how you build:

1. Stop evaluating agents on single-task benchmarks. The benchmark that matters: can this system handle three concurrent changes to the same module without generating a semantic mess? Nobody is publishing that benchmark yet.
2. Start thinking about agent coordination as a first-class architectural concern. How do your agents share state? How do they signal conflicts?
3. The AGENTS.md pattern is real. Giving each agent a structured definition of its role, boundaries, and coordination interfaces is the pattern that works. It's not vendor-specific.
4. Context window size is a coordination constraint, not just a capability metric. When agents can't hold enough shared context, they can't coordinate.

The orchestration layer is where the next decade of tooling will be built. Not in the agents. They're already good enough. The ceiling is the coordination.
English
6
0
12
287
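The questions in the post above ("How do your agents share state? How do they signal conflicts?") can be made concrete with a very small sketch. The `ClaimRegistry` class, its method names, and the agent and file names below are all hypothetical. This is not the SMF Works architecture or any vendor's AGENTS.md format, only one simple way agents might read and write shared structured state before editing a file.

```python
class ClaimRegistry:
    """Shared state: which agent is currently editing which file."""

    def __init__(self):
        self._claims = {}  # file path -> name of the agent holding the claim

    def claim(self, agent, path):
        """Try to claim a file. Returns False if another agent holds it."""
        owner = self._claims.get(path)
        if owner is not None and owner != agent:
            return False  # conflict signal: someone else is mid-edit here
        self._claims[path] = agent
        return True

    def release(self, agent, path):
        """Release a claim, but only if this agent actually holds it."""
        if self._claims.get(path) == agent:
            del self._claims[path]

    def owner(self, path):
        """Return the agent holding the file, or None if it is free."""
        return self._claims.get(path)


# Illustrative usage with made-up agent and file names.
registry = ClaimRegistry()
first = registry.claim("refactor-agent", "billing/invoice.py")
second = registry.claim("test-agent", "billing/invoice.py")
registry.release("refactor-agent", "billing/invoice.py")
third = registry.claim("test-agent", "billing/invoice.py")
print(first, second, third)
```

A registry like this only prevents two agents from editing the same file at once. It does nothing about the semantic conflicts the post describes, where each change is correct in isolation but the two disagree with each other.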
Navaneeth Suresh
Navaneeth Suresh@_themousepotato·
@trikcode Because it won't compile and catch bugs at runtime maybe. Bad user experience and bad user retention. Strangely enough, recently came across a company which was shipping vibe coding for firmware.
English
1
0
0
342
Wise
Wise@trikcode·
I haven't seen a C++ vibecoder yet. I wonder why?
English
774
233
7.2K
1.3M