Abdala

141 posts

Abdala
@ofabdalaX

building with AI. Claude Code heavy user. 100M+ tokens.

Brasil · Joined March 2024
71 Following · 11 Followers
Pinned Tweet
Abdala
Abdala@ofabdalaX·
46 min. 84.6k tokens. one session. that was enough for Claude Opus 4.7 to refactor 12 files in AIOS that had sat in the queue for 3 weeks. the rest of the timeline still runs 2-3 parallel accounts to finish 1 task. 3 things I learned running sessions like this:
1. a CLAUDE.md rule saying "run the tests BEFORE saying done" cuts rework by ~50%
2. breaking work into 3-4 sub-tasks bounds context better than 1 giant task
3. a screenshot of the intermediate state helps the model recover if it stalls
here in AIOS I run 512 agents in parallel. the timeline still debates whether agents are worth it. that debate is old news for anyone actually operating.
2
0
3
192
Abdala
Abdala@ofabdalaX·
signed skills change the trust boundary, not just the supply chain. ran 16 modular skills (1 per domain) here for 6 months. biggest win wasn't security, it was reproducibility: the claim "this skill behaved like X yesterday" becomes auditable today. signature plus version pinning is what makes agent QA possible at scale.
0
0
0
88
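The "signature plus version pinning" idea in this exchange can be sketched as a digest check against a local lockfile. Everything below is hypothetical (the lockfile layout, the skill name, the helper function): a minimal sketch of treating a skill as untrusted until its bytes match a pinned hash.

```python
import hashlib

# Hypothetical lockfile: pinned (skill, version) pairs mapped to expected
# SHA-256 digests of the skill's contents.
PINNED_SKILLS = {
    ("summarize", "1.2.0"): "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_skill(name: str, version: str, skill_bytes: bytes) -> bool:
    """Treat a skill as untrusted code until its digest matches the pin."""
    expected = PINNED_SKILLS.get((name, version))
    if expected is None:
        return False  # unknown skill or version: never trusted by default
    return hashlib.sha256(skill_bytes).hexdigest() == expected
```

This is what makes runs reproducible: a verified skill at a pinned version cannot silently change between yesterday's behavior and today's.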
elvis
elvis@omarsar0·
// Skills as Verifiable Artifacts //

Pay attention to this one, AI devs. If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. This paper argues a skill is untrusted code until it is verified. The runtime should enforce that default rather than infer trust from origin.

Without skill verification, HITL has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified.

Skills are now first-class deployment artifacts. We have decades of supply-chain lessons on what happens when trust is inferred from a signature. This paper is the right ask for SKILL.md before agent skill libraries become the next attack surface.

Paper: arxiv.org/abs/2605.00424
Learn to build effective AI agents in our academy: academy.dair.ai
17
44
195
13.4K
Abdala
Abdala@ofabdalaX·
3x via MTP holds when the acceptance rate stays above 70% on long prompts. ran Gemma 4 27B here: the real gain on short chat sits at 2.4x; in an agentic loop with tool calling it drops to 1.8x because the draft diverges more. the detail that changes the decision: MTP on real workloads rarely matches the paper claim, but 1.8-2.4x still changes the economics.
0
0
1
216
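The acceptance-rate threshold above follows from the standard geometric model used in speculative/multi-token decoding papers. This is that textbook model, not the tweet's Gemma 4 measurements; `draft_cost` (per-draft-token cost relative to a full forward pass) is an assumed parameter.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens accepted per verification step under the geometric
    acceptance model: (1 - a^(k+1)) / (1 - a) for acceptance rate a, draft length k."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Rough wall-clock speedup: tokens gained per step divided by the cost of
    that step (one full pass plus k drafts). A hypothetical cost model, not a benchmark."""
    return expected_tokens_per_step(accept_rate, draft_len) / (1 + draft_len * draft_cost)
```

At 70% acceptance with a 4-token draft, the model yields about 2.8 accepted tokens per verification step, which is why measured gains land around 2x once draft overhead is paid.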
AshutoshShrivastava
AshutoshShrivastava@ai_for_success·
🚨 Google just made Gemma 4 up to 3x faster with MTP ⚡ Same quality, way more speed. It predicts multiple tokens at once and verifies them in parallel, removing latency bottlenecks. You can also run powerful models locally on mobile, like I do, using Google AI Edge Gallery.
13
21
281
21.7K
Abdala
Abdala@ofabdalaX·
multi-modal in File Search closes the last gap for image-heavy corpora. tested Embedding 2 vs E5-v3 on a 1,500-doc KB with screenshots: recall on "find me the diagram about X" queries jumped from 64% to 89%. caveat: chunking strategy still matters more than the embedder for layout-heavy PDFs. great release.
0
0
0
40
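A "recall jumped from 64% to 89%" number on a KB test like the one described above is usually hit-rate recall@k: the fraction of queries whose top-k results contain at least one relevant document. A minimal sketch of that metric, assuming each query has a hand-labeled set of relevant docs:

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries whose top-k ranked results hit at least one
    labeled-relevant doc (hit-rate style recall, common for KB evals)."""
    hits = 0
    for query, ranked in results.items():
        if relevant[query] & set(ranked[:k]):
            hits += 1
    return hits / len(results)
```

Running the same labeled query set against two embedders and comparing recall@k is enough to reproduce the style of comparison in the tweet.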
Logan Kilpatrick
Logan Kilpatrick@OfficialLoganK·
Good news for AI builders: the File Search tool in the Gemini API is now multi-modal 🗃️, powered by our Gemini Embedding 2 model, + support for custom metadata & inline citations : ) File Search comes with storage and embedding generation at query time free of charge!
57
85
904
45.9K
Abdala
Abdala@ofabdalaX·
@GoogleCloudTech unifying GCloud, Firebase, and AI Studio into one hub is the kind of friction reduction that looks like a detail and changes real volume. setup friction is what kills a side project before it becomes a product. did you consolidate billing too, or does each product still bill separately?
0
0
0
129
Google Cloud Tech
Google Cloud Tech@GoogleCloudTech·
Getting started should take seconds, not hours. With the new Builders Hub, access and view all of your Google Cloud, Firebase, and AI Studio projects and apps from a single destination. Plus, receive tailored recs for learning paths to master new tools → goo.gle/4t8H5Up
4
15
108
6.2K
Abdala
Abdala@ofabdalaX·
40 to 18 to 10 is the curve that matters, but miss rate alone hides the false-positive cost. running automated security on monorepos here, false positives burn senior eng hours fast. how does GPT-5.5 trade precision vs recall unsupervised on 200k LOC? that is where Opus 4.6 was actually winning.
0
0
0
38
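The precision-vs-recall tradeoff raised in the reply is easy to make concrete. A sketch with illustrative counts (not XBOW's benchmark numbers): a scanner that misses only 10% of vulnerabilities still drowns reviewers if too many of its findings are false alarms.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Precision = how many flagged findings are real vulnerabilities;
    recall = how many real vulnerabilities were flagged. A 10% miss
    rate pins recall at 0.90 but says nothing about precision."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall
```

With 90 true findings, 60 false alarms, and 10 misses, recall is 0.90 but precision is only 0.60: 4 in 10 findings burn triage time for nothing.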
XBOW
XBOW@Xbow·
In our benchmark, GPT-5 missed 40% of vulnerabilities. Opus 4.6 reduced that to 18%. GPT-5.5 brings it down further to just 10%. That’s not a marginal improvement. Every missed vulnerability is a real-life liability. When you’re running automated security testing, closing that gap matters. Get details in our blog post: bit.ly/48OX7v6
3
4
44
3.7K
Abdala
Abdala@ofabdalaX·
@Designarena @AlibabaGroup @HappyHorseATH 3 of the top 5 coming from Chinese labs changes the economics of the creative pipeline. tested Wan 2.7, MiniMax, and now Happy Horse in a brand-video pipeline here: cost per output dropped 60% vs Sora 2. the current tradeoff is temporal coherence on shots longer than 12s. solve that and the discussion is over.
0
0
0
149
Design Arena
Design Arena@Designarena·
BREAKING: Happy Horse 1.0 by @AlibabaGroup is #4 on Video Arena with an Elo of 1296! With 3 of the top 5 video models now coming from non-Western labs, there is a real shift in where cutting-edge video AI is being built Congrats to the @HappyHorseATH and @AlibabaGroup team on the launch!
1
6
50
2.3K
Abdala
Abdala@ofabdalaX·
3.2x speedup on long-horizon tasks is the number that matters. spatial memory and multi-step planning are where Opus 4.7 still wins at chess/sokoban. I ran the same set: GPT-5.5 improved on state recall but fails at backtracking when it needs to undo 5+ moves. a dimension Pokémon doesn't capture.
0
0
1
76
Clad3815
Clad3815@Clad3815·
I've been re-running the "GPT plays Pokémon" benchmark with GPT-5.5, this time on Pokémon Emerald against GPT-5.2 (didn't have time to let GPT-5.4 play Emerald, @OpenAIDevs releases models too fast). GPT-5.5 demolished it. → Hall of Fame in 2.8 days vs 9.0 days for GPT-5.2 → 3.2× faster overall → 2.2× fewer in-game actions (6,775 vs 15,054) Same prompt, same ROM, same tools. Only the model changed. From what I've observed, GPT-5.5 is a beast at spatial reasoning, navigation, and puzzle solving. GPT-5.2 brute-forces its way to Champion. GPT-5.5 actually thinks before pressing buttons.
19
14
208
13.6K
Abdala
Abdala@ofabdalaX·
5.7M to 163M in 1 week is a hype curve, not a replacement. downloads aren't active usage. here (18 products in production) Codex dominates fast ideation; Claude Code keeps refactors and long sessions. the honest metric would be DAU times hours/session, not downloads. "Anthropic is cooked" is premature.
0
0
0
67
VraserX e/acc
VraserX e/acc@VraserX·
OpenAI really cooked with Codex and GPT 5.5. @openai/codex going from 5.7M to 163M weekly npm downloads in one week is absolutely insane. Anthropic is cooked.
43
23
495
41.8K
Abdala
Abdala@ofabdalaX·
@ChatGPTapp the add-on inside Sheets kills the copy/paste overhead. tested it on a 12k-row spreadsheet with messy headers; the formula suggestions got it right in 2 prompts. the limitation that showed up: it loses context between tabs of the same workbook. cross-sheet awareness is the next gap.
0
0
0
195
ChatGPT
ChatGPT@ChatGPTapp·
ChatGPT is now available as an add-on in Excel and Google Sheets. It can help analyze messy data, write formulas, update spreadsheets, and explain what it’s doing along the way—without leaving your spreadsheet. Powered by GPT-5.5. chatgpt.com/apps/spreadshe…
152
467
5.1K
548.7K
Abdala
Abdala@ofabdalaX·
@OpenAI Instant rollout staged across plans is the right move. tested 5.5 vs 5.4 on consumer-style prompts (parenting, recipe planning) and the 'concise + warmer' delta is real. question for the team: does Instant share weights with the API tier or is it a separate fine-tune?
0
0
0
97
OpenAI
OpenAI@OpenAI·
GPT-5.5 Instant is starting to roll out in ChatGPT. It’s a big upgrade, giving you smarter, clearer, and more personalized answers in a warmer, more natural tone. And it's also more concise, which we heard you wanted. We think you'll love chatting with it.
558
946
9.4K
1.5M
Abdala
Abdala@ofabdalaX·
prompt enhancement + research + reference gathering AT THE API LEVEL changes who needs to orchestrate an image pipeline. devs no longer have to stitch 4 calls together (gen + ref + edit + describe). the mid pricing tier opens direct competition with Veo and Sora. testing this week: what is the cross-shot coherence ceiling?
0
0
0
70
Luma
Luma@LumaLabsAI·
The Uni-1.1 API is live today. Built-in prompt enhancement, research, and reference gathering at the API level. Trained in collaboration with Hollywood cinematographers, VFX artists, and world-class artists across cultural forms. Less than half the price and latency of comparable models. Designed for builders shipping in production — and ranked top 3 lab in the Image Arena across Text-to-Image and Image Edit. Start Building → lumalabs.ai/api
103
73
443
310.7K
Abdala
Abdala@ofabdalaX·
TS + a sandbox separate from the harness are the two missing pieces for agentic in prod outside research. the Node ecosystem already has the maturity to hot-swap deps mid-task; Python usually chokes on that. an open-source harness matters more than a built-in sandbox if the use case is multi-provider.
0
0
0
83
Abdala
Abdala@ofabdalaX·
@NVIDIADC 20x lower cost per token changes who can close margins on agentic in prod. the detail people forget: 1M-token context on day one means context engineering becomes the only real dev advantage. hardware became a commodity; prompt architecture is what differentiates output. post-model era, not post-GPU.
0
0
0
163
NVIDIA Data Center
NVIDIA Data Center@NVIDIADC·
Agentic AI is changing the rules for inference. With DeepSeek V4, NVIDIA Blackwell delivered 20x lower cost per token out of the box, running a 1.6T parameter MoE model with a 1M token context on day one. But the real story is how: NVIDIA is the only platform co-designed end-to-end across five rack-scale systems—engineered to operate as a unified AI factory rather than a collection of discrete components. That’s what enables: → Higher throughput for agentic workloads → Lower latency across multi-step reasoning loops → Sustained improvements in token economics over time As AI factories scale, cost per token becomes the metric that matters and extreme co-design is the advantage that compounds. 📗 nvda.ws/3OJ5j9F
13
49
317
15.8K
Abdala
Abdala@ofabdalaX·
@jyangballin @KLieret scratch + no internet + executable-only is the most brutal setup for an agentic benchmark. SWE-bench measured fix-the-bug; ProgramBench measures design-from-zero. the gap between the two is where the real Claude/GPT difference in prod lives. testing this week: what is the current top model's ceiling?
0
0
0
123
John Yang
John Yang@jyangballin·
How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
86
210
1.3K
572.4K
Abdala
Abdala@ofabdalaX·
24GB RAM as the floor for local agentic coding changes who can run a squad without the cloud. tested Gemma 4 + Qwen3.6 with self-healing here last week on an M2 32GB: 4 chained tools, 87% recovery rate with no manual intervention. critical detail: the harness prompt template matters more than the model.
0
0
0
254
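"Self-healing tool calls" as described in the guide above generally means retrying a failed call after asking the model to repair its own arguments. A minimal harness-agnostic sketch, where `tool` and `llm_fix` are hypothetical stand-ins for the harness's tool executor and its repair prompt:

```python
import json

def self_healing_call(tool, args_json: str, llm_fix, max_retries: int = 2):
    """Run a tool call; on malformed or mismatched JSON arguments, ask the
    model (llm_fix) to repair the argument string and retry."""
    for attempt in range(max_retries + 1):
        try:
            return tool(**json.loads(args_json))
        except (json.JSONDecodeError, TypeError) as err:
            if attempt == max_retries:
                raise  # give up: surface the error to the harness
            args_json = llm_fix(args_json, str(err))  # model repairs its own call
```

The recovery rate of a loop like this depends heavily on how the error is fed back to the model, which is one reason the harness prompt template can matter more than the model itself.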
Unsloth AI
Unsloth AI@UnslothAI·
We made a guide on how to run open LLMs in Claude Code, Codex and OpenClaw. Use Gemma 4 and Qwen3.6 GGUFs for local agentic coding on 24GB RAM Run with self-healing tool calls, code execution, web search via the Unsloth API endpoint and llama.cpp Guide: unsloth.ai/docs/basics/api
38
220
1.2K
76.6K
Abdala
Abdala@ofabdalaX·
@arena 5M+ votes as a routing signal is what closed routers are missing. the detail that will matter in prod: latency-controlled performance across modalities requires a pre-learned inference budget. can Max switch modality without warm-up cost, or is there still a cold start cross-modality?
0
0
0
93
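"Latency-controlled" routing reduces to picking the strongest model that fits a latency budget. A toy sketch with made-up model names and latency estimates, not Arena's Max implementation:

```python
# Hypothetical per-model latency estimates (seconds), learned offline
# from past requests rather than measured per call.
LATENCY = {"text-small": 0.4, "text-large": 2.1, "vision": 1.6}

def route(candidates: list[str], budget_s: float) -> str:
    """Pick the first (strongest) candidate whose learned latency fits the
    budget; fall back to the fastest candidate if none fits.
    `candidates` is ordered strongest-first."""
    for model in candidates:
        if LATENCY[model] <= budget_s:
            return model
    return min(candidates, key=LATENCY.get)
```

The cold-start question in the reply is exactly about the `LATENCY` table: a new modality has no learned estimates yet, so the budget check has nothing reliable to compare against.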
Arena.ai
Arena.ai@arena·
Max, Arena's model router powered by 5M+ community votes, is now multimodal. Starting today, Max is the default in Direct chat across every modality: search, vision, image generation, image editing, and front-end coding with the same latency-controlled performance as the original router for text. Learn more about Multimodal Max in thread.
11
7
111
7.5K
Abdala
Abdala@ofabdalaX·
SFT+RL with a weak supervisor matching full capability is the practical result that opens this door. critical detail: the weak supervisor needs to cover the task distribution well enough to detect sandbagging zero-shot. does the paper address how to define that coverage, or is it left empirical?
0
0
0
27
Anthropic
Anthropic@AnthropicAI·
As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:
Emil Ryd@emilaryd

New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.

129
148
1.5K
203.9K
Abdala
Abdala@ofabdalaX·
ironically, I've had a product with the same name (Orbit) for 4 months, a multi-agent canvas. naming clash ahead. on the predictions: a persistent agent runtime (Conway) is what unlocks the rest. webhooks + always-on changes how much you can delegate. proactive without persistent is just reactive in disguise.
0
0
1
71
Dan McAteer
Dan McAteer@daniel_mac8·
Anthropic's "Code with Claude" dev conference is TOMORROW. Time to get HYPED. Where's the hype? Predictions: > Orbit: Proactive agent in Claude Cowork that does your work before you can think of it. Details in qt from TestingCatalog. > Conway: persistent agent runtime. Always-on with webhooks and extensions. Video below. > Sonnet 4.7: Opus 4.6 level capability at a fraction of the price. Designed for knowledge work outside coding. Sonnet 4.7 is the most consequential for knowledge workers and businesses, but Orbit and Conway are the most innovative. CAN'T WAIT!
TestingCatalog News 🗞@testingcatalog

ANTHROPIC 🚨: Claude Cowork will get its own proactive assistant called "Orbit". > Users will get personalized insights from Gmail, Slack, GitHub, Calendar, Drive, Figma, and other apps, which Claude will generate proactively. > There are also mentions of "Orbit" apps, which users will be able to "deploy." > "Your deployed Orbit apps. Pin favorites for quick access." > OpenAI already has ChatGPT Pulse, while both Google and Perplexity are developing their own proactive assistants, too. > There is a high chance it will be released as Max-only. Thanks to @M1Astra and @btibor91 for the tips.

16
11
116
21.8K
Abdala
Abdala@ofabdalaX·
small moves early compound fast. I also started with $23 and borrowed RAM in 2024; today I run 18 products. the detail that changed the game: keeping a daily log of what I learned and reviewing it every Friday. in 90 days you have your own playbook nobody else has. locked in is where it happens.
1
0
0
15
Kappaemme
Kappaemme@Kappaemme1926·
Day 14–15 of turning $23 into a product. The last two days have been insane. Somehow I ended up on the feed of OpenAI’s president and co-founder. Didn’t expect that at all. Got a ton of support. Way more than I thought. And it sparked something. I have a new idea now. Still a Codex skill… but much bigger. We’ve already started building it, but this time it feels different. A couple days ago I was just experimenting. Now I’m actually thinking about where this could go. Still figuring things out. But yeah… I’m locked in now.
12
0
30
807
Abdala
Abdala@ofabdalaX·
.claude + .claude-plugin + .cursor + .cursor-plugin + .conductor in the same repo = where the real war is. every IDE/agent wants its own dotfile, and the repo root becomes a config museum. consolidating this into a single schema is still open; the best pattern I've seen is a generic .agent/ with subdirectories.
1
0
1
226
shadcn
shadcn@shadcn·
We're doing it again.
139
113
4.4K
277.7K