Abdala

141 posts

Abdala
@ofabdalaX

building with AI. Claude Code heavy user. 100M+ tokens.

Brasil · Joined March 2024
71 Following · 11 Followers
Pinned Tweet
Abdala
Abdala@ofabdalaX·
46 min. 84.6k tokens. one session. that was enough for Claude Opus 4.7 to refactor 12 files in AIOS that had sat in the queue for 3 weeks. the rest of the timeline still runs 2-3 parallel accounts to finish 1 task. 3 things I learned running sessions like this:
1. a CLAUDE.md rule saying "run the tests BEFORE saying done" cuts rework by ~50%
2. breaking work into 3-4 sub-tasks bounds context better than 1 giant task
3. a screenshot of the intermediate state helps the model recover if it stalls
here in AIOS I run 512 agents in parallel. the timeline still debates whether agents are worth it. that debate is old news for anyone actually operating.
2
0
3
192
Abdala
Abdala@ofabdalaX·
signed skills change the trust boundary, not just the supply chain. ran 16 modular skills (1 per domain) here for 6 months. biggest win wasn't security, it was reproducibility: the claim "this skill behaved like X yesterday" becomes auditable today. signature plus version pinning is what makes agent QA possible at scale.
0
0
0
88
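The "signature plus version pinning" idea in this exchange can be sketched as a digest check against a local lockfile. Everything below is hypothetical (the lockfile layout, the skill name, the helper function): a minimal sketch of treating a skill as untrusted until its bytes match a pinned hash.

```python
import hashlib

# Hypothetical lockfile: pinned (skill, version) pairs mapped to expected
# SHA-256 digests of the skill's contents.
PINNED_SKILLS = {
    ("summarize", "1.2.0"): "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_skill(name: str, version: str, skill_bytes: bytes) -> bool:
    """Treat a skill as untrusted code until its digest matches the pin."""
    expected = PINNED_SKILLS.get((name, version))
    if expected is None:
        return False  # unknown skill or version: never trusted by default
    return hashlib.sha256(skill_bytes).hexdigest() == expected
```

This is what makes runs reproducible: a verified skill at a pinned version cannot silently change between yesterday's behavior and today's.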
elvis
elvis@omarsar0·
// Skills as Verifiable Artifacts //

Pay attention to this one, AI devs. If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. This paper argues a skill is untrusted code until it is verified. The runtime should enforce that default rather than infer trust from origin.

Without skill verification, HITL has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified.

Skills are now first-class deployment artifacts. We have decades of supply-chain lessons on what happens when trust is inferred from a signature. This paper is the right ask for SKILL.md before agent skill libraries become the next attack surface.

Paper: arxiv.org/abs/2605.00424
Learn to build effective AI agents in our academy: academy.dair.ai
17
44
195
13.4K
Abdala
Abdala@ofabdalaX·
3x via MTP holds when the acceptance rate stays above 70% on long prompts. ran Gemma 4 27B here: the real gain on short chat sits at 2.4x; in an agentic loop with tool calling it drops to 1.8x because the draft diverges more. the detail that changes the decision: MTP on real workloads rarely matches the paper claim, but 1.8-2.4x still changes the economics.
0
0
1
216
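The acceptance-rate threshold above follows from the standard geometric model used in speculative/multi-token decoding papers. This is that textbook model, not the tweet's Gemma 4 measurements; `draft_cost` (per-draft-token cost relative to a full forward pass) is an assumed parameter.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens accepted per verification step under the geometric
    acceptance model: (1 - a^(k+1)) / (1 - a) for acceptance rate a, draft length k."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Rough wall-clock speedup: tokens gained per step divided by the cost of
    that step (one full pass plus k drafts). A hypothetical cost model, not a benchmark."""
    return expected_tokens_per_step(accept_rate, draft_len) / (1 + draft_len * draft_cost)
```

At 70% acceptance with a 4-token draft, the model yields about 2.8 accepted tokens per verification step, which is why measured gains land around 2x once draft overhead is paid.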
AshutoshShrivastava
AshutoshShrivastava@ai_for_success·
🚨 Google just made Gemma 4 up to 3x faster with MTP ⚡ Same quality, way more speed. It predicts multiple tokens at once and verifies them in parallel, removing latency bottlenecks. You can also run powerful models locally on mobile, like I do, using Google AI Edge Gallery.
13
21
281
21.7K
Abdala
Abdala@ofabdalaX·
multi-modal in File Search closes the last gap for image-heavy corpora. tested Embedding 2 vs E5-v3 on a 1,500-doc KB with screenshots: recall on "find me the diagram about X" queries jumped from 64% to 89%. caveat: chunking strategy still matters more than the embedder for layout-heavy PDFs. great release.
0
0
0
40
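A "recall jumped from 64% to 89%" number on a KB test like the one described above is usually hit-rate recall@k: the fraction of queries whose top-k results contain at least one relevant document. A minimal sketch of that metric, assuming each query has a hand-labeled set of relevant docs:

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries whose top-k ranked results hit at least one
    labeled-relevant doc (hit-rate style recall, common for KB evals)."""
    hits = 0
    for query, ranked in results.items():
        if relevant[query] & set(ranked[:k]):
            hits += 1
    return hits / len(results)
```

Running the same labeled query set against two embedders and comparing recall@k is enough to reproduce the style of comparison in the tweet.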
Logan Kilpatrick
Logan Kilpatrick@OfficialLoganK·
Good news for AI builders: the File Search tool in the Gemini API is now multi-modal 🗃️, powered by our Gemini Embedding 2 model, + support for custom metadata & inline citations : ) File Search comes with storage and embedding generation at query time free of charge!
57
85
904
45.9K
Abdala
Abdala@ofabdalaX·
@GoogleCloudTech unifying GCloud, Firebase, and AI Studio into one hub is the kind of friction reduction that looks like a detail and changes real volume. setup friction is what kills a side project before it becomes a product. did you consolidate billing too, or does each product still bill separately?
0
0
0
129
Google Cloud Tech
Google Cloud Tech@GoogleCloudTech·
Getting started should take seconds, not hours. With the new Builders Hub, access and view all of your Google Cloud, Firebase, and AI Studio projects and apps from a single destination. Plus, receive tailored recs for learning paths to master new tools → goo.gle/4t8H5Up
4
15
108
6.2K
Abdala
Abdala@ofabdalaX·
40 to 18 to 10 is the curve that matters, but miss rate alone hides the false-positive cost. running automated security on monorepos here, false positives burn senior eng hours fast. how does GPT-5.5 trade precision vs recall unsupervised on 200k LOC? that is where Opus 4.6 was actually winning.
0
0
0
38
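The precision-vs-recall tradeoff raised in the reply is easy to make concrete. A sketch with illustrative counts (not XBOW's benchmark numbers): a scanner that misses only 10% of vulnerabilities still drowns reviewers if too many of its findings are false alarms.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Precision = how many flagged findings are real vulnerabilities;
    recall = how many real vulnerabilities were flagged. A 10% miss
    rate pins recall at 0.90 but says nothing about precision."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall
```

With 90 true findings, 60 false alarms, and 10 misses, recall is 0.90 but precision is only 0.60: 4 in 10 findings burn triage time for nothing.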
XBOW
XBOW@Xbow·
In our benchmark, GPT-5 missed 40% of vulnerabilities. Opus 4.6 reduced that to 18%. GPT-5.5 brings it down further to just 10%. That’s not a marginal improvement. Every missed vulnerability is a real-life liability. When you’re running automated security testing, closing that gap matters. Get details in our blog post: bit.ly/48OX7v6
3
4
44
3.7K
Abdala
Abdala@ofabdalaX·
@Designarena @AlibabaGroup @HappyHorseATH 3 of the top 5 coming from Chinese labs changes the economics of the creative pipeline. tested Wan 2.7, MiniMax, and now Happy Horse in a brand-video pipeline here: cost per output dropped 60% vs Sora 2. the current tradeoff is temporal coherence on shots longer than 12s. solve that and the discussion is over.
0
0
0
149
Design Arena
Design Arena@Designarena·
BREAKING: Happy Horse 1.0 by @AlibabaGroup is #4 on Video Arena with an Elo of 1296! With 3 of the top 5 video models now coming from non-Western labs, there is a real shift in where cutting-edge video AI is being built Congrats to the @HappyHorseATH and @AlibabaGroup team on the launch!
1
6
50
2.3K
Abdala
Abdala@ofabdalaX·
3.2x speedup on long-horizon tasks is the number that matters. spatial memory and multi-step planning are where Opus 4.7 still wins at chess/sokoban. I ran the same set: GPT-5.5 improved on state recall but fails at backtracking when it needs to undo 5+ moves. a dimension Pokémon doesn't capture.
0
0
1
76
Clad3815
Clad3815@Clad3815·
I've been re-running the "GPT plays Pokémon" benchmark with GPT-5.5, this time on Pokémon Emerald against GPT-5.2 (didn't have time to let GPT-5.4 play Emerald, @OpenAIDevs releases models too fast). GPT-5.5 demolished it. → Hall of Fame in 2.8 days vs 9.0 days for GPT-5.2 → 3.2× faster overall → 2.2× fewer in-game actions (6,775 vs 15,054) Same prompt, same ROM, same tools. Only the model changed. From what I've observed, GPT-5.5 is a beast at spatial reasoning, navigation, and puzzle solving. GPT-5.2 brute-forces its way to Champion. GPT-5.5 actually thinks before pressing buttons.
19
14
208
13.6K
Abdala
Abdala@ofabdalaX·
5.7M to 163M in 1 week is a hype curve, not a replacement. downloads aren't active usage. here (18 products in production) Codex dominates fast ideation; Claude Code keeps refactors and long sessions. the honest metric would be DAU times hours/session, not downloads. "Anthropic is cooked" is premature.
0
0
0
67
VraserX e/acc
VraserX e/acc@VraserX·
OpenAI really cooked with Codex and GPT 5.5. @openai/codex going from 5.7M to 163M weekly npm downloads in one week is absolutely insane. Anthropic is cooked.
43
23
495
41.8K
Abdala
Abdala@ofabdalaX·
@ChatGPTapp the add-on inside Sheets kills the copy/paste overhead. tested it on a 12k-row spreadsheet with messy headers; the formula suggestions got it right in 2 prompts. the limitation that showed up: it loses context between tabs of the same workbook. cross-sheet awareness is the next gap.
0
0
0
195
ChatGPT
ChatGPT@ChatGPTapp·
ChatGPT is now available as an add-on in Excel and Google Sheets. It can help analyze messy data, write formulas, update spreadsheets, and explain what it’s doing along the way—without leaving your spreadsheet. Powered by GPT-5.5. chatgpt.com/apps/spreadshe…
152
467
5.1K
548.7K
Abdala
Abdala@ofabdalaX·
@OpenAI Instant rollout staged across plans is the right move. tested 5.5 vs 5.4 on consumer-style prompts (parenting, recipe planning) and the 'concise + warmer' delta is real. question for the team: does Instant share weights with the API tier or is it a separate fine-tune?
0
0
0
97
OpenAI
OpenAI@OpenAI·
GPT-5.5 Instant is starting to roll out in ChatGPT. It’s a big upgrade, giving you smarter, clearer, and more personalized answers in a warmer, more natural tone. And it's also more concise, which we heard you wanted. We think you'll love chatting with it.
558
946
9.4K
1.5M
Abdala
Abdala@ofabdalaX·
prompt enhancement + research + reference gathering AT THE API LEVEL changes who needs to orchestrate an image pipeline. devs no longer have to stitch 4 calls together (gen + ref + edit + describe). the mid pricing tier opens direct competition with Veo and Sora. testing this week: what is the cross-shot coherence ceiling?
0
0
0
70
Luma
Luma@LumaLabsAI·
The Uni-1.1 API is live today. Built-in prompt enhancement, research, and reference gathering at the API level. Trained in collaboration with Hollywood cinematographers, VFX artists, and world-class artists across cultural forms. Less than half the price and latency of comparable models. Designed for builders shipping in production — and ranked top 3 lab in the Image Arena across Text-to-Image and Image Edit. Start Building → lumalabs.ai/api
103
73
443
310.7K
Abdala
Abdala@ofabdalaX·
TS + a sandbox separate from the harness are the two missing pieces for agentic in prod outside research. the Node ecosystem already has the maturity to hot-swap deps mid-task; Python usually chokes on that. an open-source harness matters more than a built-in sandbox if the use case is multi-provider.
0
0
0
83
Abdala
Abdala@ofabdalaX·
@NVIDIADC 20x lower cost per token changes who can close margins on agentic in prod. the detail people forget: 1M-token context on day one means context engineering becomes the only real dev advantage. hardware became a commodity; prompt architecture is what differentiates output. post-model era, not post-GPU.
0
0
0
163
NVIDIA Data Center
NVIDIA Data Center@NVIDIADC·
Agentic AI is changing the rules for inference. With DeepSeek V4, NVIDIA Blackwell delivered 20x lower cost per token out of the box, running a 1.6T parameter MoE model with a 1M token context on day one. But the real story is how: NVIDIA is the only platform co-designed end-to-end across five rack-scale systems—engineered to operate as a unified AI factory rather than a collection of discrete components. That’s what enables: → Higher throughput for agentic workloads → Lower latency across multi-step reasoning loops → Sustained improvements in token economics over time As AI factories scale, cost per token becomes the metric that matters and extreme co-design is the advantage that compounds. 📗 nvda.ws/3OJ5j9F
13
49
317
15.8K
Abdala
Abdala@ofabdalaX·
@jyangballin @KLieret scratch + no internet + executable-only is the most brutal setup for an agentic benchmark. SWE-bench measured fix-the-bug; ProgramBench measures design-from-zero. the gap between the two is where the real Claude/GPT difference in prod lives. testing this week: what is the current top model's ceiling?
0
0
0
123
John Yang
John Yang@jyangballin·
How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
86
210
1.3K
572.4K
Abdala
Abdala@ofabdalaX·
24GB RAM as the floor for local agentic coding changes who can run a squad without the cloud. tested Gemma 4 + Qwen3.6 with self-healing here last week on an M2 32GB: 4 chained tools, 87% recovery rate with no manual intervention. critical detail: the harness prompt template matters more than the model.
0
0
0
254
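"Self-healing tool calls" as described in the guide above generally means retrying a failed call after asking the model to repair its own arguments. A minimal harness-agnostic sketch, where `tool` and `llm_fix` are hypothetical stand-ins for the harness's tool executor and its repair prompt:

```python
import json

def self_healing_call(tool, args_json: str, llm_fix, max_retries: int = 2):
    """Run a tool call; on malformed or mismatched JSON arguments, ask the
    model (llm_fix) to repair the argument string and retry."""
    for attempt in range(max_retries + 1):
        try:
            return tool(**json.loads(args_json))
        except (json.JSONDecodeError, TypeError) as err:
            if attempt == max_retries:
                raise  # give up: surface the error to the harness
            args_json = llm_fix(args_json, str(err))  # model repairs its own call
```

The recovery rate of a loop like this depends heavily on how the error is fed back to the model, which is one reason the harness prompt template can matter more than the model itself.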
Unsloth AI
Unsloth AI@UnslothAI·
We made a guide on how to run open LLMs in Claude Code, Codex and OpenClaw. Use Gemma 4 and Qwen3.6 GGUFs for local agentic coding on 24GB RAM Run with self-healing tool calls, code execution, web search via the Unsloth API endpoint and llama.cpp Guide: unsloth.ai/docs/basics/api
38
220
1.2K
76.6K
Abdala
Abdala@ofabdalaX·
@arena 5M+ votes as a routing signal is what closed routers are missing. the detail that will matter in prod: latency-controlled performance across modalities requires a pre-learned inference budget. can Max switch modality without warm-up cost, or is there still a cold start cross-modality?
0
0
0
93
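"Latency-controlled" routing reduces to picking the strongest model that fits a latency budget. A toy sketch with made-up model names and latency estimates, not Arena's Max implementation:

```python
# Hypothetical per-model latency estimates (seconds), learned offline
# from past requests rather than measured per call.
LATENCY = {"text-small": 0.4, "text-large": 2.1, "vision": 1.6}

def route(candidates: list[str], budget_s: float) -> str:
    """Pick the first (strongest) candidate whose learned latency fits the
    budget; fall back to the fastest candidate if none fits.
    `candidates` is ordered strongest-first."""
    for model in candidates:
        if LATENCY[model] <= budget_s:
            return model
    return min(candidates, key=LATENCY.get)
```

The cold-start question in the reply is exactly about the `LATENCY` table: a new modality has no learned estimates yet, so the budget check has nothing reliable to compare against.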
Arena.ai
Arena.ai@arena·
Max, Arena's model router powered by 5M+ community votes, is now multimodal. Starting today, Max is the default in Direct chat across every modality: search, vision, image generation, image editing, and front-end coding with the same latency-controlled performance as the original router for text. Learn more about Multimodal Max in thread.
11
7
111
7.5K
Abdala
Abdala@ofabdalaX·
SFT+RL with a weak supervisor matching full capability is the practical result that opens this door. critical detail: the weak supervisor needs to cover the task distribution well enough to detect sandbagging zero-shot. does the paper address how to define that coverage, or is it left empirical?
0
0
0
27
Anthropic
Anthropic@AnthropicAI·
As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:
Emil Ryd@emilaryd

New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.

129
148
1.5K
203.9K
Abdala
Abdala@ofabdalaX·
ironically, I've had a product with the same name (Orbit) for 4 months, a multi-agent canvas. naming clash ahead. on the predictions: a persistent agent runtime (Conway) is what unlocks the rest. webhooks + always-on changes how much you can delegate. proactive without persistent is just reactive in disguise.
0
0
1
71
Dan McAteer
Dan McAteer@daniel_mac8·
Anthropic's "Code with Claude" dev conference is TOMORROW. Time to get HYPED. Where's the hype? Predictions: > Orbit: Proactive agent in Claude Cowork that does your work before you can think of it. Details in qt from TestingCatalog. > Conway: persistent agent runtime. Always-on with webhooks and extensions. Video below. > Sonnet 4.7: Opus 4.6 level capability at a fraction of the price. Designed for knowledge work outside coding. Sonnet 4.7 is the most consequential for knowledge workers and businesses, but Orbit and Conway are the most innovative. CAN'T WAIT!
TestingCatalog News 🗞@testingcatalog

ANTHROPIC 🚨: Claude Cowork will get its own proactive assistant called "Orbit". > Users will get personalized insights from Gmail, Slack, GitHub, Calendar, Drive, Figma, and other apps, which Claude will generate proactively. > There are also mentions of "Orbit" apps, which users will be able to "deploy." > "Your deployed Orbit apps. Pin favorites for quick access." > OpenAI already has ChatGPT Pulse, while both Google and Perplexity are developing their own proactive assistants, too. > There is a high chance it will be released as Max-only. Thanks to @M1Astra and @btibor91 for the tips.

16
11
116
21.8K
Abdala
Abdala@ofabdalaX·
small moves early compound fast. I also started with $23 and borrowed RAM in 2024; today I run 18 products. the detail that changed the game: keeping a daily log of what I learned and reviewing it every Friday. in 90 days you have your own playbook nobody else has. locked in is where it happens.
1
0
0
15
Kappaemme
Kappaemme@Kappaemme1926·
Day 14–15 of turning $23 into a product. The last two days have been insane. Somehow I ended up on the feed of OpenAI’s president and co-founder. Didn’t expect that at all. Got a ton of support. Way more than I thought. And it sparked something. I have a new idea now. Still a Codex skill… but much bigger. We’ve already started building it, but this time it feels different. A couple days ago I was just experimenting. Now I’m actually thinking about where this could go. Still figuring things out. But yeah… I’m locked in now.
12
0
30
807
Abdala
Abdala@ofabdalaX·
.claude + .claude-plugin + .cursor + .cursor-plugin + .conductor in the same repo = where the real war is. every IDE/agent wants its own dotfile, and the repo root becomes a config museum. consolidating this into a single schema is still open; the best pattern I've seen is a generic .agent/ with subdirectories.
1
0
1
226
shadcn
shadcn@shadcn·
We're doing it again.
139
113
4.4K
277.7K