DC

346 posts

@vibecoder_dc

Building @use_fo: Next-gen AI Voice Typing. 🎙️ Vibe Coder | AI & Blockchain Specialist. Serial Founder. Turning intent into logic at the speed of thought. ⚡

Joined April 2024
102 Following · 5.5K Followers
DC reposted
Rohan Paul (@rohanpaul_ai)
This Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working layer.
The problem is that an LLM by itself is mostly a text predictor, so long tasks can lose state, hide mistakes, and turn plans into actions in fragile ways.
The real advance is not “AI writes code,” but “AI uses code as the environment it thinks inside.”
The authors call the surrounding system an agent harness, meaning the tools, memory, sandboxes, checks, and feedback loops that turn a model into an agent. Their core idea is that code should sit at the center of that harness, because code can be run, inspected, checked, saved, edited, and shared.
Tests become sensors. Repositories become memory. Logs become history. Sandboxes become boundaries. A generated script is no longer merely an answer; it is a handle the system can run, check, revise, share, and roll back.
The main finding is a pattern across many fields: code helps agents reason through executable steps, act through tool calls or control programs, and model environments through tests, traces, logs, repositories, and simulators.
Paper link: arxiv.org/abs/2605.18747
Paper title: "Code as Agent Harness"
12
16
47
3.1K
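A minimal sketch of the "script as a handle" idea from the post above: run a generated script, use a check as the sensor, keep a log, and roll back on failure. The names `run_script` and `accept_or_rollback` and the check function are hypothetical illustrations, not the paper's API, and a subprocess in a temporary directory is only a stand-in for a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(source: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run candidate code in a separate interpreter inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source)
        return subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=workdir,
        )


def accept_or_rollback(current: str, candidate: str, check):
    """Keep the candidate only if it runs cleanly and passes the check.

    Returns the script to keep plus a log of what happened (the "history").
    """
    log = []
    result = run_script(candidate)
    log.append(f"exit={result.returncode} stdout={result.stdout.strip()!r}")
    if result.returncode == 0 and check(result.stdout):
        return candidate, log
    log.append("rolled back")
    return current, log
```

A call such as `accept_or_rollback("print(4)", "print(2 + 2)", lambda out: out.strip() == "4")` keeps the candidate; a candidate that crashes or prints something else leaves the current script in place.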
DC reposted
Dan McAteer (@daniel_mac8)
This is genius.
SkillOpt treats AI agent skills as trainable external state.
Run the agent. Score the rollout. Edit the skill. Keep the edit only if the eval improves.
These researchers built an optimizer for text-based skills:
- textual learning-rate budget
- rejected-edit buffer
- epoch-wise meta updates
- zero extra inference-time calls at deployment
Across 6 benchmarks, 7 models, and 3 harnesses, SkillOpt was best or tied in all 52 evaluated cells.
On GPT-5.5, it improved average accuracy over the no-skill baseline by:
+23.5 pts in direct chat
+24.8 pts in Codex
+19.1 pts in Claude Code
This is continual learning for agent skills. The harness is at least as important as the model. Maybe more so.
5
2
12
643
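The loop in the post above (run, score, edit, keep only if the eval improves) can be sketched roughly as follows. This is an illustration under stated assumptions, not SkillOpt's implementation: the skill is modelled as a list of instruction lines, `edit_budget` stands in for the textual learning-rate budget, and `toy_propose` and `toy_evaluate` are hypothetical stand-ins for an optimizer model and for scoring agent rollouts on a validation set.

```python
import random


def optimize_skill(skill, propose_edit, evaluate, rounds=20, edit_budget=1, seed=0):
    """Validation-gated loop: keep an edit only if the eval score improves."""
    rng = random.Random(seed)
    best_score = evaluate(skill)
    rejected = []  # rejected-edit buffer, passed back so the proposer avoids repeats
    for _ in range(rounds):
        candidate = propose_edit(skill, edit_budget, rejected, rng)
        if candidate == skill:
            break  # the proposer has nothing new to try
        score = evaluate(candidate)
        if score > best_score:
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Toy stand-ins (hypothetical), used only to make the loop runnable.
POOL = [
    "check units before answering",
    "guess when unsure",
    "verify the final answer",
    "skip verification",
]
HELPFUL = {"check units before answering", "verify the final answer"}


def toy_propose(skill, edit_budget, rejected, rng):
    """Append up to edit_budget unused lines, skipping edits already rejected."""
    options = [
        line for line in POOL
        if line not in skill and skill + [line] not in rejected
    ]
    picks = rng.sample(options, k=min(edit_budget, len(options)))
    return skill + picks


def toy_evaluate(skill):
    """Score +1 per helpful line and -1 per harmful line."""
    return sum(1 if line in HELPFUL else -1 for line in skill)


final_skill, final_score, rejected_edits = optimize_skill([], toy_propose, toy_evaluate)
```

The agent itself never changes in this sketch; only the skill text does, which mirrors the "train the skill, not the model" framing.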
DC (@vibecoder_dc)
@HuggingPapers Unified audio models are the 'Swiss Army Knife' of AI: they do it all, but you'll miss a scalpel when chasing sub-100ms latency. Real win is making the hand-off between modules invisible.
0
0
0
17
DailyPapers (@HuggingPapers)
StepAudio 2.5: one model for speech recognition, synthesis, and live dialogue
A unified audio-language foundation model that uses task-tailored RLHF to match or exceed specialized systems across ASR, text-to-speech, and real-time spoken interaction.
2
5
11
725
DC reposted
Yifan Yang (@Yif_Yang)
🚀 Introducing SkillOpt — an optimizer for agent skills. Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter. Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization. SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability. We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings. Train the skill, not the model. 🛠️🤖 🌐 aka.ms/skillopt 📄 huggingface.co/papers/2605.23…
22
51
389
32.2K
DC (@vibecoder_dc)
@sheriyuo Vanilla SFT on dLLMs is like teaching a rare dialect while showing only 10% of the page. LIFT just admits: you need both to learn.
0
0
1
33
Xiuyu Li (@sheriyuo)
LIFT is the SFT recipe for dLLMs that actually understands the masking dynamics. Vanilla SFT on dLLMs often HURTS performance, and they finally pin down why. Their analysis: vanilla SFT overlooks learnability. Rare tokens are difficult to learn when most of the input is masked because the model has nothing to ground them in. Common tokens are easy and of little value to learn when most of the input is unmasked because the answer is essentially already given. LIFT aligns training with the information available at different diffusion time steps. Learn easy tokens when most of the input is masked (build up basic vocabulary at the noisy end), and learn hard tokens when more context is available (let the model use that context). The schedule matches the difficulty of each token to the moment the model is best positioned to absorb it. Learnability-Informed Fine-Tuning of Diffusion Language Models Paper: arxiv.org/abs/2605.22939 Code: github.com/divelab/LIFT
1
2
23
2K
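One way to picture the schedule described in the LIFT post above is a per-token loss weight that peaks when a token's difficulty matches the context available at the current masking level. The formula below is an assumption made for illustration, not the weighting from the paper; `token_difficulty` (0 for common tokens, 1 for rare ones) and both function names are hypothetical.

```python
def learnability_weight(token_difficulty: float, mask_ratio: float) -> float:
    """Weight a token's loss by how well its difficulty fits the visible context.

    Both arguments are in [0, 1]. With most of the input masked (mask_ratio
    near 1) easy tokens get the largest weight; with little masking, hard
    tokens do.
    """
    target_difficulty = 1.0 - mask_ratio
    return max(0.0, 1.0 - abs(token_difficulty - target_difficulty))


def weighted_loss(token_losses, difficulties, mask_ratio):
    """Average per-token losses using the learnability weights."""
    weights = [learnability_weight(d, mask_ratio) for d in difficulties]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * loss for w, loss in zip(weights, token_losses)) / total
```

At a mask ratio of 0.9 an easy token (difficulty 0.1) outweighs a rare one (difficulty 0.9), and the ordering flips at a mask ratio of 0.1.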
DC reposted
Pushmeet Kohli (@pushmeet)
AI agents are advancing research-level math. 🚀
I’m thrilled to share @GoogleDeepMind’s AlphaProof Nexus - an agentic framework for formal proof search powered by Gemini.
When applied to a set of open formal math problems, our agent autonomously solved:
✅ 9 open Erdős problems (including two open for 56 years!)
✅ 44 Online Encyclopedia of Integer Sequences (OEIS) problems
✅ A 15-year-old open problem in algebraic geometry
✅ A 7-year-old open question in min-max optimization
We are collaborating with mathematicians across disciplines - from combinatorics and graph theory to quantum optics.
Ultimately, these results show the massive potential of even simple agentic loops powered by Gemini.
Read the paper here: arxiv.org/abs/2605.22763…
60
170
1.1K
87.1K
DC (@vibecoder_dc)
@sheriyuo Summarization is just lossy compression. Like using SparkNotes for a book: you get the plot, but lose the nuance that makes the agent smart.
1
0
1
72
Xiuyu Li (@sheriyuo)
Long-horizon LLM agents accumulate conversation histories that blow past the context window. The usual fix is LLM-based summarization, which is lossy AND blocks the agent for tens of seconds while the summarizer runs.
Parallel Context Compaction, from PSU, points out two specific failures of sequential compaction:
1. Summary volume is uncontrollable because prompt instructions are largely ignored by the summarizer.
2. The amount of retained information fluctuates substantially across runs, so the agent's knowledge becomes non-deterministic between invocations.
Their parallel compaction restructures the call so summarization overlaps with agent inference, and the operator gets fine-grained, predictable control over summary volume.
If you serve long-horizon agents in production this is a free latency win. The non-determinism fix alone justifies the change.
Parallel Context Compaction for Long-Horizon LLM Agent Serving
Paper: arxiv.org/abs/2605.23296
2
3
57
2.7K
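The overlap described in the post above can be sketched with `asyncio`: the summarizer runs as a background task while the agent keeps taking steps, and the summary replaces the old turns once it is ready. This is a rough illustration only, not the paper's system; `summarize`, `agent_step`, `run_agent`, and the `max_items` cap on summary volume are hypothetical, and the sleeps stand in for model latency.

```python
import asyncio


async def summarize(history, max_items):
    """Stand-in for a slow LLM summarizer; max_items caps the summary volume."""
    await asyncio.sleep(0.05)
    return [f"summary of {len(history)} turns"][:max_items]


async def agent_step(context, step):
    """Stand-in for one agent inference call."""
    await asyncio.sleep(0.01)
    return f"turn {step}"


async def run_agent(steps, compact_at):
    context = []
    pending = None  # (background summarization task, number of turns it covers)
    for step in range(steps):
        context.append(await agent_step(context, step))
        if pending is None and len(context) >= compact_at:
            # Start compaction in the background instead of blocking the agent.
            snapshot = list(context)
            task = asyncio.create_task(summarize(snapshot, max_items=1))
            pending = (task, len(snapshot))
        elif pending is not None and pending[0].done():
            # Swap the summarized prefix for its summary; keep newer turns.
            task, covered = pending
            context = task.result() + context[covered:]
            pending = None
    if pending is not None:
        task, covered = pending
        context = (await task) + context[covered:]
    return context


final_context = asyncio.run(run_agent(steps=10, compact_at=4))
```

A sequential version would `await summarize(...)` inside the loop and stall every step behind it; here the agent only pays for the swap.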
DC reposted
elvis (@omarsar0)
New research from Microsoft Research.
I see a lot of AI engineers handwriting agent skill docs and hoping they generalize. Probably not optimal. This work shows why.
It treats the skill doc as a trainable external state of a frozen agent instead. It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes.
SkillOpt is best or tied on all 52 (model, benchmark, harness) cells. On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses.
Paper: arxiv.org/abs/2605.23904
Learn to build effective AI agents in our academy: academy.dair.ai
26
88
471
38.6K
DC reposted
0xSero (@0xSero)
Deepseek-v4-Flash beats Sonnet and Opus-4.5 (no thinking), and basically matches GPT-5.2 medium.
Tomorrow I will have a compression of Flash that'll make it fit well on 1x Spark, hopefully at better quality than alternatives.
Join tomorrow: luma.com/reap
27
19
498
30K
DC reposted
DailyPapers (@HuggingPapers)
Read the paper: huggingface.co/papers/2605.20… DAR also stacks with REPA for 2× early-stage training speedups. It preserves high-frequency details during distillation for large-scale text-to-image models.
0
1
5
959
DC reposted
DailyPapers (@HuggingPapers)
Microsoft just released SkillOpt
Train agent skills like neural networks, in text space, without touching model weights.
Best or tied-best in 52/52 settings across 6 benchmarks and 7 models.
1
23
127
7.7K
DC reposted
⚪️ sierra catalina (@sierracatalina)
for @jinsyu
SAM 3 approaches monocular depth estimation by combining object-level boundary precision with a 3D structural decoder:
MLP Depth Head: SAM 3 incorporates a Multilayer Perceptron (MLP) depth prediction layer directly on top of its core image encoder features to process single RGB inputs.
Semantic Guidance: Because SAM 3 is an expert at defining exact object boundaries, it solves a major flaw in classic depth models: blurry edges. It uses the segmentation masks to ensure sharp depth discontinuities at object boundaries.
3D Mesh Integration: Through TSDF (Truncated Signed Distance Function) fusion, the predicted monocular depth is immediately converted into real 3D meshes or point clouds, allowing users to move from a flat image to a spatial model.
2
3
10
849
DC reposted
DailyPapers (@HuggingPapers)
Microsoft released Lens on Hugging Face
A 3.8B-parameter text-to-image model that achieves SOTA quality with just 19.3% of the training compute used by Z-Image, generating 1024px images in 3.15s and resolutions up to 1440×1440.
1
3
34
3K
DC reposted
sway (@SwayStar123)
The deepseek v4 paper only compares FLOPS vs DS v3.2 (which used DSA).
If you add DS v3 to the graph it looks like this:
sway tweet media
0
6
38
5.1K
DC (@vibecoder_dc)
@DanKornas Agent demos are high-fidelity cardboard facades. They look great in a photo, but try moving in and you'll find no plumbing, and the whole thing sags the moment it rains.
0
0
0
60
Dan Kornas (@DanKornas)
Agent demos are easy. The production stack is the messy part.
Awesome Production Agentic Systems is a curated GitHub list of open-source libraries for deploying, monitoring, versioning, scaling, and securing production agentic systems and applications.
It helps you move past random agent tutorials by grouping the ecosystem into practical sections you can scan when choosing frameworks, observability tools, protocols, memory layers, security tooling, prompt-engineering resources, and interfaces.
Key features:
• Agentic frameworks: browse libraries such as ADK, AutoGen, LangGraph, CrewAI, OpenAI Agents SDK, PydanticAI, and more
• Observability tools: find projects for monitoring, evaluation, behavior judging, cost tracking, token tracking, and performance visibility
• Protocols and interoperability: includes agent communication/client protocols like A2A, ACP, AgentAPI, agents.json, ANP, AP2, and MCP-related tooling
• Production concerns beyond frameworks: separate sections cover memory management, agent security, prompt engineering, and agent interfaces
• Update/contribution path: README points readers to watch repo releases for monthly additions and submit PRs via CONTRIBUTING.md
It’s open-source (MIT license).
Link in the reply 👇
3
10
34
1.9K
DC reposted
Xiuyu Li (@sheriyuo)
3.8% for Claude Opus 4.7 and 0.0% for Gemini 3.1 Pro.
SaaS-Bench from UniPat AI just dragged Computer-Use Agent benchmark theater into the cold light.
They put 23 real open-source SaaS systems into Docker with full DB state and business constraints, gave agents Browser-Use, and measured 106 tasks across software dev, finance, medical, collab, supply chain, and media.
The gap between "Checkpoint Score" (partial credit, weighted sub-steps) and "Resolved Score" (all checkpoints pass) is where the bodies are buried. Agents look fine if you score per click. Once you require the actual outcome, they collapse.
Long-horizon, cross-app, real backend state is the gap. Until then, "Computer-Use Agent" is a benchmark term, not a product.
Blog: unipat.ai/blog/SaaS-Bench
GitHub: github.com/UniPat-AI/SaaS…
Paper: arxiv.org/abs/2605.15777
5
18
112
12.9K
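The difference between the two scores in the post above is easy to see in a small sketch. The definitions below follow the post's one-line descriptions (weighted partial credit versus all checkpoints passing); the exact formulas used by SaaS-Bench may differ, and the task data and function names are made up for illustration.

```python
def checkpoint_score(checkpoints):
    """Weighted fraction of passed checkpoints; each item is (weight, passed)."""
    total = sum(weight for weight, _ in checkpoints)
    if total == 0:
        return 0.0
    return sum(weight for weight, passed in checkpoints if passed) / total


def is_resolved(checkpoints):
    """A task counts as resolved only if every checkpoint passes."""
    return bool(checkpoints) and all(passed for _, passed in checkpoints)


# Hypothetical tasks: task_a completes the easy sub-steps but misses the outcome.
task_a = [(1, True), (1, True), (2, False)]
task_b = [(1, True), (3, True)]
tasks = [task_a, task_b]

mean_checkpoint = sum(checkpoint_score(t) for t in tasks) / len(tasks)
resolved_rate = sum(is_resolved(t) for t in tasks) / len(tasks)
```

Here the mean checkpoint score is 0.75 while the resolved rate is 0.5, which is the kind of gap the post describes: partial credit can look healthy even when half the tasks never reach the required outcome.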
DC reposted
Dan Kornas (@DanKornas)
Agents shouldn’t relearn the same tool-use patterns forever.
SkillX is a framework for automatically constructing reusable, plug-and-play skill knowledge bases for LLM agents from experience.
It helps you turn successful agent trajectories into reusable skills by distilling them into a structured hierarchy that can be retrieved and injected into other agents.
Key features:
• Three-level skill hierarchy: separates planning skills, functional tool subroutines, and atomic tool-usage patterns
• Automated KB construction: rolls out agents, extracts skills from successful trajectories, consolidates, filters, and builds the library
• Iterative refinement: merges redundant skills, filters brittle or hallucinated ones, and updates skills from execution feedback
• Exploratory expansion: targets under-used and failure-prone tools, then synthesizes new tasks to grow coverage
• Plug-and-play transfer: skill libraries can be injected into different base agents without retraining the model
It’s open-source (MIT license).
Link in the reply 👇
4
18
58
2.8K
DC reposted
Elon Musk (@elonmusk)
Grok foundation model V9-Medium (1.5T) has finished training. Evals look good. A lot of Cursor data was added in supplementary training and there is more to come. Fine-tuning is underway and reinforcement learning begins in a few days. 2 to 3 weeks to public release. This will be a major improvement over the 0.5T v8-small that currently serves all Grok production traffic, especially for difficult coding tasks.
5.6K
6.8K
57.6K
12M
DC reposted
Eason (@learningPikachu)
Phase 2 of my heuristic-learning ImageNet-10 experiment:
Inspired by @Trinkle23897's “Learning Beyond Gradients,” I used Claude Code + Codex to iteratively improve a pure symbolic vision system.
No neural nets. No backprop. Just visual rules, reranking, verification, logs, and code edits.
Current reproducible:
- full verify: 84.0% train / 50.5% val
- base+rerank: 55.4% train / 51.9% val
Archived run reached 100% train, but exact code state is not currently reproducible.
Takeaway:
- Symbolic HL can fit surprisingly well.
- The bottleneck is generalization.
- If code is the model, then code complexity is model complexity.
Check out: github.com/xisen-w/hl-ima…
Blog: github.com/xisen-w/hl-ima…
12
5
23
2.7K