kai

1.1K posts

@knightwolf30

llm eval & data from real world

Joined June 2012

347 Following · 49 Followers

kai retweeted

Tatsunori Hashimoto@tatsu_hashimoto·21 May

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

146

1.2K

205K

kai retweeted

OpenAI@OpenAI·20 May

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

1.1K

3.9K

26.7K

13.3M

kai retweeted

Lilian Weng@lilianweng·19 May

We would love to see more collaboration and research in the field of human-AI interactivity. Check it out!

Thinking Machines@thinkymachines

We are offering grants of $100,000 + Tinker credits to researchers advancing the field of human-AI interactivity. Submit your proposals by June 19th! thinkingmachines.ai/news/interacti…

410

94.4K

kai retweeted

Lun Wang@lunwang1996·18 May

I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. wanglun1996.github.io/blog/your-eval…

195

1.8K

578.1K

kai retweeted

Zac Valles@zacharyvalles·16 May

72 hours after YC demo day, I moved to Shenzhen for 8 weeks 🤠 I'm headed back to SF with new hardware in hand (sharing more soon), but some takeaways documented below:

> If you have even the slightest ambition to found a hardware company, visit SZ. Pre-raise, pre-team, pre-idea, pre-job departure, it doesn't matter. Just go.

> Plan your visit according to a major conference that interests you. Use that conference as a supplier meeting springboard - that's your ticket to any factory under the sun.

> At the factories, ask about lead times, don't ask about cost (wait on this). Your iteration rate is driven by the lead time on the longest lead time item in your assembly. It pays to identify these parts early to build project timelines.

> Visit Huaqiangbei (read: this is a mini-city, not a building). Robotic subassemblies, batteries, chassis, electronic parts. They all have buildings where vendors are tightly clustered. Plan to spend 4-6 hours walking around before you find exactly what you're interested in.

> Business relationships are valuable commodities. Treat them as such. Pay attention to people, learn about them. Bring thoughtful gifts. Wait for them to sit first. With baijiu, fill the glass, but with tea, leave some room. Cultural customs are fun to learn, but also convey a seriousness towards the working relationship.

> Suppliers fit cleanly into discrete buckets. Level of complexity and execution on past projects indicates what is in scope for them. Trivial, but important to level your build expectations. It is easy to design a part with 12 subsequent manufacturing processes, exceptionally hard to find a supplier to fill this order.

If you need coffeeshop recs, food recs, or hotel recs I have a few. Move to Shenzhen! Get to building!

1.4K

315K

kai retweeted

Nick@nickbaumann_·15 May

My laptop has become a “satellite device” since I started using Codex from my phone. And my Mac mini has become the “home.” It’s clunky, but the end state feels more like how we’re going to be working in the near future:

I’m currently running the Codex app on 2 devices:

1. my MacBook
2. my Mac mini

My laptop isn’t reliably connected to Wi-Fi enough, so I keep a Mac mini on my desk that is always connected. When I kick off new threads from my phone, I start them on the Mac mini. When I’m working from my desk, I run them there too.

The cool part is that I’ve added my MacBook and Mac mini as connected devices to each other. That means I can start and resume threads from either device. So if I’m in a meeting but want to continue a thread on my laptop that was started on my Mac mini, I can do that.

I’ve also set up mutual SSH for Mac mini <> MacBook, so files are easy to access from either side. It’s not fully seamless yet, but the model works.

What this means:

- I have an always-on Codex that is accessible from my phone, with its own dev environment
- All threads are always accessible from any of the 3 devices
- I can run heartbeat threads that stay on 24/7

It’s a little makeshift today, but the shape of it feels very real to me: Codex is no longer tied to whichever computer happens to be open in front of me. It starts to feel like something I can stay connected to across whatever device I’m using.

120

109

1.8K

412.5K

kai@knightwolf30·13 May

@tydsh Congrats!!

495

Yuandong Tian@tydsh·13 May

Today we launch Recursive. We are building AI that discovers knowledge automatically and improves itself recursively, an open-ended process that will fundamentally change how science and technology advance. Our 25 top researchers and engineers in San Francisco and London bring diverse expertise spanning agentic AI scientists, architecture and algorithm design, world models, optimization, and interpretability, united by a shared conviction that this is the most important problem we could be working on today. If you are interested in joining, please send your resume to talent@recursive.com. Follow us at @Recursive_SI!

Recursive@Recursive_SI

x.com/i/article/2054…

152

1.4K

168.9K

kai retweeted

Deedy@deedydas·7 May

The Ultimate List of Artificial Intelligence "Neolabs": May 2026. A Neolab is a pre-revenue scale startup working on long-term AI breakthroughs, usually with a $1B+ valuation. There are now 63 of them!

110

237

2.1K

539.2K

kai retweeted

Banghua Zhu@BanghuaZ·5 May

Excited to launch RadixArk officially today! I have spent the past half year working with RadixArk and the SGLang community, and it has been the most rewarding experience I have had. RadixArk, and the SGLang community, has a very unique engineering culture. The code and the system have the final say. Feedback is direct because everyone trusts the intent. There is very little hierarchy around ideas, and good technical judgment matters more than title or seniority. With a high bar and fast feedback loops, people grow incredibly quickly. In many places, you spend most of your time looking at one company’s stack. Here, through SGLang community, we get to see the forest, not just the trees: many labs, companies, hardware platforms, workloads, and real production systems. There is a lot of exciting work ahead across inference, training, RL, orchestration, kernels, multi-hardware, and many real-world systems problems in between. If you love coding, enjoy building real systems, and want to work on the full AI stack from inference to training, come join us at RadixArk. This is just the beginning.

RadixArk@radixark

Today, we are thrilled to officially launch RadixArk with $100M in Seed funding at a $400M valuation. The round was led by @Accel and co-led by @sparkcapital. RadixArk exists to make frontier AI infrastructure open and accessible to everyone. Today, the systems behind the most capable AI models are concentrated in a small number of companies. As a result, most AI teams are forced to rebuild training and inference stacks from scratch, duplicating the same infrastructure work instead of focusing on new models, products, and ideas. RadixArk was founded to change that. We are building an AI platform that makes it easier for teams to train and serve the best models at scale. RadixArk comes from the open-source community. We started with SGLang, where many of us are core developers and maintainers, and expanded our work to Miles for large-scale RL and post-training. We will continue contributing to both projects and working with the community to make them the strongest open-source infrastructure foundations for frontier AI. We would like to thank our long-term partners, contributors, and the broader SGLang community for believing in this mission. We're also grateful to @Accel and @sparkcapital, NVentures (Venture capital arm of @nvidia), Salience Capital, A&E Investment, @HOFCapital, @walden_catalyst, @AMD, LDVP, WTT Fubon Family, @MediaTek, Vocal Ventures, @Sky9Capital and our angel investors @ibab, @LipBuTan1, Hock Tan, @johnschulman2, @soumithchintala, @lilianweng, @oliveur, @Thom_Wolf, @LiamFedus, @robertnishihara, @ericzelikman, @OfficialLoganK, and @multiply_matrix among others. Thanks for the exclusive interview with @MeghanBobrowsky at @WSJ about our vision.

263

25.1K

kai retweeted

Jason Weston@jaseweston·1 May

💎Autodata: an agentic data scientist to create high quality data✨ We introduce a method for building agents that create high-quality training & evaluation data. Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*. We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data. Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods. Overall, we believe this direction has the potential to change how we build AI data! Read more in the blog post: facebookresearch.github.io/RAM/blogs/auto…

103

615

41.6K

kai retweeted

Nathan Lambert@natolambert·1 May

Distillation is largely an industry standard and not just something done by Chinese labs targeting OpenAI/Anthropic. Many American companies also distill Chinese (open) models.

MTS@MTSlive

LIVE TRIAL UPDATE: OpenAI's counsel asked Musk whether xAI has ever "distilled" technology from OpenAI. Musk: "Generally AI companies distill other AI companies." "Is that a yes?" Savitt asked. Musk: "Partly."

582

84.2K

kai retweeted

Shao-Hua Sun@shaohua0116·28 Apr

RL is hitting a ceiling with human feedback. What if the world itself becomes the signal? Introducing RLxF: RL from World Feedback 🌍 workshop at ICML 2026 in Seoul, South Korea! Speakers: David Silver, @chelseabfinn, @jessezhang, @robertarail Web page: sites.google.com/view/rlxf-icml…

278

52.1K

kai@knightwolf30·28 Apr

@joabaum Thanks for sharing! Amazing work

117

kai retweeted

Joachim Baumann @ ICLR'26@joabaum·27 Apr

We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇

476

69.8K

kai retweeted

OpenAI Newsroom@OpenAINewsroom·24 Apr

We’re introducing a Bio Bug Bounty for GPT‑5.5 and accepting applications.

In our ongoing work to strengthen our safeguards for advanced AI capabilities in biology, we’re inviting researchers with experience in AI red teaming, security, or biosecurity to try to find a universal jailbreak that can defeat our 5-question bio safety challenge.

Learn more in our blog ⬇️ openai.com/index/gpt-5-5-…

180

2.2K

201.1K

kai retweeted

Stella Li@StellaLisy·24 Apr

Millions of users now have months-long conversation histories with AI assistants💬 But this data is proprietary and unavailable to the academic community for research, training, or benchmarking. We introduce HorizonBench🌅, a benchmark and data generator for long-horizon personalization: tracking a user's current preferences across a history where life events have silently changed them.

250

19.2K

kai retweeted

OpenAI@OpenAI·23 Apr

Introducing GPT-5.5

A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done.

Now available in ChatGPT and Codex.

2.5K

6.9K

51.7K

13.1M

kai retweeted

clem 🤗@ClementDelangue·22 Apr

We need open traces so that everyone can train open agent models! cc @steipete @badlogicgames @thdxr @matanSF @hwchase17

Anand Kannappan@anandnk24

People are misreading the SpaceX/Cursor deal as an M&A story. It’s actually a bet on what the real bottleneck in frontier coding models is. xAI has struggled to close the gap with Claude Code and Codex. Cursor sits on the best corpus of developer traces in the world. The deal lets Cursor train Composer on Colossus while xAI runs the same recipe on Grok. Both sides find out, at the same time, whether Cursor's data is actually the difference. The option structure reflects that uncertainty. If the training work ports over, SpaceX buys Cursor and owns the pipeline. If it doesn’t, they pay $10B for the experiment and walk. Either outcome, Grok ends up stronger than it would have been, and xAI gets an answer to a question it couldn’t answer internally. The part worth holding onto: a pre-IPO company just priced a live experiment to figure out whether real developer traces are the scarce input in coding agents. $10B is what they’re paying to run it. $60B is what the answer is worth if it comes back yes.

199

46.9K

kai@knightwolf30·22 Apr

If anyone wants to get more out of their own agent sessions, made this: github.com/kai-rayward/cl… I'm often curious about my old Claude Code / Codex sessions (which ones actually worked, what patterns keep showing up, where i'm wasting time), so just built ClawJournal:)
