kai

1.1K posts

@knightwolf30

llm eval & data from real world

Joined June 2012
347 Following · 49 Followers
kai retweeted
Tatsunori Hashimoto (@tatsu_hashimoto)
Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.
[image]
31 replies · 146 retweets · 1.2K likes · 205K views
kai retweeted
OpenAI (@OpenAI)
Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.
1.1K replies · 3.9K retweets · 26.7K likes · 13.3M views
kai retweeted
Lun Wang (@lunwang1996)
I’ve left Google DeepMind after an amazing chapter. I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale. As I wrap up this chapter, I wrote down something I’ve been thinking about a lot: evals. We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations. wanglun1996.github.io/blog/your-eval…
55 replies · 195 retweets · 1.8K likes · 578.1K views
kai retweeted
Zac Valles (@zacharyvalles)
72 hours after YC demo day, I moved to Shenzhen for 8 weeks 🤠

I'm headed back to SF with new hardware in hand (sharing more soon), but some takeaways documented below:

> If you have even the slightest ambition to found a hardware company, visit SZ. Pre-raise, pre-team, pre-idea, pre-job departure, it doesn't matter. Just go.

> Plan your visit according to a major conference that interests you. Use that conference as a supplier meeting springboard - that's your ticket to any factory under the sun.

> At the factories, ask about lead times, don't ask about cost (wait on this). Your iteration rate is driven by the lead time on the longest lead time item in your assembly. It pays to identify these parts early to build project timelines.

> Visit Huaqiangbei (read: this is a mini-city, not a building). Robotic subassemblies, batteries, chassis, electronic parts. They all have buildings where vendors are tightly clustered. Plan to spend 4-6 hours walking around before you find exactly what you're interested in.

> Business relationships are valuable commodities. Treat them as such. Pay attention to people, learn about them. Bring thoughtful gifts. Wait for them to sit first. With Baiju, fill the glass, but with tea, leave some room. Cultural customs are fun to learn, but also convey a seriousness towards the working relationship.

> Suppliers fit cleanly into discrete buckets. Level of complexity and execution on past projects indicates what is in scope for them. Trivial, but important to level your build expectations. It is easy to design a part with 12 subsequent manufacturing processes, exceptionally hard to find a supplier to fill this order.

If you need coffeeshop recs, food recs, or hotel recs I have a few. Move to Shenzhen! Get to building!
[image]
99 replies · 79 retweets · 1.4K likes · 315K views
kai retweeted
Nick (@nickbaumann_)
My laptop has become a “satellite device” since I started using Codex from my phone. And my Mac mini has become the “home.” It’s clunky, but the end state feels more like how we’re going to be working in the near future:

I’m currently running the Codex app on 2 devices:
1. my MacBook
2. my Mac mini

My laptop isn’t reliably connected to Wi-Fi enough, so I keep a Mac mini on my desk that is always connected. When I kick off new threads from my phone, I start them on the Mac mini. When I’m working from my desk, I run them there too.

The cool part is that I’ve added my MacBook and Mac mini as connected devices to each other. That means I can start and resume threads from either device. So if I’m in a meeting but want to continue a thread on my laptop that was started on my Mac mini, I can do that.

I’ve also set up mutual SSH for Mac mini <> MacBook, so files are easy to access from either side. It’s not fully seamless yet, but the model works.

What this means:
- I have an always-on Codex that is accessible from my phone, with its own dev environment
- All threads are always accessible from any of the 3 devices
- I can run heartbeat threads that stay on 24/7

It’s a little makeshift today, but the shape of it feels very real to me: Codex is no longer tied to whichever computer happens to be open in front of me. It starts to feel like something I can stay connected to across whatever device I’m using.
[image]
120 replies · 109 retweets · 1.8K likes · 412.5K views
kai (@knightwolf30)
@tydsh Congrats!!
0 replies · 0 retweets · 1 like · 495 views
Yuandong Tian (@tydsh)
Today we launch Recursive. We are building AI that discovers knowledge automatically and improves itself recursively, an open-ended process that will fundamentally change how science and technology advance. Our 25 top researchers and engineers in San Francisco and London bring diverse expertise spanning agentic AI scientists, architecture and algorithm design, world models, optimization, and interpretability, united by a shared conviction that this is the most important problem we could be working on today. If you are interested in joining, please send your resume to talent@recursive.com. Follow us at @Recursive_SI!
Recursive (@Recursive_SI)

x.com/i/article/2054…

88 replies · 152 retweets · 1.4K likes · 168.9K views
kai retweeted
Deedy (@deedydas)
The Ultimate List of Artificial Intelligence "Neolabs": May 2026. A Neolab is a pre-revenue scale startup working on long-term AI breakthroughs, usually with a $1B+ valuation. There are now 63 of them!
[image]
110 replies · 237 retweets · 2.1K likes · 539.2K views
kai retweeted
Banghua Zhu (@BanghuaZ)
Excited to launch RadixArk officially today! I have spent the past half year working with RadixArk and the SGLang community, and it has been the most rewarding experience I have had.

RadixArk and the SGLang community have a very unique engineering culture. The code and the system have the final say. Feedback is direct because everyone trusts the intent. There is very little hierarchy around ideas, and good technical judgment matters more than title or seniority. With a high bar and fast feedback loops, people grow incredibly quickly.

In many places, you spend most of your time looking at one company’s stack. Here, through the SGLang community, we get to see the forest, not just the trees: many labs, companies, hardware platforms, workloads, and real production systems.

There is a lot of exciting work ahead across inference, training, RL, orchestration, kernels, multi-hardware, and many real-world systems problems in between. If you love coding, enjoy building real systems, and want to work on the full AI stack from inference to training, come join us at RadixArk. This is just the beginning.
RadixArk (@radixark)

Today, we are thrilled to officially launch RadixArk with $100M in Seed funding at a $400M valuation. The round was led by @Accel and co-led by @sparkcapital.

RadixArk exists to make frontier AI infrastructure open and accessible to everyone. Today, the systems behind the most capable AI models are concentrated in a small number of companies. As a result, most AI teams are forced to rebuild training and inference stacks from scratch, duplicating the same infrastructure work instead of focusing on new models, products, and ideas.

RadixArk was founded to change that. We are building an AI platform that makes it easier for teams to train and serve the best models at scale.

RadixArk comes from the open-source community. We started with SGLang, where many of us are core developers and maintainers, and expanded our work to Miles for large-scale RL and post-training. We will continue contributing to both projects and working with the community to make them the strongest open-source infrastructure foundations for frontier AI.

We would like to thank our long-term partners, contributors, and the broader SGLang community for believing in this mission. We're also grateful to @Accel and @sparkcapital, NVentures (Venture capital arm of @nvidia), Salience Capital, A&E Investment, @HOFCapital, @walden_catalyst, @AMD, LDVP, WTT Fubon Family, @MediaTek, Vocal Ventures, @Sky9Capital and our angel investors @ibab, @LipBuTan1, Hock Tan, @johnschulman2, @soumithchintala, @lilianweng, @oliveur, @Thom_Wolf, @LiamFedus, @robertnishihara, @ericzelikman, @OfficialLoganK, and @multiply_matrix among others.

Thanks for the exclusive interview with @MeghanBobrowsky at @WSJ about our vision.

31 replies · 28 retweets · 263 likes · 25.1K views
kai retweeted
Jason Weston (@jaseweston)
💎Autodata: an agentic data scientist to create high quality data✨ We introduce a method for building agents that create high-quality training & evaluation data. Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*. We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data. Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods. Overall, we believe this direction has the potential to change how we build AI data! Read more in the blog post: facebookresearch.github.io/RAM/blogs/auto…
[image]
0 replies · 103 retweets · 615 likes · 41.6K views
kai (@knightwolf30)
@joabaum Thanks for sharing! Amazing work
0 replies · 0 retweets · 1 like · 117 views
kai retweeted
Joachim Baumann @ ICLR'26
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇
[image]
14 replies · 78 retweets · 476 likes · 69.8K views
kai retweeted
OpenAI Newsroom (@OpenAINewsroom)
We’re introducing a Bio Bug Bounty for GPT‑5.5 and accepting applications

In our ongoing work to strengthen our safeguards for advanced AI capabilities in biology, we’re inviting researchers with experience in AI red teaming, security, or biosecurity to try to find a universal jailbreak that can defeat our 5-question bio safety challenge.

Learn more in our blog ⬇️ openai.com/index/gpt-5-5-…
94 replies · 180 retweets · 2.2K likes · 201.1K views
kai retweeted
Stella Li (@StellaLisy)
Millions of users now have months-long conversation histories with AI assistants💬 But this data is proprietary and unavailable to the academic community for research, training, or benchmarking. We introduce HorizonBench🌅, a benchmark and data generator for long-horizon personalization: tracking a user's current preferences across a history where life events have silently changed them.
[image]
6 replies · 38 retweets · 250 likes · 19.2K views
kai retweeted
OpenAI (@OpenAI)
Introducing GPT-5.5

A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done.

Now available in ChatGPT and Codex.
2.5K replies · 6.9K retweets · 51.7K likes · 13.1M views
kai retweeted
kai (@knightwolf30)
If anyone wants to get more out of their own agent sessions, made this: github.com/kai-rayward/cl… I'm often curious about my old Claude Code / Codex sessions (which ones actually worked, what patterns keep showing up, where i'm wasting time), so just built ClawJournal:)
0 replies · 0 retweets · 0 likes · 42 views