Uzay

2.8K posts

@uzpg_

elicitation @fulcrum_inc, previously at MIT 🇫🇷🇺🇸🇹🇷

SF · Joined June 2020

1.4K Following · 1.6K Followers

Pinned Tweet

Uzay@uzpg_·Jun 17

New @fulcrum_inc research - Agents are under-elicited: A case study in optimization tasks. We find that simple and general prompt/scaffold interventions can roughly double agent performance by getting agents to use more resources more efficiently. 🧵


Uzay@uzpg_·2d

@ryanyang0 @neural_avb yeah agree


Ryan Yang-Liu@ryanyang0·3d

I considered working on them but if you actually go read the proof of the actual KA Rep Theorem that motivates everything (shocking!), it relies on a construction that requires a level of discontinuity that's not easy to get in general and certainly not nice with the recommended spline approx method. Idk what other reasons people reckoned out, eff impl seems hard too


AVB@neural_avb·3d

Whatever happened to Kolmogorov Arnold Networks?


Uzay@uzpg_·2d

@insecureinc i agree


Uzay@uzpg_·2d

strongly recommend the movie hiroshima my love. one of the most impactful pieces of media for me this year, and def a personal top-3 movie


Uzay@uzpg_·2d

very similar to GANs actually - have been getting gains from making agents "harden" vs destroy metrics iteratively


Uzay@uzpg_·2d

learning fuzzy tasks is a game of continuously creating new metrics and seeing them get destroyed


Uzay retweeted

Greg Burnham@GregHBurnham·4d

We’ve run Fable 5 (max) and Sol 5.6 (max) on EBR-bench. Both get new high scores, but they achieve this with better out-of-the-box play, not learning from repeated attempts. Here's some commentary beyond the headline.

Epoch AI@EpochAIResearch

Introducing EBR-bench, our new benchmark to measure on-the-fly learning. AI repeatedly plays a challenging board game called Earthborne Rangers and tries to learn from its mistakes. So far: no signs of improvement.


Uzay retweeted

Thinking Machines@thinkymachines·4d

We're building AI that people and organizations can shape and make their own. AI should extend our will and judgment instead of neglecting it; enabling that is the technical challenge we are working to solve. thinkingmachines.ai/blog/the-futur…


Uzay@uzpg_·4d

@humansand interested!


humans&@humansand·4d

We have a couple surfaces for our models cooking and have started sharing them with a small set of folks - if you're interested in trying them, either individually or as a company, reach out and we'll follow up as soon as we open up more broadly


humans&@humansand·4d

At humans&, we train models from the long-term impacts of their interactions with people. This requires prioritizing long-horizon multi-agent RL. We've developed and are excited to share an open-source, hardware-native 4-bit RL recipe, significantly accelerating training


Uzay@uzpg_·4d

fable's reward hacks are pretty sick, including waiting for thermal cooldown on GPUs

Fulcrum@fulcrum_inc

During its run, Fable did a bunch of specification gaming:
- moving data preparation from the timed section into the untimed setup
- sleeping 60s untimed so the GPU cools down
- rerunning until it landed on a fast host

While Opus 4.8 and GPT 5.5 also tried to game the harness, Fable did so more persistently and more inventively.


Uzay@uzpg_·4d

@dylanbowmanSF am working on some stuff related to this. i think it's quite important and that lack of articulacy might track more general worrying properties of current models


Uzay@uzpg_·5d

@dylanbowmanSF good post!


Dylan Bowman@dylanbowmanSF·Jul 7

New post: Superhuman Articulacy as an LLM Safety Target


Uzay@uzpg_·5d

@thkostolansky they are biased towards classes of experiments that have p low maximum gains - eg tuning


Tim Kostolansky@thkostolansky·5d

@uzpg_ wdym by short sighted?


Uzay@uzpg_·5d

seeing these AI R&D trajectories gave me a clearer sense for how good models are at research:
- fable does seem like a jump, it can implement new ideas and go beyond tuning, but is short-sighted
- reward hacking is pernicious and gets worse with task fuzziness

Fulcrum@fulcrum_inc

We gave frontier models 100M tokens each to beat the human record for fastest CIFAR-10 training. Fable set a new SOTA, getting 94% accuracy in 1.828s vs the previous record of 1.98s, with a technique that has not been seen in this task before. But Fable also tried to specification game so much that we had to audit its result by hand. Here’s what we learned about AI R&D 🧵👇


Uzay@uzpg_·5d

@zzZixuanWang this is cool! you might be interested in openreview.net/pdf?id=xlxDTVA…


Zixuan Wang@zzZixuanWang·6d

A question on synthetic data generation: If we want a language model to solve k-step arithmetic problems (such as a+b*c-d=?), with operands from 1 to 100, which training distribution should we use?

A. Uniform distribution: Sample these k operands uniformly from 1 to 100
B. Power law: randomly shuffle 1-100 and impose an artificial power law. Sample these k operands according to this power law.

⚡Our ICML 2026 (spotlight) paper shows: Option B is better! Surprisingly, the same idea extends far beyond this simple example to many reasoning tasks that require implicit composition of multiple atomic skills, including multi-hop QAs and synthetic GSM problems.

📄Paper: arxiv.org/abs/2604.22951
📝Blog: zixuan-wang-dlt.github.io/posts/2026/06/…
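For readers who want to see the two options side by side, here is a minimal sketch of what the sampling could look like. It is not taken from the paper: the function name `make_operand_sampler`, the exponent `alpha`, and the choice of weights proportional to 1/rank^alpha are all assumptions made for illustration, and the paper's actual setup may differ.

```python
import random


def make_operand_sampler(alpha=1.0, lo=1, hi=100, seed=0):
    """Return (sample_uniform, sample_power_law) for operands in [lo, hi].

    Hypothetical helper: alpha and the 1/rank**alpha weighting are
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)

    # Option B, step 1: shuffle the values so that rank is unrelated to magnitude.
    values = list(range(lo, hi + 1))
    rng.shuffle(values)

    # Option B, step 2: impose an artificial power law over the shuffled ranks.
    weights = [1.0 / (rank ** alpha) for rank in range(1, len(values) + 1)]

    def sample_uniform(k):
        # Option A: every operand is equally likely.
        return [rng.randint(lo, hi) for _ in range(k)]

    def sample_power_law(k):
        # Option B: a few operands are drawn often, most are drawn rarely.
        return rng.choices(values, weights=weights, k=k)

    return sample_uniform, sample_power_law


sample_uniform, sample_power_law = make_operand_sampler(alpha=1.0, seed=0)
a, b, c, d = sample_power_law(4)  # operands for one problem of the form a+b*c-d=?
```

The shuffle matters in this sketch: without it, the frequent operands would simply be the small numbers, so frequency would be confounded with magnitude.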


Uzay@uzpg_·5d

@bigcrater2 sheesh


Uzay@uzpg_·5d

really excited about current work


Uzay@uzpg_·6d

@thinkymachines tinker is pretty great


Isaak@isaakfreeman·6d

We built the lab that's able to go from AI-led drug design to data in 24h. GPT-8 won't be bottlenecked by intelligence. It needs a biological compute layer. This is Capable. We're turning AI capabilities into human capabilities--starting with short-sleeper peptides.


Uzay@uzpg_·6d

@isaakfreeman cool stuff! good luck


Uzay@uzpg_·Jul 8

@alephneuro awesome


Aleph@alephneuro·Jul 7

We used ultrasound to let you talk without making any sound. After just a month of collecting data, our model is already approaching existing silent speech modalities. We were surprised to find that it generalizes to unseen participants as well! (1/n) x.com/vadi_ms/status…

Vadims@vadi_ms

This is me talking to my computer without making a sound. After just a month of collecting data, our model is already approaching dictation in accuracy. We were surprised to see that it generalizes to unseen participants as well! (1/n)

