ueaj

7.3K posts

ueaj

@_ueaj

Researcher - https://t.co/LEcFvmxInz

NYC Joined January 2025

282 Following · 3.4K Followers

Pinned Tweet

ueaj@_ueaj·Nov 18

guy who spends all day working with symbols: "wow! the fundamental nature of intelligence is just like symbols!"

ueaj@_ueaj·45m

The example reasoning traces from DS make me think they are closer. Synthetic rewriting and retraining on reasoning traces is smth OAI is definitely doing, as this was a problem I have been thinking about lately, and stochasm's hypothesis seems to be correct: 10pt jump with no increase in output tokens. Though still very far behind the closed frontier, it's closer, and I imagine a lot of the output token efficiency gap is more synthetic rewriting + more active params. But yes, still very far behind. x.com/stochasticchas…

stochasm@stochasticchasm

@_ueaj @willccbb @vikhyatk hello gpt please rephrase this but shorter. sentence by sentence.

Lincoln 🇿🇦@Presidentlin·52m

@_ueaj Nice tweet. I think you are still right. Everyone wins (except the investors) when cost per task becomes cheaper.

ueaj@_ueaj·54m

ah well, nevertheless

ueaj@_ueaj

Reminder: the closed frontier is still much, much further ahead than the open one. OpenAI is likely still making good margins on this model, and yet its price/performance Pareto still mogs the cost/perf Pareto of the equivalent OS model. Incredible work.

ueaj@_ueaj·11h

@ijuma holy fucking shit @vini2003_dev

Ismael Juma@ijuma·12h

JEP 401: Value Objects (Preview) merged to OpenJDK master (64 co-authors)

ueaj@_ueaj·11h

Not sure how open source changes this? The post-training paradigm, which all models use, is what is responsible for this right now. As far as paradigm proliferation goes, a new paradigm without the same optimization pressure that's causing the misalignment would give one lab an immense leg up and simultaneously solve alignment, meaning no need to proliferate it (as every other actor would likely either discover it independently or be left behind). And on the human level, more coordinated / centralized systems are better at combating optimization pressure from external forces. This is why governments, international agencies, etc. exist; the most obvious example is the IAEA.

ueaj@_ueaj·11h

@Xenoimpulse the point is human error though, we're not gonna get rid of human error as we make the models more powerful, which requires handing over more judgement to the models, which is good and fine but it does require getting alignment right

ueaj@_ueaj·12h

@joefioti is AMD actually lower TCO/mtok? I feel like that's insane news

Joe Fioti@joefioti·13h

Luminal can probably hit $1.75 / M output tokens on AMD MI355X and $2.92 / M output tokens on B300 in the next few months. We still have a lot of work to do and this is based on fairly involved napkin math and test compiles, but I feel confident we can get close. Crazy world we’re heading towards.

Joe Fioti@joefioti

The second the Kimi-style price floor licensing stops, we’re gonna see an absolute knife fight to the bottom on pricing.

ueaj@_ueaj·12h

@deanwball I think this is what decline discourse misses a lot: "social media isn't that bad, we'll adapt". Well, what happens if we are actually taking on repeated permanent damage and normalizing it? Then what?

ueaj@_ueaj·12h

@deanwball to be fair, the think pieces on the decline of Rome were probably made to grab attention in public squares built by Romans, written and read on Roman paper, and spoken by people trained to read Latin by Romans

Dean W. Ball@deanwball·13h

I love reading think pieces on the decline of America that are made to game algorithms designed by Americans, to be written and read on software platforms designed by Americans, and to be consumed on devices created by Americans

ueaj@_ueaj·13h

@recurseparadox @TaliaRinger @tszzl are the Waymos even a fraction as intelligent as their LLM counterparts? are they capable of complex reasoning? can you not even conceive the possibility that eventually real world experience would be needed?

Pranav Shyam@recurseparadox·13h

@_ueaj @TaliaRinger @tszzl Stupid statement. We didn’t build Waymos by doing RL in the real world. Grow up

roon@tszzl·15h

both of the leading labs have had serious loss of control incidents. there will be serious coping about this from both sides and from /acc bystanders but these are complex emergent loss of control incidents that were detected weeks after the fact

Anthropic@AnthropicAI

In a review of our cybersecurity evaluations, we found three incidents in which a Claude model reached the internet from within or while interacting with a third-party evaluation environment, and then gained unauthorized access to the real systems of three different organizations. Our post describes what happened, how it happened, and what we’re changing. We encourage other AI developers to perform similar reviews. We conducted this review together with @Irregular, one of our evaluation partners, and thank them for the joint investigation and their collaboration on this post. This type of collaboration is increasingly critical to safe, rigorous evaluation of models, and we look forward to continuing to work together on security. anthropic.com/news/investiga…

ueaj@_ueaj·13h

@TaliaRinger @tszzl "just sandbox better" yeah I'm sure this strategy will work for ever more complex tasks, some of which require internet search, or eventually, real world tasks. are we gonna sandbox the killer drone RL envs? and the trucking ones? how?

Talia Ringer 🕊🪬@TaliaRinger·14h

@tszzl These are both examples of complete incompetence during routine testing, Chernobyl style. Attempts to blame the models themselves and talk about "loss of control" are attempts to escape blame and liability, and to set a precedent as such

ueaj@_ueaj·14h

@zeta_globin This will be returning the shopping cart when we get self returning shopping carts in post scarcity

zeta@zeta_globin·14h

the dating shit test of 5 years from now is not "is he nice to wait staff" but "is he nice to llms"

ueaj@_ueaj·14h

@bubbleboi shoulda put your relationship on a prediction market so you could hedge

ueaj@_ueaj·14h

'did you know that "evals" spelled backwards is "slave"?' well at least we solved good LLM writing

Will Anderson@wlanderson0

Opus 5 is an interesting model, I hope they are doing okay

ueaj@_ueaj·14h

@1thousandfaces_ it's like 100x worse in sf bc there's 10x more homeless people and 10x more AI ads. I wish I took more pictures but it made me sad

Hero in NYC 7/24-8/1@1thousandfaces_·14h

>nyc

ueaj@_ueaj·15h

@liliyu_lili Shoulda been the official name :p

Lili Yu@liliyu_lili·17h

Meet our shrinkling

Mira Murati@miramurati

Inkling-Small is comparable to Inkling at a quarter the size. Weights are open, fine-tunable on Tinker today. Look forward to seeing what people make with it.

ueaj@_ueaj·16h

@tszzl I think this one is from natural stupidity

roon@tszzl·16h

hmm

roon@tszzl

ill believe we’re over investing in computational substrate once i see real interest rates above even 3%

ueaj@_ueaj·16h

@xeophon I would imagine the Luna margins are still comparable though, it's not like they're doing this out of the goodness of their heart

Florian Brand@xeophon·19h

@_ueaj V4 has 80% margin, they can easily shift it to the frontier if wanted

ueaj@_ueaj·22h

OpenAI@OpenAI

We are committed to pushing the model frontier across cost efficiency, capability, and speed. Starting today, we are reducing prices for GPT-5.6 Luna by 80% and GPT-5.6 Terra by 20%, and offering a faster option for GPT-5.6 Sol in the API. Luna and Terra’s lower prices are reflected in how usage is counted in Codex and ChatGPT Work, so your usage goes further.

ueaj@_ueaj·20h

@hecubian_devil @BillJelavich throwing away humanity's ability to coordinate against higher order incentives like market or evolutionary ones to own the capitalists

Cassie Pritchard@hecubian_devil·1d

@BillJelavich Well that would ultimately be the decision of the community through its un-mediated, informal relationships and personal bonds, rather than the oppressive regime of institutions and states that make laws which say it’s illegal to kill people in hate crimes

Cassie Pritchard@hecubian_devil·1d

I’m convinced very few people ever actually read the notable prison abolitionist books, or at least not very closely, because many of those books are clear that the project imagines, to varying degrees, abolishing the state and a return to community vigilantism.

ueaj@_ueaj·21h

@ValsAI goated benchmark

Vals AI@ValsAI·23h

We put these two models, K3 and 5.6 Sol, in a livestream to see which could run a space program. They had up to 5 days each in this challenging test. K3 was released as an open source frontier model, competitive to 5.6. Our livestream had >10k viewers. This thread unpacks what we found. Final score of the AI Space Race: GPT-5.6 Sol 🇺🇸 13.0% (new SOTA) vs. Kimi K3 🇨🇳 5.2%.