95 (@cellestialsea)
153 posts · Joined May 2023 · 150 Following · 5 Followers
95 @cellestialsea:
@MariusHobbhahn Do you think this is due to leakage or something like eval anxiety?
0 replies · 0 reposts · 0 likes · 488 views
95 @cellestialsea:
@obergggg @CryptoClausX100 @BaddiesReality @HBO You are multiplying 20 per month by 12 months, but compounding grows exponentially, not linearly. It's more like 800% (1.2^12) for 1 year, the square of that for 2 years, etc.
0 replies · 0 reposts · 0 likes · 28 views
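The arithmetic in the reply above is easy to check directly: a 20% monthly gain compounded for 12 months multiplies the principal by 1.2^12 ≈ 8.9, versus the naive linear figure of 20% × 12 = 240%. A minimal sketch:

```python
# Compare naive linear growth (20% * 12) with monthly compounding (1.2^12).
monthly_rate = 0.20

linear_total = monthly_rate * 12            # 2.4, i.e. the naive "240%" figure
compound_factor = (1 + monthly_rate) ** 12  # ~8.92x after one year

print(f"linear gain over 1 year: {linear_total:.0%}")
print(f"compound factor, 1 year: {compound_factor:.2f}x")
# Two years is the square of the one-year factor, as the tweet says.
print(f"compound factor, 2 years: {compound_factor ** 2:.1f}x")
```

The 8.9x factor corresponds to roughly a 790% gain over the starting amount, consistent with the tweet's rough "800%".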
HBO @HBO:
Can she do a payment plan? #Euphoria
116 replies · 265 reposts · 11.6K likes · 10.6M views
95 @cellestialsea:
@yoavgo Some mathematicians and physicists do care about aesthetics and simplicity more than practicality. I would guess this resonates with them.
0 replies · 0 reposts · 0 likes · 1.6K views
(((ل()(ل() 'yoav))))👾 @yoavgo:
Is the EML result surprising or even interesting to mathematicians? It gives me vibes of some triviality that one would give undergrads to prove as the first problem in a homework set, rather than "a big thing". But then again, I am terrible at judging these things and it might actually be incredibly hard. So which is it?
60 replies · 5 reposts · 218 likes · 47.9K views
zugzwang @drdomicile:
@cellestialsea @teortaxesTex Nah bro, Opus 4.6 1M was rolled out almost a month later claude.com/blog/1m-contex… (OAI plans pictured also for reference). There's no way they don't have a >=1M 5.4 Pro internally. Not sure about Google tbh; I've basically only ever used their models through the web interface.
[image attached]
1 reply · 0 reposts · 1 like · 49 views
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex:
DeepSeek has subtly updated the web interface (adding 💎 and ⚡️ symbols) and it seems it has… finally rolled out actual V4. This Expert finally feels different. It still fails AIME26-15, but *because* it tries to cheat and remembers the wrong answer, after finding the real one.
[image attached]
Quoting @teortaxesTex:
In fairness, DeepSeek-v3.2-Exp (Think) had one success too. So it's not a big leap. No pre-AIME model solves it. Expert has failed AIME-2026's #15, in almost 2000 seconds. I still believe these are the same model. Too similar, both in verbal patterns and in capabilities.
11 replies · 12 reposts · 258 likes · 43.4K views
95 @cellestialsea:
@drdomicile @teortaxesTex For the A/B tests it's possible and logical, and DeepSeek also reportedly has compute issues, but afaik Opus 4.6 came out with 1M directly rather than being released with 128K first. I also don't remember this for any Google model, though I don't remember releases from 2 years ago that well.
1 reply · 0 reposts · 1 like · 70 views
95 @cellestialsea:
@drdomicile @teortaxesTex Not sure if this is usually the case with Chinese labs, but I don't remember it happening a lot with the US labs. OAI maybe(?), but they have always been a bit ambiguous about their context windows anyway. Opus 4.6 was the exact opposite of this, though; it went up to 1M. So idk.
1 reply · 0 reposts · 1 like · 62 views
zugzwang @drdomicile:
@cellestialsea @teortaxesTex I'm guessing it's a rollout of V4 with a compressed context window? Big labs do this often with their flagships.
1 reply · 0 reposts · 2 likes · 73 views
95 @cellestialsea:
@teortaxesTex Unless you mean something else by context length?
0 replies · 0 reposts · 0 likes · 56 views
95 @cellestialsea:
@teortaxesTex I thought DeepSeek was great at the context-window stuff; why the regression?
3 replies · 0 reposts · 0 likes · 478 views
95 @cellestialsea:
@tenobrus @segyges It's problematic because alignment isn't defined. We don't know what alignment looks like; the cleanest mapping is reading the model's "goals", but it is not very clear that LLMs have implicit goals other than predicting the next token aligned with what their corpus would produce.
1 reply · 0 reposts · 2 likes · 49 views
Tenobrus @tenobrus:
I think there are a lot of reasonable objections to this, e.g. that it may never be tractable to have a theory of alignment without further progress on capabilities. Certainly MIRI failed, and there's been infinitely more progress on alignment in the last 5 years, now that we have actual models to work with, than in the prior 20. My opinion is something like: if we had paused at GPT-4 we would probably be similarly stuck, or at least massively delayed, but the closer we get to AGI with our current methods, the more tractable it is. I think it's very likely that, given 30 years of the world taking alignment incredibly seriously and treating it as gating potential utopia, we could do this.
6 replies · 0 reposts · 25 likes · 905 views
SE Gyges @segyges:
I will consider supporting an AI pause if pause advocates can tell me under what conditions they would want to see a pause relaxed.
Quoting @segyges:
@SOPHONTSIMP @tenobrus Yeah, there are literally no specific criteria that the pause is predicated on, and no specific criteria for lifting it. We would basically be waiting for the generation that believed in a pause to get old and die.
20 replies · 0 reposts · 50 likes · 5.7K views
Mark Killingback @cubes123:
@mhdcode I run stuff like Qwen 122B with most of it in system RAM (I have 128GB) and as much as will fit in my 16GB of VRAM. It runs just about acceptably.
2 replies · 0 reposts · 0 likes · 98 views
MHD @mhdcode:
sorry local AI folks but i'm not dropping $4k on this high end gpu just to run last year's models. seriously, GPT-OSS-20B??
[image attached]
60 replies · 1 repost · 176 likes · 25.3K views
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex:
If DeepSeek really has spent a year on unsuccessful/non-scalable experiments due to lacking compute, while the competition was just productively distilling Claude and doing IPOs, it's kind of over, at least until new hardware comes online. I really hope this didn't happen.
12 replies · 6 reposts · 356 likes · 18.7K views
95 @cellestialsea:
@eptwts Just in case you are serious: not every smart person wants to maximize their life for wealth. Just because it's what you want out of life doesn't mean it's what everyone wants out of life, not least the smartest people.
1 reply · 0 reposts · 1 like · 121 views
EP @eptwts:
[image attached]
7 replies · 2 reposts · 155 likes · 7.8K views
95 @cellestialsea:
@AndrewCurran_ The counterargument is GPT Pro. People want good and relatively fast models; the very expensive and very slow but godly model currently doesn't have extraordinary consumer demand. Not sure why a huge model like Mythos would be so different.
0 replies · 0 reposts · 3 likes · 369 views
Andrew Curran @AndrewCurran_:
There has been a great deal of speculation about why Anthropic is keeping Mythos in restricted release. One of the least-discussed reasons is cost. Not the cost to Anthropic of serving the model, but the downstream effects that cost will have on the industry, and on the world.

Mythos is now being served to a small group of about 50 major companies. For organizations like these, token budgets are effectively unlimited, and the opportunity cost of not using as much of the model as possible is too high. I think you can already see the downstream effects even in this limited release. Claude users complain about hitting caps faster. They complain about degraded performance. For months now almost everyone I know has been continuously hitting the cap on Claude or Codex. The existence of Mythos pressures not just the amount of usage available to smaller subscribers, but also the pricing of these plans themselves, which are already subsidized. Smaller users will get hit twice.

The compute cost of serving Mythos exerts pressure all the way down the line. Inference will get cheaper over time, but demand is already ahead of that curve and continues to expand. Mythos is not the end of this chain. As long as scale keeps rewarding larger runs, larger models will keep being trained. The next model that makes a Mythos-like jump may be dramatically larger again, and much more expensive to serve. If the cost of serving frontier models continues to outpace attempts to reduce it, then smaller players and public use get squeezed out. We end up with vast models, served at immense cost, available only to the richest corporations on earth. Those firms then use that access to outcompete smaller rivals, become richer still, and widen the gap again. If this continues, a small number of giant companies end up holding the only passports to the Country of Geniuses in a datacenter. For Anthropic, culturally, this is not a desirable world.

Part of their reluctance to serve Mythos more broadly comes from a reluctance to help bring this world into being. There may be no way to serve a model like Mythos at scale right now without beginning this feedback loop. And as that loop accelerates, it will generate great resentment. If they serve it to lower-tier subscribers, those users get a handful of exchanges before hitting the cap. Seeing how capable the model is only deepens the resentment, because access is visibly rationed. The labs will be forced to make a trickle-down argument: let the largest firms use the models first, and the abundance will eventually spread to everyone else. The public is unlikely to buy this argument. The hostility and pushback against the industry will spiral. Eventually it may not remain merely political.

It is not only Dario who has seen this world, but Sam as well. That is part of why OpenAI has started talking about mechanisms that would give ordinary citizens a direct stake in the upside of the industry, like the Public Wealth Fund. In my opinion the original use case of Worldcoin was a global UBI in a future where OpenAI won the race. Not only is that future no longer certain, but the trust and solidarity required to support a UBI no longer seem to me to exist in the West. The only path then is simply to scale everything as quickly as possible and hope abundance eventually arrives in a cascade strong enough that it reaches everyone on earth.

To my friends who are in the safety camp, I understand this argument is hard to accept. Please consider that there is a level of capability beyond which, unless your p(doom) is literally 100, stopping becomes more dangerous than continuing. I think we passed that threshold even before Mythos. Even if stopping were possible (and I personally do not believe it has been for years), stopping here would lock in a dystopia. This dynamic is incentive-driven, just like the race itself, and just as hard to coordinate against.

We must not stop inside this tunnel. The only way out is through.
49 replies · 69 reposts · 634 likes · 53.6K views
95 @cellestialsea:
@MindsAI_Jack @Yeahokay_imsure And as far as I know, the costs of both the parameter count and the context-window length scale quadratically, but in different variables. So there is an equilibrium where the labs need to decide how smart they want the model to be vs. how long the context window should be.
0 replies · 0 reposts · 0 likes · 15 views
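The tradeoff gestured at above can be made concrete with the standard rough per-layer FLOP estimates for a transformer: the attention-score terms grow quadratically in the sequence length n, while the projection/MLP terms (which track parameter count) grow quadratically in the model width d. A minimal sketch, using the usual textbook constants rather than any specific lab's accounting:

```python
def layer_flops(n, d):
    """Rough forward-pass FLOPs for one transformer layer.

    n: sequence length (context window), d: model width.
    Standard rough estimates: 8*n*d^2 for the Q/K/V/O projections,
    4*n^2*d for attention scores + weighted values, 16*n*d^2 for a
    4d-wide MLP.
    """
    projections = 8 * n * d * d
    attention = 4 * n * n * d
    mlp = 16 * n * d * d
    return projections + attention + mlp

d = 4096  # an illustrative model width, not any particular model's
for n in (1_000, 32_000, 1_000_000):
    quadratic_share = 4 * n * n * d / layer_flops(n, d)
    print(f"n={n:>9,}: attention share of layer FLOPs = {quadratic_share:.1%}")
```

At short contexts the n·d² terms dominate, so compute tracks parameter count; past roughly n ≈ 6d the n² attention term takes over. That crossover is the equilibrium between "smarter" (larger d) and "longer" (larger n) that the tweet describes.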
95 @cellestialsea:
@MindsAI_Jack @Yeahokay_imsure Considering we went from 1K to 1M context windows, the intelligence of the models most certainly didn't 1000x (probably).
1 reply · 0 reposts · 0 likes · 14 views
Jack Cole @MindsAI_Jack:
Surprise of the past 4 years: memory is a harder problem than intelligence.
9 replies · 9 reposts · 83 likes · 16.1K views
95 @cellestialsea:
@BBobbity16 @thomasfbloom There is no actual proof that this is the case. Subscriptions are subsidized compared to the API pricing, but as far as I know nobody has shown that the API pricing itself is subsidized relative to the inference cost of running the models.
0 replies · 0 reposts · 0 likes · 16 views
Thomas Bloom @thomasfbloom:
An aspect of using AI to solve maths problems that is rarely discussed is the monetary cost of running these AIs. For example, say an Erdős problem is solved by an AI, and the cost of this run is $10,000. 1/
32 replies · 26 reposts · 325 likes · 84.5K views
kat 🌲🕯️ @capstellium:
my daughter never spoke a full sentence until today. she was holding some toy ducks and said "how many ducks do I have?" and then "hi, my name is joanie" and she's been talking normally in full sentences the rest of the day. it's like she just powered on and realized she's alive
88 replies · 455 reposts · 34.9K likes · 1.7M views
95 @cellestialsea:
@pangramlabs I would like to hear the exact reasoning behind why the writing produced by a model will always be clustered together in "embedding space".
0 replies · 0 reposts · 0 likes · 123 views
Pangram Labs @pangramlabs:
We believe that AI detection will continue to be viable, even in the face of powerful frontier models like Claude Mythos Preview.

When any author, human or LLM, writes a piece of text, they're making decisions. Even in the span of 150 words, an author may make hundreds of thousands of conscious and unconscious decisions about word choice, word order, punctuation placement, and sentence structure. Fundamentally, AI detection is a problem of author identification. No matter how sophisticated a particular model gets, it is still a single author making decisions. These decisions are also constrained: assistant models need to produce text that is helpful, clear, and readable. These traits are ingrained in the model via supervised fine-tuning and reinforcement learning. Even the most sophisticated frontier model is still a single structured system, and it will have identifiable habits and quirks. These models also output a lot of text, which means we have a lot of opportunity to learn what kind of decisions they are liable to make.

People sometimes frame the problem as though "the statistical distance between human and AI writing is shrinking." This is a mischaracterization of what detection does. AI already writes well enough to pass for human to the untrained eye, as we saw in the NYT quiz a few weeks ago. But the writing produced by a model, like the output of any single author, will always be clustered together in embedding space. This is why we believe that AI detection will continue to be viable, even as models become more and more powerful.

Results are looking good. The current Pangram model was able to correctly detect the Mythos Preview short story released in the system card. As long as models are trained systems, we believe detection will remain a solvable problem.
[image attached]
14 replies · 11 reposts · 128 likes · 9.9K views
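The "clustered in embedding space" claim above is an author-identification claim: texts from the same author land closer together under some feature embedding than texts from different authors. A toy illustration of that intuition, using character-trigram frequency vectors and cosine similarity (hypothetical stylometry for intuition only; Pangram's actual features and model are not public in this thread):

```python
from collections import Counter
import math

def embed(text):
    """Map text to a sparse character-trigram frequency vector."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(v * b[k] for k, v in a.items())  # missing keys count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Two snippets in a hedging, assistant-like register...
assistant_1 = "It is important to note that several factors may contribute to this outcome."
assistant_2 = "It is worth noting that a number of factors can influence this result."
# ...and one in a casual register.
casual = "lol no idea why it broke, just restarted the box and it works now i guess"

same_style = cosine(embed(assistant_1), embed(assistant_2))
cross_style = cosine(embed(assistant_1), embed(casual))
print(f"same-register similarity:  {same_style:.3f}")
print(f"cross-register similarity: {cross_style:.3f}")
```

If the same-register pair scores higher, the two assistant-like snippets "cluster": the author-identification idea, scaled down to hand-rolled trigram counts instead of learned embeddings.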