rihim

5.8K posts

@rihim_s

incoming @EverpureData | computer engineering @UCSB class of 2027 | @ucsbNLP | prev: swe intern @Cisco

Joined November 2020
809 Following · 155 Followers
rihim
rihim@rihim_s·
@HououinTyouma i’m getting the opposite now they say all my ideas are dead ends and suck and i should give up and there’s no point in continuing on it
0
0
0
18
rihim
rihim@rihim_s·
@xeophon wait this is so smart all the coding harnesses use explorer subagents so it makes sense to fine tune a model for that purpose; could see copilot using this as a default explorer agent. maybe cursor distills their next composer model and fine tunes it for this and same for oai/ant
0
0
5
900
Florian Brand
Florian Brand@xeophon·
good stuff from microsoft: 4B model just to explore code bases, cutting token costs by 10-50% (!!!) while the performance of the big model stays the same :)
15
40
610
23K
rihim
rihim@rihim_s·
@_ueaj @AndrewCurran_ that fully makes sense given how limited the interface is to use llms - just writing text strips so much nuance and taste that knowing how to prompt it really matters a lot. seems like fable had that intuition of understanding built in from the other side as well
0
0
2
60
ueaj
ueaj@_ueaj·
I think using LLMs to their fullest is an empathy skill check (i.e. how quickly can you intuit how to communicate/prompt with them) and I consider myself really good at this. So I think I got a uniquely good taste of what the ceiling on performance was and it was obscene. Unbelievable autonomy and understanding on very abstract ideas, very good judgement on vague things that come up during implementation, etc. Probably 2-3x uplift over opus 4.8 at least
2
0
2
72
rihim
rihim@rihim_s·
@DWestkrew @thesnufki @haider1 not necessarily, it can outperform mythos AND they don’t have to compare it cause it isn’t an accessible model. kinda shady but i can see an argument being made that it’s the best frontier model available
0
0
1
9
Haider.
Haider.@haider1·
openai trying to survive june 23rd:
release gpt-5.6, call it better than Mythos, and get banned immediately
release gpt-5.6 and say Mythos is still better, then watch everyone roast it
35
8
288
15.3K
rihim
rihim@rihim_s·
@ThiccQuidity @Kalshi that this time the government banned it in less than a week after release
0
0
0
9
ThiccQuidity
ThiccQuidity@ThiccQuidity·
@Kalshi they say this with, literally, every single new model. what else is new?
2
0
171
4.7K
Kalshi
Kalshi@Kalshi·
BREAKING: Anthropic says its newest AI model is too powerful to release to the public
510
278
3.8K
546.7K
rihim
rihim@rihim_s·
@_ueaj @AndrewCurran_ damn i didn’t use it out of principle cause of the ML research lora nerf but now i wish i had
1
0
2
60
ueaj
ueaj@_ueaj·
@AndrewCurran_ to think they had that model ~5 months ago (or at least a less post-trained version). For the brief time I used it, and even with the frontier flagging, yeah, it's no normal model.
2
0
36
3.6K
hope hopes hoping
hope hopes hoping@hopes_revenge·
gender gap monogamy relationships where one person has more gender than the other who also has a monogamy fetish ex mormon biomarkers are so normalized especially when bisexuality is in play or theyfab calico critters cottage core sexual neglect is a gradient dissent
21
2
154
6K
Lucas Beyer (bl16)
Lucas Beyer (bl16)@giffmana·
Butthole logo for AI is over, now we have wave-ish logo as the new standard. I don't make the rules.
21
2
203
18.8K
rihim
rihim@rihim_s·
@teortaxesTex yeah 4.6 was quite special, surprised 4.7 is higher than 4.8; 4.7 seems more rl’d to me
0
0
3
204
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Incredible. 4.6 does smell the biggest (least RL-degraded) to me; this looks like a very sensitive eval. Yes, V4-Pro vs V4-Flash do have a roughly 0.1 Opus' worth of gap in perceived size and capability.
kalomaze@kalomaze

i am trying to work on the closest thing possible to a true "big model smell" eval, which is to say: something that measures something that clever post training can't trivially gap, and is cheap + topically diverse. i can't test mythos for obvious reasons, but... hmm...

8
1
137
14.1K
rihim retweeted
Shannon Sands
Shannon Sands@max_paperclips·
There's a much funnier thing ensembles unlock though (if it's consistent). It doesn't matter if it's inefficient, really. if you can get Mythos by throwing 3 or 4 other weaker models in a trenchcoat, you can distill from the ensemble directly. I wonder, how many Qwens + Kimi's + Deepseeks + GLMs do you need to throw in a trenchcoat to get Mythos quality data? Can you stack enough 9b's to reach heaven?
6
7
65
2.3K
rihim
rihim@rihim_s·
@max_paperclips @LokiJulianus @teortaxesTex the only issue is the model providers themselves limiting it lmao; ironically neither of the actual model developers can do it cause they can’t use their model in that way
1
0
0
40
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
> And it landed within 1% of Fable 5 while costing roughly half the price.
I have reservations about how this generalizes, but do you understand that this implies like 20 times as many tokens? In terms of compute, it's not close either (Fable was overpriced). Well, good cope
OpenRouter@OpenRouter

Notably, the budget panel was comparable with Claude Fable 5 in performance. A panel of Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro, fused together, beat solo GPT-5.5 and solo Opus 4.8 outright. And it landed within 1% of Fable 5 while costing roughly half the price.

11
2
102
10.5K
rihim
rihim@rihim_s·
@gfodor speak softly and carry a big stick
0
0
0
377
gfodor.id
gfodor.id@gfodor·
Hilariously it was clear that Anthropic convinced the government their models are far ahead. Meanwhile OpenAI drops far bigger godshogs behind dropdowns and CLI switches and points at their free mini smol models like any competent org should who wants to avoid the ban hammer.
10
6
387
17.5K
rihim
rihim@rihim_s·
@viemccoy i'd assume the main issue is cost, right? given how inefficient some of the current models are with token output it'd probably cost a multitude more than running the good model (ofc it's extraordinary times rn)
1
0
2
318
rihim
rihim@rihim_s·
@mweinbach slightly unrelated but the new ui looks so damn good here, i thought it was a physical glass thing oval at the beginning of the video
0
0
3
163
Max Weinbach
Max Weinbach@mweinbach·
Siri can help you do your expenses
71
105
3.4K
261.5K
stochasm
stochasm@stochasticchasm·
so when does ai2027 say this is supposed to happen
6
5
358
29.9K
rihim
rihim@rihim_s·
@max_spero_ someone who’s a us born citizen at anthropic needs to task mythos on finding a new prime RIGHT NOW
0
0
5
457
Max Spero
Max Spero@max_spero_·
Is Claude Fable 5 the newest illegal number?
7
7
450
13.7K
rihim
rihim@rihim_s·
@teortaxesTex oh do you think scale is going to be an important factor going forward? given their low level systems-ish work i’d argue that while more compute might make relatively better models it wouldn’t necessarily help with the fundamental changes needed for the next step change
1
0
2
421
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Top 3 labs: Anthropic, OpenAI, DeepSeek. Maybe other Chinese labs qualify in taste but not in ambition. Google, xAI aren't worth the GPUs they've bought, so they're renting it out to Anthropic. Close to Amazon tier, actually. Embarrassing skill issue.
38
7
348
21.8K
rihim
rihim@rihim_s·
@threepointone @cursor_ai lowkey same i think fable should be used like 5.5 pro in the sense that it’s not the main agent working on the codebase but the monk in a cave that you turn to when all else fails
0
0
0
84
sunil pai
sunil pai@threepointone·
spent all day on fable for a giant PR. ~10kloc, lots of testing and intervention. 250$. I... don't think it's worth it? happy with 4.8/5.5, and the quality of work is better when it's smaller steps. Still rocking @cursor_ai, that's software that I still love using on the daily.
42
9
600
114.8K
dope-a-meme in SF
dope-a-meme in SF@aannuujX·
@mufasaYC haha its just gemini with a system prompt, memory and different ui layer🥲
8
0
3
1.4K
Mustafa Yusuf
Mustafa Yusuf@mufasaYC·
I did something I never thought I would, I reached out to Siri instead of ChatGPT and honestly it was better and gave me an excellent output with less text which was to the point 🤯
12
11
367
108.8K