Bishamon

1.9K posts

@9thbeer

passive llm research enjoyer, model is the product

part of your inner monologue Joined August 2022
301 Following 663 Followers
Bishamon retweeted
Bishamon
Bishamon@9thbeer·
wish someone would create eng + romaji lyric covers for all anime songs using AI and upload them to YT.
English
0
0
0
31
Bishamon retweeted
Ethan Mollick
Ethan Mollick@emollick·
I would push back a little: because the models are so good & improving, they don't have to be the product. But it is the model that is the prime mover. If they weren't so generally capable, the harnesses & apps the labs build around them would be hard to build and wouldn't work.
Greg Brockman@gdb

the model alone is no longer the product

English
29
11
307
38.3K
tony the math lion 🦁
tony the math lion 🦁@TonyTheMathLion·
Mathematicians: let's invent ways to avoid local coordinates. Et voilà, differential geometry is born.
English
3
2
52
4.2K
Bishamon
Bishamon@9thbeer·
in information theory, the cost of encoding frequent info goes down; the same idea applies to network packet unmarshalling, REST vs gRPC, LLM prefill, and the KV cache.
Goshawk Trades@GoshawkTrades

Jane Street's head of technology just explained the full spectrum of how fast their trading decisions are made.
the fastest systems turn around a packet in under 100 nanoseconds. at that speed, if you attached an oscilloscope to the wire going in and the wire going out, you'd see the response start to leave before the incoming packet has finished arriving.
at that speed, you can't use a CPU. you can't use any programming language. you're on an FPGA direct wired to the network. and the decisions you're making are incredibly simple. because you literally can't compute anything complex in that time.
but here's the part most people miss: that's just one end of the spectrum. Jane Street runs an ensemble of systems operating at every timescale simultaneously. some decisions happen in nanoseconds. some in microseconds. some in milliseconds. some take hours or a full day.
"the right way to build an optimal trading strategy is an ensemble approach. for some decisions you're making very simple decisions very quickly. for others, you're operating at the scale of microseconds, milliseconds. and in some cases, if you can get that decision turned around in an hour, that's totally fine."
the faster you need to respond, the simpler the decision has to be. the slower you can afford to go, the smarter the model can be.
this is why "Jane Street is just a speed game" is wrong. speed is one dimension. intelligence is the other.

English
0
0
1
281
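A minimal sketch of the information-theory remark in the post above, assuming it refers to Shannon's ideal code length of -log2(p) bits for a symbol with probability p. The function name `ideal_code_lengths` and the sample message are hypothetical, chosen only for illustration. The links to packet unmarshalling, gRPC, and KV caching are the post's analogy and are not modeled here.

```python
import math
from collections import Counter


def ideal_code_lengths(message):
    """Return the Shannon ideal code length, in bits, for each symbol.

    A symbol with empirical probability p gets -log2(p) bits,
    so more frequent symbols are cheaper to encode.
    """
    counts = Counter(message)
    total = len(message)
    return {sym: -math.log2(n / total) for sym, n in counts.items()}


# Hypothetical message: 'a' is frequent, 'c' and 'd' are rare.
message = "aaaabbcd"
lengths = ideal_code_lengths(message)

# Average bits per symbol under these ideal lengths (the entropy).
average_bits = sum(lengths[sym] for sym in message) / len(message)

for sym in sorted(lengths):
    print(sym, lengths[sym])
print("average bits per symbol:", average_bits)
```

Here the frequent symbol costs 1 bit and the rare ones cost 3 bits each, which is the sense in which the cost of frequent information goes down.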
Bishamon retweeted
Goodfire
Goodfire@GoodfireAI·
The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)
Goodfire@GoodfireAI

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

English
22
147
997
154.7K
Bishamon
Bishamon@9thbeer·
on-policy distillation is the final_final_v3 form of post-training.
English
1
0
1
94
Bishamon
Bishamon@9thbeer·
@serialdotai on a side note, heads of data science are now VPs of Applied AI.
English
0
0
1
72
serial
serial@serialdotai·
are there still data scientists out there, or is everyone a machine learning engineer?
English
18
0
38
2.8K
Bishamon
Bishamon@9thbeer·
agents using URL-like paths, e.g. /goal, /plan, /goal/subagent, may be an early sign of how fast, tiny agents could directly serve web requests.
English
0
0
1
66
Bishamon retweeted
wh
wh@nrehiew_·
Interestingly, albeit unsurprisingly, normal GRPO does not change the representation of the environment-related tokens which is kinda to be expected given they are usually masked out. ECHO naturally does model the environment better. (world modelling)
wh tweet media
English
3
3
13
1.3K
Bishamon retweeted
Prime Intellect
Prime Intellect@PrimeIntellect·
The next step toward automating AI is automating RL environments.
Introducing General-Agent: a fully synthetic environment whose task corpus self-evolves and grows harder over time.
4,504 tool-use tasks · 1,040 domains · 8,159 unique tools
GIF
English
48
124
1.3K
284.1K
Bishamon
Bishamon@9thbeer·
@vikramskr overall, do all these books have 50% overlapping content?
English
0
0
0
18
Vikram Sekar
Vikram Sekar@vikramskr·
All my EE books. And no, I haven’t read all of them.
Vikram Sekar tweet media
English
25
13
410
17.3K
Bishamon
Bishamon@9thbeer·
DevRel role got upgraded with AI.
Wulfie Bain@wulfie_bain_

Hiring in Bengaluru, India 🇮🇳 for my Startups Applied AI team at @OpenAI. Apply if you want to support the incredible startup ecosystem & shape the future of OpenAI. The team I'm building is already full of ex-founder/CTOs, AI PHDs, MLEs, DSs. We work with frontier startups, and closely with Product & Research. The team works hard, but I can genuinely say we love it. So if you’re obsessed with startups, high agency, & deeply technical - and you like the sound of that team - you should apply or reach out.

English
0
0
0
145
Bishamon retweeted
Rosmine
Rosmine@rosmine·
I fixed why LLMs write so poorly, and I have a demo to prove it.
Announcing Distribution Fine Tuning (DFT): a post-training step that fixes LLM writing.
Model outputs fooled pangram on 100% of test cases.
Rosmine tweet media
English
122
158
3.2K
442.3K
Bishamon retweeted
Oxford Mathematics
Oxford Mathematics@OxUniMaths·
Mathematics is a universal language. Isn't it? The Tower of Babel - Episode 1
English
2
15
82
4.5K
Bishamon
Bishamon@9thbeer·
@BushnaqLucius @GoodfireAI accepted. layernorm as a solution is now downvoted; let me try a different candidate for this arithmetic-by-rotating-shapes behaviour.
English
0
0
0
34
Lucius Bushnaq ⏹️
Lucius Bushnaq ⏹️@BushnaqLucius·
@9thbeer @GoodfireAI Yes, but in a D-dimensional residual stream it'd be constraining them to the (D-1)-dimensional surface of a D-dimensional hypersphere. That's a very meaningful difference for D=3, but for D=10,000 not so much.
English
1
0
1
47