devansh

1.4K posts

devansh

@devanshpandey

building aligned general learners. cofounder @si_pbc

San Francisco · Joined March 2020
652 Following · 1.6K Followers
devansh retweeted
sarah guo
sarah guo@saranormous·
watching claude try to use the browser...are websites being adversarial to computer use on purpose? or is CUA still that bad
134 replies · 9 reposts · 393 likes · 106K views
devansh
devansh@devanshpandey·
@tmychow that's like 10m h100 hours - interesting
1 reply · 0 reposts · 2 likes · 535 views
trevor (taylor’s version)
trevor (taylor’s version)@tmychow·
"1/4 of the compute spent on the final model came from the base, the rest is from our training." k2 and k2.5 (a continued pretrain of k2) each used 15T tokens. k2 is 32B active; by 6ND, kimi did 5.76e+24 flops. if that's 1/4, cursor did 1.7e25 flops, i.e. the same flops as gpt-4
Lee Robinson@leerob

Yep, Composer 2 started from an open-source base! We will do full pretraining in the future. Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training. This is why evals are very different. And yes, we are following the license through our inference partner terms.

3 replies · 1 repost · 49 likes · 7.6K views
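trevor's back-of-envelope estimate above, and the "~10m h100 hours" in devansh's reply, can be reproduced with the standard C ≈ 6·N·D compute approximation. The 32B active parameters and 2 × 15T tokens come from the thread; the H100 throughput (~1e15 BF16 FLOP/s peak) and ~40% utilization are my assumptions, not stated by either tweet:

```python
# Back-of-envelope check of the thread's numbers using C ~ 6 * N * D.
N = 32e9        # active parameters of Kimi K2 (from the tweet)
D = 2 * 15e12   # tokens: K2 plus the K2.5 continued pretrain, 15T each

base_flops = 6 * N * D
print(f"base pretraining compute: {base_flops:.2e} FLOPs")    # 5.76e+24

# If the base is only 1/4 of total compute, Cursor's share is the other 3/4.
cursor_flops = 3 * base_flops
print(f"Cursor's training compute: {cursor_flops:.2e} FLOPs")  # 1.73e+25

# Rough H100-hour conversion (assumed: ~1e15 BF16 FLOP/s peak, ~40% MFU).
effective_flops_per_sec = 1e15 * 0.4
h100_hours = cursor_flops / (effective_flops_per_sec * 3600)
print(f"about {h100_hours / 1e6:.0f}M H100-hours")             # ~12M
```

At ~12M H100-hours under these utilization assumptions, the result lands in the same ballpark as the "10m h100 hours" quip, which is as much precision as a 6ND estimate supports.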
devansh retweeted
Brian Lovin
Brian Lovin@brian_lovin·
Everyone makes fun of Eight Sleep until it's 84° in SF and they don't have a liquid-cooled mattress. Suckers.
61 replies · 13 reposts · 746 likes · 344.7K views
Rachel Park
Rachel Park@rachelsupark·
nothing gets you feeling the AI <> real world gap more acutely than filing your taxes
2 replies · 0 reposts · 28 likes · 4.1K views
Aaron Scher
Aaron Scher@aaronscher·
I just realized that an "order of magnitude estimate" is the same as having zero sig figs
1 reply · 0 reposts · 7 likes · 578 views
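Aaron's quip can be made precise: keeping zero significant figures of a number means keeping only its power of ten, i.e. rounding log10(x) to the nearest integer. A minimal sketch (the function name is mine, not from the tweet):

```python
import math

def order_of_magnitude(x: float) -> float:
    """Round x to 'zero sig figs': keep only its nearest power of ten."""
    return 10 ** round(math.log10(x))

print(order_of_magnitude(342))  # 1000  (log10(342) ~ 2.53, rounds to 3)
print(order_of_magnitude(250))  # 100   (log10(250) ~ 2.40, rounds to 2)
```

Note that rounding on the log scale puts the cutoff at sqrt(10) ≈ 3.16 rather than 5, which is why 342 rounds up to 1000 here.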
devansh retweeted
Workshop Labs
Workshop Labs@WorkshopLabs·
Introducing Trellis for Kimi K2 Thinking. It's post-training code that's 50x faster than the best single-node open-source version and 2x cheaper than training APIs. After safety testing, we're open-sourcing it, giving builders the best tools to customize a frontier model. 🧵
[image]
25 replies · 67 reposts · 611 likes · 115.9K views
catherine ʕ•ᴥ•ʔ-☆
catherine ʕ•ᴥ•ʔ-☆@wilhelmscreamin·
just witnessed not only public making out, but straight people publicly making out, at the local cafe/workspace/vintage clothes store/second hand bookshop/music venue. this place is cooked and it’s so over
4 replies · 0 reposts · 41 likes · 4.2K views
devansh retweeted
Deedy
Deedy@deedydas·
So many startups think their engineers are "cracked" but have no idea what that really means. This team of 5 19yr olds built a 30 petabyte storage cluster in SF for ~$500k to get a 40x cheaper AWS S3 as a side quest to store 90M hours of video. Now, that's cracked.
[image]
155 replies · 261 reposts · 5.5K likes · 534K views
devansh retweeted
TBPN
TBPN@tbpn·
Standard Intelligence's @devanshpandey responds to @tszzl's tweet that "text is the universal interface," and explains why their new foundation model is trained on video:

"At some point in the arbitrarily long future, if we only use text models, we could force most things to be text. But I think there are just a lot of things that are much more native when done from a computer-use [perspective]."

"GUIs are designed for humans to use. We have this massive long tail of things on the internet that are entirely undoable by LLMs."

"For example, when I do ML engineering most of my time is spent doing the grunt work of engineering. It's a lot of looking at graphs, analyzing, and comparing loss curves. You can do this in text, but it's a much larger pain than doing it in the native interface."

"There's a reason humans don't interact with a computer purely through text, it would kind of suck."
roon@tszzl

text is the universal interface

8 replies · 9 reposts · 311 likes · 60K views
Neel
Neel@awesome_ruler_·
Extremely impressive work, especially for a v1. I’m curious how v2 would be integrating test-time adaptive computation though. I know you guys were exploring latent recursion briefly (UTs styled). Are you still working on that or are you pivoting to more CoT based discrete approaches for the future?
1 reply · 0 reposts · 8 likes · 453 views
devansh
devansh@devanshpandey·
there is much work to do. in the next few months we need to train 100x larger models on 200x as much data as we have done before, scale RL to have millions of agents learning things from the internet, and use the computer use prior to build the first models that can scalably learn. there's no team i'd rather be doing this with. if you're interested in being a part of SI's next chapter, i'd be very excited to chat.
Standard Intelligence@si_pbc

Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.

18 replies · 13 reposts · 356 likes · 31.8K views
Standard Intelligence
Standard Intelligence@si_pbc·
Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
[GIF]
186 replies · 402 reposts · 3.9K likes · 1.1M views
devansh retweeted
yudhister
yudhister@yudhister_·
we're not adherents to technological determinism and I, at least, do not consider this an unalloyed good. but a competent computer action policy will let us scale & empirically validate intent alignment mechanisms, so I remain cautiously optimistic
1 reply · 2 reposts · 26 likes · 2.4K views
devansh
devansh@devanshpandey·
Yep - OSWorld and similar benchmarks tend to be optimized for LLM harnesses rather than capturing real-world computer use. Also, the model isn't instruct-tuned yet, so there's a lot left to scale! The benchmark we're most excited about long-term is simply our model being actually useful in the real world; we can test that on things like CAD or, e.g., a universal tab model for computers.
1 reply · 0 reposts · 3 likes · 239 views
Marco Mascorro
Marco Mascorro@Mascobot·
This is super super cool. Love the idea of going straight to video instead of VLM + screenshot + a11y tree. Curious if you happened to test it on OSWorld/WebArena, etc.? Those benchmarks might be shallow for this, as you could maybe just go straight to games/CAD and test there.
2 replies · 0 reposts · 25 likes · 2.4K views