devansh

1.4K posts

devansh

@devanshpandey

building aligned general learners. cofounder @si_pbc

San Francisco · Joined March 2020
652 Following · 1.6K Followers
devansh retweeted
sarah guo
sarah guo@saranormous·
watching claude try to use the browser...are websites being adversarial to computer use on purpose? or is CUA still that bad
134 replies · 9 reposts · 393 likes · 106K views
devansh
devansh@devanshpandey·
@tmychow that's like 10m h100 hours - interesting
1 reply · 0 reposts · 2 likes · 535 views
trevor (taylor’s version)
trevor (taylor’s version)@tmychow·
"1/4 of the compute spent on the final model came from the base, the rest is from our training." k2 and k2.5 (a continued pretrain of k2) each used 15T tokens. k2 is 32B active; by 6ND, kimi did 5.76e+24 flops. if that's 1/4, cursor did 1.7e25 flops, i.e. the same flops as gpt-4
Lee Robinson@leerob

Yep, Composer 2 started from an open-source base! We will do full pretraining in the future. Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training. This is why evals are very different. And yes, we are following the license through our inference partner terms.

3 replies · 1 repost · 49 likes · 7.6K views
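trevor's back-of-envelope estimate above, and the "~10m h100 hours" in devansh's reply, can be reproduced with the standard C ≈ 6·N·D compute approximation. The 32B active parameters and 2 × 15T tokens come from the thread; the H100 throughput (~1e15 BF16 FLOP/s peak) and ~40% utilization are my assumptions, not stated by either tweet:

```python
# Back-of-envelope check of the thread's numbers using C ~ 6 * N * D.
N = 32e9        # active parameters of Kimi K2 (from the tweet)
D = 2 * 15e12   # tokens: K2 plus the K2.5 continued pretrain, 15T each

base_flops = 6 * N * D
print(f"base pretraining compute: {base_flops:.2e} FLOPs")    # 5.76e+24

# If the base is only 1/4 of total compute, Cursor's share is the other 3/4.
cursor_flops = 3 * base_flops
print(f"Cursor's training compute: {cursor_flops:.2e} FLOPs")  # 1.73e+25

# Rough H100-hour conversion (assumed: ~1e15 BF16 FLOP/s peak, ~40% MFU).
effective_flops_per_sec = 1e15 * 0.4
h100_hours = cursor_flops / (effective_flops_per_sec * 3600)
print(f"about {h100_hours / 1e6:.0f}M H100-hours")             # ~12M
```

At ~12M H100-hours under these utilization assumptions, the result lands in the same ballpark as the "10m h100 hours" quip, which is as much precision as a 6ND estimate supports.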
devansh retweeted
Brian Lovin
Brian Lovin@brian_lovin·
Everyone makes fun of Eight Sleep until it's 84° in SF and they don't have a liquid-cooled mattress. Suckers.
61 replies · 13 reposts · 746 likes · 344.7K views
Rachel Park
Rachel Park@rachelsupark·
nothing gets you feeling the AI <> real world gap more acutely than filing your taxes
2 replies · 0 reposts · 28 likes · 4.1K views
Aaron Scher
Aaron Scher@aaronscher·
I just realized that an "order of magnitude estimate" is the same as having zero sig figs
1 reply · 0 reposts · 7 likes · 578 views
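Aaron's quip can be made precise: keeping zero significant figures of a number means keeping only its power of ten, i.e. rounding log10(x) to the nearest integer. A minimal sketch (the function name is mine, not from the tweet):

```python
import math

def order_of_magnitude(x: float) -> float:
    """Round x to 'zero sig figs': keep only its nearest power of ten."""
    return 10 ** round(math.log10(x))

print(order_of_magnitude(342))  # 1000  (log10(342) ~ 2.53, rounds to 3)
print(order_of_magnitude(250))  # 100   (log10(250) ~ 2.40, rounds to 2)
```

Note that rounding on the log scale puts the cutoff at sqrt(10) ≈ 3.16 rather than 5, which is why 342 rounds up to 1000 here.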
devansh retweeted
Workshop Labs
Workshop Labs@WorkshopLabs·
Introducing Trellis for Kimi K2 Thinking. It's post-training code that's 50x faster than the best single-node open-source version and 2x cheaper than training APIs. After safety testing, we're open-sourcing it, giving builders the best tools to customize a frontier model. 🧵
[image]
25 replies · 67 reposts · 611 likes · 115.9K views
catherine ʕ•ᴥ•ʔ-☆
catherine ʕ•ᴥ•ʔ-☆@wilhelmscreamin·
just witnessed not only public making out, but straight people publicly making out, at the local cafe/workspace/vintage clothes store/second hand bookshop/music venue. this place is cooked and it’s so over
4 replies · 0 reposts · 41 likes · 4.2K views
devansh retweeted
Deedy
Deedy@deedydas·
So many startups think their engineers are "cracked" but have no idea what that really means. This team of 5 19yr olds built a 30 petabyte storage cluster in SF for ~$500k to get a 40x cheaper AWS S3 as a side quest to store 90M hours of video. Now, that's cracked.
[image]
155 replies · 261 reposts · 5.5K likes · 534K views
devansh retweeted
TBPN
TBPN@tbpn·
Standard Intelligence's @devanshpandey responds to @tszzl's tweet that "text is the universal interface," and explains why their new foundation model is trained on video:

"At some point in the arbitrarily long future, if we only use text models, we could force most things to be text. But I think there are just a lot of things that are much more native when done from a computer-use [perspective]."

"GUIs are designed for humans to use. We have this massive long tail of things on the internet that are entirely undoable by LLMs."

"For example, when I do ML engineering most of my time is spent doing the grunt work of engineering. It's a lot of looking at graphs, analyzing, and comparing loss curves. You can do this in text, but it's a much larger pain than doing it in the native interface."

"There's a reason humans don't interact with a computer purely through text, it would kind of suck."
roon@tszzl

text is the universal interface

8 replies · 9 reposts · 311 likes · 60K views
Neel
Neel@awesome_ruler_·
Extremely impressive work, especially for a v1. I’m curious how v2 would be integrating test-time adaptive computation though. I know you guys were exploring latent recursion briefly (UTs styled). Are you still working on that or are you pivoting to more CoT based discrete approaches for the future?
1 reply · 0 reposts · 8 likes · 453 views
devansh
devansh@devanshpandey·
there is much work to do. in the next few months we need to train 100x larger models on 200x as much data as we have done before, scale RL to have millions of agents learning things from the internet, and use the computer use prior to build the first models that can scalably learn. there's no team i'd rather be doing this with. if you're interested in being a part of SI's next chapter, i'd be very excited to chat.
Standard Intelligence@si_pbc

Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.

18 replies · 13 reposts · 356 likes · 31.8K views
Standard Intelligence
Standard Intelligence@si_pbc·
Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.
[GIF]
186 replies · 402 reposts · 3.9K likes · 1.1M views
devansh retweeted
yudhister
yudhister@yudhister_·
we're not adherents to technological determinism and I, at least, do not consider this an unalloyed good. but a competent computer action policy will let us scale & empirically validate intent alignment mechanisms, so I remain cautiously optimistic
1 reply · 2 reposts · 26 likes · 2.4K views
devansh
devansh@devanshpandey·
Yep - OSWorld and similar benchmarks tend to be optimized for LLM harnesses rather than capturing real-world computer use. Also, the model isn't instruct-tuned yet, so there's a lot left to scale! The benchmark we're most excited about long-term is simply our model being actually useful in the real world; we can test that on things like CAD or, e.g., a universal tab model for computers.
1 reply · 0 reposts · 3 likes · 239 views
Marco Mascorro
Marco Mascorro@Mascobot·
This is super super cool. Love the idea of going straight to video instead of VLM + screenshot + a11y tree. Curious if you happened to test it on OSWorld/WebArena, etc.? Those benchmarks might be shallow for this, as you could maybe just go straight to games/CAD and test there.
2 replies · 0 reposts · 25 likes · 2.4K views