Dev

448 posts

@DevvMandal

nerd | co-founder at @markov__ai | ex @SarvamAI | IITM ‘27

blr · Joined January 2024
704 Following · 5.2K Followers
Pinned Tweet
Dev (@DevvMandal)
Today we're launching the most advanced computer-use dataset in the world. 1,000+ hours of screen recordings along with mouse/keyboard inputs + annotations. Sourced from experts across coding, design, browser-use, research and more. Link in the comments :) @markov__ai
Dev (@DevvMandal)

Today, we're launching the world's largest open-source dataset of computer-use recordings. 10,000+ hours across Salesforce, Blender, Photoshop and more, to automate the next level of white-collar work. Link in the comments :) @markov__ai

27
41
374
45.3K
LocalHost India (@localhostIND)
Finding the right soundtrack for a scene is still weirdly hard. Adithya is trying to fix that. Introducing taan, it watches your video and generates music around how the scene actually feels.
22
24
175
18.5K
sahil (@_sahildhull)
the input interface has been the same for decades. with ai, software can now reason and act on your behalf but the interface is the bottleneck. why do i have to check sushi on 10 restaurants across 3 apps? why can't i just do it with a flick of a finger?! the world's about to get a new interface @agi_interfaces
93
75
267
76.7K
Dev retweeted
Corbin Rosset (@corby_rosset)
How do you tell if a computer use agent actually succeeded? It’s really two questions: did it execute well (process), and did the user actually get what they asked for (outcome)? Introducing the Universal Verifier 🧵
3
14
31
3K
Alok Bishoyi (@alokbishoyi97)
for those of you who are autoresearch pilled, or have been meaning to get into autoresearch but don't know how - I shipped evo today - an open-source Claude Code plugin that optimizes code through experiments

you hand it a codebase. it finds a benchmark, runs the baseline, then fires off parallel agents to try to beat it. kept if better, discarded if worse.

inspired by @karpathy's autoresearch, but with structure on top:
- tree search over greedy hill-climb: multiple forks from any committed node
- N parallel agents in git worktrees
- shared failure traces so agents don't repeat each other's mistakes
- regression gates
49
71
1.4K
179.8K
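
The "tree search over greedy hill-climb" idea in the post above can be sketched roughly as follows. This is not evo's code: `Node`, `tree_search`, `benchmark`, and `propose` are hypothetical names, the candidates are plain numbers rather than codebases, and the parallel agents, git worktrees, failure traces, and regression gates are left out. The only point is that every kept node stays forkable, instead of only the current best.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    candidate: float                      # stand-in for a committed code state
    score: float                          # benchmark result, higher is better
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def tree_search(
    baseline: float,
    benchmark: Callable[[float], float],
    propose: Callable[[float], List[float]],
    rounds: int = 5,
    forks_per_round: int = 3,
) -> Node:
    """Fork from several committed nodes per round, keep only improvements."""
    root = Node(baseline, benchmark(baseline))
    committed = [root]
    for _ in range(rounds):
        # Greedy hill-climbing would fork only the single best node.
        # Here the best few committed nodes are all forked.
        frontier = sorted(committed, key=lambda n: n.score, reverse=True)
        for parent in frontier[:forks_per_round]:
            for cand in propose(parent.candidate):
                score = benchmark(cand)
                if score > parent.score:          # kept if better
                    child = Node(cand, score, parent)
                    parent.children.append(child)
                    committed.append(child)
                # otherwise the attempt is discarded
    return max(committed, key=lambda n: n.score)


if __name__ == "__main__":
    # Toy problem: move a number toward 3.
    best = tree_search(
        baseline=0,
        benchmark=lambda x: -(x - 3) ** 2,
        propose=lambda x: [x - 1, x + 1, x + 0.5],
    )
    print(best.candidate, best.score)
```

In a real setup each `propose` call would be an agent editing a worktree, and `benchmark` would run the project's own benchmark.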
Goonal (@KunalKSavita)
dropping soon
5
0
24
785
Dev (@DevvMandal)
@mike64_t this is one of the coolest things i've ever come across
0
0
2
105
mike64_t (@mike64_t)
I think I can finally report some success training a quite accurate IDM capable of recovering keystrokes from Minecraft gameplay, even in quite PvP-heavy situations. At this point the model does not only know what keys are pressed to the extent reasonably discernible, it also knows how fast it is moving in 3D space at all times, even when knockback is mixing with the self-move impulse.

Now, recovering keystrokes from normal external capture footage is just about impossible. E.g. W/A/S/D does exactly nothing during partial tick frames and jumping mid-air is equally useless, so asking the model to recover key down states is inherently unreasonable. Mouse deltas are also completely arbitrary units, as game mouse sensitivity introduces an arbitrary scale factor into the equation. The only good option is to think carefully about your model-environment contract, and only record "logical actions", not raw keystrokes.

So here's a few unfortunate lessons I had to learn in roughly this order.
- Choose good units. (bad: mouse deltas, good: delta radians [yes, you will need game-internal state])
- Capture from inside the main game loop and read the game fbo to get consistent frame-action pairing. Doing post-mortem pairing is hopeless.
- Carefully define when you think keystrokes actually have an effect. (jump only works on ground, when flying or in water etc.) More subtle: The key may already be down, but no tick has happened yet to actually use the value. Hence: ignore
- Separate gamestate into "fast and slow-moving" components. E.g. movement is likely tick based, camera rotation is very likely updated every frame in essentially every game ever.
- Think about your frame-action correspondence contract (How old is the frame in relation to the inputs you capture? Will double or triple buffering affect you?) Think about the game loop timeline, where you are sampling, how old the data you are reading is, and where the ticks are happening around you.

Language models used to simply not have a model-environment contract, but even now with the model "living" in a designated harness, the contract still boils down to formatting, and tool implementation intrinsics. While also important, it is still quite a bit more obvious because the violations are in some way shape or form reflected as text you can actually see.
- ffmpeg dropping frames cumulatively screws the model the further you get into the sequence because your targets are now shifted. If you can't encode the video in real-time, too bad.
- Sodium has a frames in flight system different from vanilla Minecraft, which will also offset your targets from your frames. (there goes that data...)
- Models are susceptible to latency. If there is too big of a delay between action and on-screen reflection, your performance degrades.

At this point I realize ~100 hours of gameplay is essentially no longer usable as a dataset. You can train on this data, but all you'll get is a mushy mess.

However, some good news:
- Making the model predict physics gamestate scalars helps the model generalize. For instantaneous events like jump, it's unreasonable to ask the model to emit a short burst of jump=true at exactly the right time, however if you also predict your current y-velocity, the model has supervision signal for the "latent" from which that onground jump becomes apparent. Recovering x/z motion is also somewhat easier than unmixing it into plausible keystrokes for inertia-heavy player controller logic.
- Regressing physics gamestate scalars also seems to make your dataset "bigger". While pure keystroke classification will overfit quickly, predicting exact physics gamestate scalars forces the model to generalize more and you can tolerate far more epochs before validation loss starts to stall out. This is the only reason why it was bearable to dump 100h+ of dataset hours and replace it with ~3 hours of gameplay after the 4th revision of the file format (yeah...) and somehow still have better performance.

Now, you might be asking, "isn't this brittle?" and the answer is yesn't. Frame-action correspondence matters for training, but not so much during inference. So as long as you are sampling in roughly the same interval as your training data, you aren't violating any hard contract per se. Somewhere around the frames ticks are happening, and during training you capture various tick-capture offset relations per random chance, so nothing is too obviously wrong here. HOWEVER, you will get screwed by gui scale, shaders, resource packs, "shit that recording is 1920x1040 because somebody doesn't know fullscreen exists" and other unfortunate edge cases of reality. But I suppose this is the role of dataset size.

If all those "contract violations" that a YouTube video has compared to the training data are addressed, I think this is a way to turn YouTube into a labeled dataset. I could never shake the feeling that VPT is a sound idea in practice, while never having been properly executed, and I think one reason why it hasn't is because that label bootstrapping part is just a pain in the butt to get right. Now, what the player is doing is of course not the only label you can extract from video, but it has to be one of the targets predicted during pretraining to "align" the pretraining objective.

Some notes on the video here, the colored dots on the analog visualizer are the ground truth, while the gray dot is the model prediction. Green means correct prediction, red means incorrect prediction at that frame. Model P(key) reports how wrong the prediction is from green (0.0) to red (1.0). You will also notice that during periods of rapid slow down, left and right actions become close to irrecoverable, because there is just that little motion. And some jump actions are not predicted correctly because I got the detection condition for jump events wrong... (duh)

LMB/RMB for other than sustained events (like item-consume and block break) also seem to be hopelessly irrecoverable for now. Swing was supposed to do the same thing as motion y did for jump, but it's too well behaved as an increasing counter. Maybe partial-tick interpolated values work better (v5 file format then... ugh..)
17
15
268
18.7K
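
Two of the points in the thread above (arbitrary mouse units, and dropped frames shifting the targets) can be illustrated with a small sketch. Nothing here comes from the author's pipeline: `counts_to_radians`, `index_pairing_drift`, and the `degrees_per_count` scale are hypothetical, and the real scale would have to come from game-internal state as the thread says.

```python
import math
from typing import List


def counts_to_radians(dx_counts: float, degrees_per_count: float) -> float:
    """Convert a raw mouse delta into a camera rotation in radians.

    degrees_per_count is an assumed per-recording scale that depends on
    the player's sensitivity setting; it is not recoverable from video.
    """
    return math.radians(dx_counts * degrees_per_count)


def index_pairing_drift(kept_frame_ids: List[int]) -> List[int]:
    """How far off the label is if encoded frame i is paired with action i.

    kept_frame_ids holds the original capture index of every frame that
    survived encoding. Each dropped frame adds one step of offset to all
    later frames, so the error accumulates over the sequence.
    """
    return [orig - i for i, orig in enumerate(kept_frame_ids)]


if __name__ == "__main__":
    # The same quarter turn logged at two different sensitivities:
    # the raw deltas differ by 2x, the rotation in radians does not.
    low_sens = counts_to_radians(600, 0.15)
    high_sens = counts_to_radians(300, 0.30)
    print(low_sens, high_sens)

    # Capture frames 2 and 5 were dropped by the encoder.
    print(index_pairing_drift([0, 1, 3, 4, 6]))
```

Storing the original capture index (or a timestamp) with every encoded frame is one way to pair by identity instead of by position, which is what makes the drift detectable at all.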
Dev (@DevvMandal)
@KunalKSavita Don toliver next to claude code is diabolical
English
0
0
1
120
Goonal (@KunalKSavita)
over the last few days, i’ve been working on something really cool. dropping tomorrow :)
6
0
25
978
Dev (@DevvMandal)
computer-use is the natural first application of training ai from video

first - there's huge economic value in automating enterprise workflows
second - large scale demonstrations already exist on the internet
third - the constraints are favourable for RL
- action space is only mouse and keyboard
- whole state is captured on the screen
- feedback loops are tight, any action you/the agent makes, there's an immediate change in the ui
- (mostly) verifiable rewards

we have so much cooking at @markov__ai :)
Dev (@DevvMandal)

Today, we're launching the world's largest open-source dataset of computer-use recordings. 10,000+ hours across Salesforce, Blender, Photoshop and more, to automate the next level of white-collar work. Link in the comments :) @markov__ai

5
8
81
8.2K
Tushar Goswamy (@TusharGoswamy)
The last 2 months have been exciting for AI developments and for my personal journey of contributing to @SarvamAI's mission to build sovereign AI for India. Sharing some interesting updates below: (1/6)
7
3
96
4.8K
Ameya (@Finstor85)
@DevvMandal @markov__ai Great work Dev!! This is phenomenal. I have some questions. What's best way to reach out?
1
0
1
957
Dev (@DevvMandal)
Today, we're launching the world's largest open-source dataset of computer-use recordings. 10,000+ hours across Salesforce, Blender, Photoshop and more, to automate the next level of white-collar work. Link in the comments :) @markov__ai
91
198
1.8K
457.5K
Abhishay (@abhishaygg)
hi tech twitter, i'm abhishay - and this is the story of how i accidentally found my way into the tech world.
48
4
244
9.1K