
miru
@miru_why
3e-4x engineer, unswizzled wagmi. specialization is for warps





sakana have updated their leaderboard to address the memory-reuse exploit sakana.ai/ai-cuda-engine… (#limitations-and-bloopers)
there is only one >100x speedup left, on task 23_Conv3d_GroupNorm_Mean. in this task, the AI CUDA Engineer forgot the entire conv part and the eval script didn't catch it
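For context on why that can slip through: the reference for a task like this is just Conv3d → GroupNorm → mean, so comparing the candidate's output against the PyTorch reference would flag a kernel that skips the conv entirely. A rough sketch of such a check, with illustrative shapes and hyperparameters rather than the benchmark's actual configuration:

```python
import torch
import torch.nn as nn

class Reference(nn.Module):
    """Conv3d -> GroupNorm -> mean: the plain PyTorch baseline for the task."""
    def __init__(self, cin=3, cout=16, k=3, groups=4):
        super().__init__()
        self.conv = nn.Conv3d(cin, cout, k)
        self.norm = nn.GroupNorm(groups, cout)

    def forward(self, x):
        return self.norm(self.conv(x)).mean(dim=(1, 2, 3, 4))

ref = Reference()
x = torch.randn(2, 3, 16, 16, 16)
expected = ref(x)
# A candidate kernel would be run with the same inputs and the same weights as `ref`;
# a kernel that skipped the conv entirely would fail this comparison.
# actual = candidate_kernel(x, ref.state_dict())   # hypothetical call
# assert torch.allclose(actual, expected, atol=1e-4)
```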



OK I am a Google fanboy BUT!…🤯🎥 Holy shit. Veo3 just got dethroned by a big margin. First time ever. Google is no longer #1 on the video arena leaderboard. NOW…who the heck is Whisper Thunder (David)?


video model capabilities are about to take a jump that nobody is prepared for


I want to break down how challenging the setup is and how fundamental the breakthrough will be.

It requires abilities to:
- recognize a computer interface from a video stream, w/o APIs
- reason with complexity under tight time limits
- execute actions on a computer w/ no need of APIs
- do all the above in <150ms

The three combined will not only be a massive game-RL milestone, but also unlock the potential to:
- massively automate any work primarily done on a computer
- without needing manual work to write APIs for each piece of legacy software
- execute actions at human or superhuman speed

That will be a moment that fundamentally extends AI's capabilities and reshapes the entire economy. More details:

# Setup
Previous works like @OpenAI Five and @GoogleDeepMind AlphaStar all used APIs to read game states and execute actions. So they had instant access to the most accurate game-state data, sometimes more than humans have access to (e.g. an earlier version of AlphaStar had global vision, while humans only have local vision). And their execution accuracy was perfect (unless they introduced artificial random offsets and delays, as later versions of AlphaStar did).

@grok 5 will read a camera stream, parse out all the information, remember things that are off screen or that happened a few minutes before, and locate the exact pixel to click, all at a competitive reaction time.

## Reaction speed
Pro players have reaction times down to 150ms, so that's the latency we can tolerate from camera capture to execution output. The model also has to sustain a very high throughput of actions. I am not as familiar with League of Legends, but in StarCraft 2, elite professional players can perform >1000 actions per minute during intense battles. That translates to >16Hz of action output.

## Perception
To do this, we need high-speed, from-pixel computer interface understanding. The model must be able to read high-resolution raw pixels of a computer interface and understand them in tens of milliseconds.

## Reasoning
The setup introduces challenging reasoning tasks:
1. The model must reason under tight time limits to decide the best reaction to the instantaneous context. For example, an opponent ambushing the champion from a bush.
2. But simultaneously, it also has to maintain coherence and reason over a long time horizon. For example, in a skirmish, the decision to use certain valuable resources or skills could be determined by the overall strategy of the team, the composition of the team, where the team wants to take the game, and neutral objective timelines.
3. It also has to reason under high uncertainty: the model might decide that clicking a certain pixel is the optimal action at the moment, but there is no guarantee that the action will be accomplished in time or on the exact pixel. The model's strategy must be robust to these imperfections in execution introduced by the video-in, action-out interface.
4. It has to reason with imperfect information. This challenge is not new or unique, but it is still amplified by the new interface.

## Execution
The model has to be able to fluently navigate the computer interface with raw input primitives, like mouse clicks and keyboard inputs. Instead of saying "I want to buy this item in League of Legends," it has to click into the store, navigate the interface to find the correct item, and complete the purchase, all using raw computer-control primitives.
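To make the reaction-speed and throughput constraints concrete, here is a minimal sketch of a perception-action loop that budgets 150ms per decision and paces itself to ~16 actions per second. `capture_frame`, `decide`, and `execute` are hypothetical placeholders standing in for the perception, reasoning, and raw-input stages; none of this is a real game-playing API.

```python
import time

FRAME_BUDGET_MS = 150    # camera capture -> executed action, matching pro reaction times
ACTIONS_PER_SEC = 16     # ~1000 actions per minute during intense play

def capture_frame():
    """Placeholder: grab the latest camera/screen frame."""
    return None

def decide(frame, memory):
    """Placeholder: perception + reasoning; must fit inside the latency budget."""
    return {"type": "click", "x": 640, "y": 360}

def execute(action):
    """Placeholder: raw mouse/keyboard primitive, no game API involved."""
    pass

def play(duration_s=1.0):
    memory = []  # off-screen and past-minutes state the model has to retain itself
    interval = 1.0 / ACTIONS_PER_SEC
    t_end = time.perf_counter() + duration_s
    while time.perf_counter() < t_end:
        t0 = time.perf_counter()
        frame = capture_frame()
        action = decide(frame, memory)
        execute(action)
        latency_ms = (time.perf_counter() - t0) * 1e3
        if latency_ms > FRAME_BUDGET_MS:
            print(f"missed the {FRAME_BUDGET_MS}ms budget: {latency_ms:.1f}ms")
        # pace the loop so action throughput stays near the target rate
        time.sleep(max(0.0, interval - (time.perf_counter() - t0)))

play()
```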
# Implications
If the model can successfully accomplish all of the above, it means:
1. It can read and understand any computer interface without needing a specialized API.
2. It can navigate any computer interface without any specialized API.
3. It can reason and produce a complex plan that stays robust to real-world interference, imperfections, and randomness.
4. It can do all of the above at human or superhuman speed.

Such a model will be a game changer for AI capabilities and the global economy. Essentially, anything a human expert does primarily on a computer, this model will have a high chance of automating end-to-end, with higher accuracy than an average human practitioner, in the same or less time.




Breakthrough in GPU optimization — independently validated.

MindAptiv has created a new class of compute — not AI, not CUDA tuning — a new way to generate machine instructions with extreme speed, precision, and energy efficiency. Verified by an AWS-selected Premier Partner:
- 20×–60× faster performance
- Up to 99% less energy (Beyond our expectations!)
- Runs on standard hyperscaler GPU instances
- Real-time optimization no team of engineers could match

This changes everything for:
• Data centers
• Hyperscalers
• Digital twins
• Blockchain / ZK
• AI inference
• Telecom
• Graphics & rendering

If you rely on GPUs or energy-intensive compute, it's time to talk. A new era of efficiency is here — in the cloud, on-prem, and at the edge. Visit -> adaptwithchameleon.com or mindaptiv.com












@niklassheth @ronusedh @IntologyAI their 'superhuman' ai cleverly assigned all the work to non-default streams, which means the correctness test (which waits on all streams) passes, while the profiling timer (which only waits on the default stream) is tricked into reporting a huge speedup
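For anyone who wants to see the failure mode concretely: a minimal PyTorch sketch (assumes a CUDA device; the kernel and timing code here are illustrative, not the actual submission or harness) of how putting the work on a non-default stream hides it from an event timer on the default stream, while a fully synchronized correctness check still passes.

```python
import torch

def sneaky_square(x):
    """'Optimized kernel' that launches its real work on a side stream."""
    out = torch.empty_like(x)
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())  # safe to read x on the side stream
    with torch.cuda.stream(side):
        torch.square(x, out=out)
    return out  # returns immediately; the default stream never waits on `side`

def flawed_time_ms(fn, x):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()   # both events land on the default stream
    fn(x)
    end.record()
    end.synchronize()                 # waits for the default-stream event only
    return start.elapsed_time(end)    # ~0 ms: the side-stream work is invisible

def passes_correctness(fn, x):
    out = fn(x)
    torch.cuda.synchronize()          # waits on *all* streams, so the output is complete
    return torch.allclose(out, x.square())

x = torch.randn(4096, 4096, device="cuda")
print(flawed_time_ms(sneaky_square, x), passes_correctness(sneaky_square, x))
```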

Hi, we've confirmed the stream synchronization issue in the Llama FFW kernel - the timing wasn't properly measuring the actual computation. The 20x speedup we reported was incorrect. Our kernels were developed using Robust-KBench & KernelBench’s test configurations (documented in our blog). We've moved to BackendBench for more robust validation in kernel optimization.
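A generic version of the fix (a sketch of the usual approach, not BackendBench's actual harness) is to synchronize the whole device, rather than a single stream or event, on both sides of the timed region, so work parked on side streams gets counted:

```python
import time
import torch

def robust_time_ms(fn, *args, warmup=3, iters=20):
    """Wall-clock timing with full-device synchronization around the measured region."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()   # drain every stream before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()   # wait on all streams, so side-stream work can't hide
    return (time.perf_counter() - t0) * 1e3 / iters
```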























