Wojtek (WojTech) Trocki

856 posts


@typeapi

Facilitating seamless communication between software and people. #AI #CLOUD #API #DB #Security #Tooling

Poland · Joined August 2012
455 Following · 497 Followers
Wojtek (WojTech) Trocki retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki; I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents, and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I use directly (in a web ui) but more often hand off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually; it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
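The "small and naive search engine over the wiki" mentioned above can be surprisingly little code. Below is a minimal sketch in plain Python (standard library only); the wiki-directory layout and the length-normalized keyword scoring are assumptions for illustration, not Karpathy's actual tool:

#!/usr/bin/env python3
"""Toy keyword search over a directory of .md wiki files (illustrative only)."""
import re
import sys
from collections import Counter
from pathlib import Path

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def search(wiki_dir: str, query: str, top_k: int = 5) -> list[tuple[float, Path]]:
    query_terms = set(tokenize(query))
    hits = []
    for path in Path(wiki_dir).rglob("*.md"):
        terms = Counter(tokenize(path.read_text(errors="ignore")))
        length = sum(terms.values()) or 1
        # Score = how often query terms appear, normalized by document length.
        score = sum(terms[t] for t in query_terms) / length
        if score > 0:
            hits.append((score, path))
    return sorted(hits, key=lambda h: h[0], reverse=True)[:top_k]

if __name__ == "__main__":
    for score, path in search(sys.argv[1], " ".join(sys.argv[2:])):
        print(f"{score:.4f}  {path}")

Handing something like this to the agent as a CLI tool then just means letting it run, e.g., python wiki_search.py wiki/ "attention sinks" and read the ranked file paths back.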
2.8K
7K
57.9K
20.8M
Wojtek (WojTech) Trocki retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :)

I'm definitely a bit sus'd to run OpenClaw specifically - giving my private data/keys to a 400K-line vibe-coded monster that is being actively attacked at scale is not very appealing at all. Already seeing reports of exposed instances, RCE vulnerabilities, supply chain poisoning, malicious or compromised skills in the registry; it feels like a complete wild west and a security nightmare.

But I do love the concept and I think that just like LLM agents were a new layer on top of LLMs, Claws are now a new layer on top of LLM agents, taking the orchestration, scheduling, context, tool calls and a kind of persistence to the next level.

Looking around, and given that the high level idea is clear, there are a lot of smaller Claws starting to pop out. For example, on a quick skim NanoClaw looks really interesting in that the core engine is ~4000 lines of code (fits into both my head and that of AI agents, so it feels manageable, auditable, flexible, etc.) and runs everything in containers by default.

I also love their approach to configurability - it's not done via config files, it's done via skills! For example, /add-telegram instructs your AI agent how to modify the actual code to integrate Telegram. I haven't come across this before and it slightly blew my mind earlier today as a new, AI-enabled approach to preventing config mess and if-then-else monsters. Basically, the implied new meta is to write the most maximally forkable repo and then have skills that fork it into any desired, more exotic configuration. Very cool.

Anyway there are many others - e.g. nanobot, zeroclaw, ironclaw, picoclaw (lol @ prefixes). There are also cloud-hosted alternatives, but tbh I don't love these because they feel much harder to tinker with. In particular, a local setup allows easy connection to home automation gadgets on the local network. And I don't know, there is something aesthetically pleasing about there being a physical device 'possessed' by a little ghost of a personal digital house elf.

Not 100% sure what my setup ends up looking like just yet, but Claws are an awesome, exciting new layer of the AI stack.
1K
1.3K
17.5K
3.4M
Wojtek (WojTech) Trocki retweeted
Dev Ittycheria
Dev Ittycheria@dittycheria·
A hill I’m willing to die on. When it comes to leadership, vulnerability is a strength, not a weakness. Most leaders carry real insecurities or imposter syndrome. That shows up in subtle but damaging ways. They avoid admitting mistakes, resist asking for help, and stick with decisions long after it is obvious they were wrong, all because they fear looking weak or losing authority. In reality, the opposite is true. Nothing builds credibility faster than a leader who can say they were wrong or that they need help. It sends a clear signal that truth matters more than ego and that learning matters more than the perception that you have all the answers. That signal changes behavior. People stop posturing and start being honest. Problems surface earlier. Debate improves. Accountability becomes real instead of performative. Over time, that kind of environment compounds. The organization gets sharper, faster, and more resilient because the truth stops being negotiated and starts driving decisions.
5
5
67
4.4K
Wojtek (WojTech) Trocki retweeted
Dev Ittycheria
Dev Ittycheria@dittycheria·
A masterclass in how innovation democratizes advanced AI and changes the world. By allowing sophisticated models to run efficiently on low cost hardware, these clever engineering techniques move advanced intelligence from exclusive high end systems to everyday devices & people.
Ming@tslaming

BREAKING 🚨 TESLA HAS PATENTED A "MATHEMATICAL CHEAT CODE" THAT FORCES CHEAP 8-BIT CHIPS TO RUN ELITE 32-BIT AI MODELS AND REWRITES THE RULES OF SILICON 🐳

How does a Tesla remember a stop sign it hasn't seen for 30 seconds, or a humanoid robot maintain perfect balance while carrying a heavy, shifting box? It comes down to Rotary Positional Encoding (RoPE)—the "GPS of the mind" that allows AI to understand its place in space and time by assigning a unique rotational angle to every piece of data.

Usually, this math is a hardware killer. To keep these angles from "drifting" into chaos, you need power-hungry, high-heat 32-bit processors (chips that calculate with extreme decimal-point precision). But Tesla has engineered a way to cheat the laws of physics.

Freshly revealed in patent US20260017019A1, Tesla's "MIXED-PRECISION BRIDGE" is a mathematical translator that allows inexpensive, power-sipping 8-bit hardware (which usually handles only simple, rounded numbers) to perform elite 32-bit rotations without dropping a single coordinate. This breakthrough is the secret "Silicon Bridge" that gives Optimus and FSD high-end intelligence without sacrificing a mile of range or melting their internal circuits. It effectively turns Tesla's efficient "budget" hardware into a high-fidelity supercomputer on wheels.

📉 The problem: the high cost of precision
In the world of self-driving cars and humanoid robots, we are constantly fighting a war between precision and power. Modern AI models like Transformers rely on RoPE to help the AI understand where objects are in a sequence or a 3D space. The catch is that these trigonometric functions (sines and cosines) usually require 32-bit floating-point math—imagine trying to calculate a flight path using 10 decimal places of accuracy. If you try to cram that into the standard 8-bit multipliers (INT8) used for speed (which is like rounding everything to the nearest whole number), the errors pile up fast. The car effectively goes blind to fine details. For a robot like Optimus, a tiny math error means losing its balance or miscalculating the distance to a fragile object. To bridge this gap without simply adding more expensive chips, Tesla had to fundamentally rethink how data travels through the silicon.

🛠️ Tesla's solution: the logarithmic shortcut & pre-computation
Tesla's engineers realized they didn't need to force the whole pipeline to be high-precision. Instead, they designed the Mixed-Precision Bridge. They take the crucial angles used for positioning and convert them into logarithms. Because the "dynamic range" of a logarithm is much smaller than the original number, it's much easier to move that data through narrow 8-bit hardware without losing the "soul" of the information. It's a bit like dehydrating food for transport; it takes up less space and is easier to handle, but you can perfectly reconstitute it later. Crucially, the patent reveals that the system doesn't calculate these logarithms on the fly every time. Instead, it retrieves pre-computed logarithmic values from a specialized "cheat sheet" (look-up storage) to save cycles. By keeping the data in this "dehydrated" log-state, Tesla ensures that the precision doesn't "leak out" during the journey from the memory chips to the actual compute cores. However, keeping data in a log-state is only half the battle; the chip eventually needs to understand the real numbers again.
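As a rough illustration of the "dehydrate, then reconstitute" idea described above, here is a toy sketch: pre-compute logarithms of a table of angles, squeeze them into 8-bit codes for transport, and recover the values at the other end. This is only one reading of the technique, written in ordinary NumPy, not code from the patent; the table size, value range, and uniform quantization are all assumptions:

import numpy as np

# Hypothetical "cheat sheet": pre-computed natural logs for a fixed table of angles.
angles = np.linspace(1e-3, np.pi, 256)
log_table = np.log(angles)                      # computed once, stored as a look-up table

# "Dehydrate": squeeze the log values into 8-bit codes for transport over narrow paths.
lo, hi = log_table.min(), log_table.max()
scale = (hi - lo) / 255.0
codes = np.round((log_table - lo) / scale)      # integers in [0, 255]
log_q = (codes - 128).astype(np.int8)           # signed 8-bit representation

# "Reconstitute": dequantize and exponentiate back to the original angles.
log_deq = (log_q.astype(np.float32) + 128) * scale + lo
angles_rec = np.exp(log_deq)

rel_err = np.abs(angles_rec - angles) / angles
print(f"max relative error after 8-bit log transport: {rel_err.max():.3%}")

Because the logarithm spans a much narrower numeric range than the raw values, the 8-bit grid loses relatively little; the error lands on the order of a percent in this crude setup, and a real implementation would tune the range and spacing far more carefully.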
🏗️ The recovery architecture: rotation matrices & Horner's method
When the 8-bit multiplier (the Multiplier-Accumulator or MAC) finishes its job, the data is still in a "dehydrated" logarithmic state. To bring it back to a real angle theta without a massive computational cost, Tesla's high-precision ALU uses a Taylor-series expansion optimized via Horner's Method. This is a classic computer science trick where a complex equation (like an exponent) is broken down into a simple chain of multiplications and additions. By running this in three specific stages—multiplying by constants like 1/3 and 1/2 at each step—Tesla can approximate the exact value of an angle with 32-bit accuracy while using a fraction of the clock cycles. Once the angle is recovered, the high-precision logic generates a Rotation Matrix (a grid of sine and cosine values) that locks the data points into their correct 3D coordinates. This computational efficiency is impressive, but Tesla didn't stop at just calculating faster; they also found a way to double the "highway speed" of the data itself.

🧩 The data concatenation: 8-bit inputs to 16-bit outputs
One of the most clever hardware "hacks" detailed in the patent is how Tesla manages to move 16-bit precision through an 8-bit bus. They use the MAC as a high-speed interleaver—effectively a "traffic cop" that merges two lanes of data. It takes two 8-bit values (say, an X-coordinate and the first half of a logarithm) and multiplies one of them by a power of two to "left-shift" it. This effectively glues them together into a single 16-bit word in the output register, allowing the low-precision domain to act as a high-speed packer for the high-precision ALU to "unpack". This trick effectively doubles the bandwidth of the existing wiring on the chip without requiring a physical hardware redesign. With this high-speed data highway in place, the system can finally tackle one of the biggest challenges in autonomous AI: object permanence.

🧠 Long-context memory: remembering the stop sign
The ultimate goal of this high-precision math is to solve the "forgetting" problem. In previous versions of FSD, a car might see a stop sign, but if a truck blocked its view for 5 seconds, it might "forget" the sign existed. Tesla uses a "long-context" window, allowing the AI to look back at data from 30 seconds ago or more. However, as the "distance" in time increases, standard positional math usually drifts. Tesla's mixed-precision pipeline fixes this by maintaining high positional resolution, ensuring the AI knows exactly where that occluded stop sign is even after a long period of movement. The RoPE rotations are so precise that the sign stays "pinned" to its 3D coordinate in the car's mental map. But remembering 30 seconds of high-fidelity video creates a massive storage bottleneck.

⚡ KV-cache optimization & paged attention: scaling memory
To make these 30-second memories usable in real-time without running out of RAM, Tesla optimizes the KV-cache (Key-Value Cache)—the AI's "working memory" scratchpad. Tesla's hardware handles this by storing the logarithm of the positions directly in the cache. This reduces the memory footprint by 50% or more, allowing Tesla to store twice as much "history" (up to 128k tokens) in the same amount of RAM. Furthermore, Tesla utilizes Paged Attention—a trick borrowed from operating systems. Instead of reserving one massive, continuous block of memory (which is inefficient), it breaks memory into small "pages".
This allows the AI5 chip to dynamically allocate space only where it's needed, drastically increasing the number of objects (pedestrians, cars, signs) the car can track simultaneously without the system lagging. Yet, even with infinite storage efficiency, the AI's attention mechanism has a flaw: it tends to crash when pushed beyond its training limits.

🔒 Pipeline integrity: the "read-only" safety lock
A subtle but critical detail in the patent is how Tesla protects this data. Once the transformed coordinates are generated, they are stored in a specific location that is read-accessible to downstream components but not write-accessible by them. Furthermore, the high-precision ALU itself cannot read back from this location. This one-way "airlock" prevents the system from accidentally overwriting its own past memories or creating feedback loops that could cause the AI to hallucinate. It ensures that the "truth" of the car's position flows in only one direction: forward, toward the decision-making engine.

🌀 Attention sinks: preventing memory overflow
Even with a lean KV-cache, a robot operating for hours can't remember everything forever. Tesla manages this using Attention Sink tokens. Transformers tend to dump "excess" attention math onto the very first tokens of a sequence, so if Tesla simply used a "sliding window" that deleted old memories, the AI would lose these "sink" tokens and its brain would effectively crash. Tesla's hardware is designed to "pin" these attention sinks permanently in the KV-cache. By keeping these mathematical anchors stable while the rest of the memory window slides forward, Tesla prevents the robot's neural network from destabilizing during long, multi-hour work shifts. While attention sinks stabilize the "memory", the "compute" side has its own inefficiencies—specifically, wasting power on empty space.

🌫️ Sparse tensors: cutting the compute fat
Tesla's custom silicon doesn't just cheat with precision; it cheats with volume. In the real world, most of what a car or robot sees is "empty" space (like clear sky). In AI math, these are represented as "zeros" in a Sparse Tensor (a data structure that ignores empty space). Standard chips waste power multiplying all those zeros, but Tesla's newest architecture incorporates Native Sparse Acceleration. The hardware uses a "coordinate-based" system where it only stores the non-zero values and their specific locations. The chip can then skip the "dead space" entirely and focus only on the data that matters—the actual cars and obstacles. This hardware-level sparsity support effectively doubles the throughput of the AI5 chip while significantly lowering the energy consumed per operation.

🔊 The audio edge: Log-Sum-Exp for sirens
Tesla's "Silicon Bridge" isn't just for vision—it's also why your Tesla is becoming a world-class listener. To navigate safely, an autonomous vehicle needs to identify emergency sirens and the sound of nearby collisions using a Log-Mel Spectrogram approach (a visual "heat map" of sound frequencies). The patent details a specific Log-Sum-Exp (LSE) approximation technique to handle this. By staying in the logarithm domain, the system can handle the massive "dynamic range" of sound—from a faint hum to a piercing fire truck—using only 8-bit hardware without "clipping" the loud sounds or losing the quiet ones. This allows the car to "hear" and categorize environmental sounds with 32-bit clarity.
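For readers unfamiliar with the Log-Sum-Exp trick the audio section leans on, the core idea is combining values of wildly different magnitude while staying in the log domain, so nothing overflows or vanishes in a narrow number format. A small sketch in NumPy (the patent's fixed-point approximation will differ; this is just the standard max-factoring form):

import numpy as np

def log_sum_exp(log_vals: np.ndarray) -> float:
    """Compute log(sum(exp(x_i))) stably by factoring out the maximum."""
    m = np.max(log_vals)
    return float(m + np.log(np.sum(np.exp(log_vals - m))))

# Sound energies spanning a huge dynamic range: a faint hum vs. a loud siren.
energies = np.array([1e-6, 3e-5, 0.8, 120.0])
log_e = np.log(energies)

# Naive sum in the linear domain vs. combining entirely in the log domain.
print(np.log(np.sum(energies)))   # fine in float64, but overflows narrow formats
print(log_sum_exp(log_e))         # same result, never leaves the log domain

Both prints give the same value (about log 120.8); the second never has to represent the raw energies themselves, which is what makes the trick friendly to low-precision hardware.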
Of course, all this high-tech hardware is only as good as the brain that runs on it, which is why Tesla's training process is just as specialized.

🎓 Quantization-aware training: pre-adapting the brain
Finally, to make sure this "Mixed-Precision Bridge" works flawlessly, Tesla uses Quantization-Aware Training (QAT). Instead of training the AI in a perfect 32-bit world and then "shrinking" it later—which typically causes the AI to become "drunk" and inaccurate—Tesla trains the model from day one to expect 8-bit limitations. They simulate the rounding errors and "noise" of the hardware during the training phase, creating a neural network that is "pre-hardened". It's like a pilot training in a flight simulator that perfectly mimics a storm; when they actually hit the real weather in the real world, the AI doesn't "drift" or become inaccurate because it was born in that environment. This extreme optimization opens the door to running Tesla's AI on devices far smaller than a car.

🚀 The strategic roadmap: from AI5 to ubiquitous edge AI
This patent is not just a "nice-to-have" optimization; it is the mathematical prerequisite for Tesla's entire hardware roadmap. Without this "Mixed-Precision Bridge", the thermal and power equations for next-generation autonomy simply do not work.

It starts by unlocking the AI5 chip, which is projected to be 40x more powerful than current hardware. Raw power is useless if memory bandwidth acts as a bottleneck. By compressing 32-bit rotational data into dense, log-space 8-bit packets, this patent effectively quadruples the effective bandwidth, allowing the chip to utilize its massive matrix-compute arrays without stalling. This efficiency is critical for the chip's "half-reticle" design, which reduces silicon size to maximize manufacturing yield while maintaining supercomputer-level throughput.

This efficiency is even more critical for Tesla Optimus, where it is a matter of operational survival. The robot runs on a 2.3 kWh battery (roughly 1/30th of a Model 3 pack). Standard 32-bit GPU compute would drain this capacity in under 4 hours, consuming 500W+ just for "thinking". By offloading complex RoPE math to this hybrid logic, Tesla slashes the compute power budget to under 100W. This solves the "thermal wall", ensuring the robot can maintain balance and awareness for a full 8-hour work shift without overheating.

This stability directly enables the shift to End-to-End Neural Networks. The "Rotation Matrix" correction described in the patent prevents the mathematical "drift" that usually plagues long-context tracking. This ensures that a stop sign seen 30 seconds ago remains "pinned" to its correct 3D coordinate in the World Model, rather than floating away due to rounding errors.

Finally, baking this math into the silicon secures Tesla's strategic independence. It decouples the company from NVIDIA's CUDA ecosystem and enables a Dual-Foundry Strategy with both Samsung and TSMC to mitigate supply chain risks. This creates a deliberate "oversupply" of compute, potentially turning its idle fleet and unsold chips into a distributed inference cloud that rivals AWS in efficiency.

But the roadmap goes further. Because this mixed-precision architecture slashes power consumption by orders of magnitude, it creates a blueprint for "Tesla AI on everything". It opens the door to porting world-class vision models to hardware as small as a smart home hub or smartphone. This would allow tiny, cool-running chips to calculate 3D spatial positioning with zero latency—bringing supercomputer-level intelligence to the edge without ever sending private data to a massive cloud server.
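The "three specific stages—multiplying by constants like 1/3 and 1/2" in the recovery-architecture section above reads like a third-order Taylor expansion of the exponential evaluated in Horner form. A toy interpretation follows (a reconstruction from the thread's description, not code from the patent):

import math

def exp_horner3(x: float) -> float:
    """Third-order Taylor approximation of e^x in Horner form:
    1 + x*(1 + (x/2)*(1 + x/3)) — three multiply-accumulate stages."""
    acc = 1.0 + x / 3.0          # stage 1: multiply by 1/3, add 1
    acc = 1.0 + (x / 2.0) * acc  # stage 2: multiply by 1/2, accumulate
    acc = 1.0 + x * acc          # stage 3: final multiply-accumulate
    return acc

# Recover values from small log-domain residuals and compare against math.exp.
for log_theta in (-0.10, -0.01, 0.05, 0.20):
    approx, exact = exp_horner3(log_theta), math.exp(log_theta)
    print(f"x={log_theta:+.2f}  horner={approx:.6f}  exp={exact:.6f}  err={abs(approx - exact):.2e}")

For small log-domain residuals the cubic term already leaves errors around 1e-4 or better, which is how a handful of multiply-accumulate stages can stand in for a full exp() call.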

2
1
9
1.8K
Wojtek (WojTech) Trocki retweeted
Dev Ittycheria
Dev Ittycheria@dittycheria·
An IIT Delhi friend shared something shocking. When he graduated 95% of CS grads moved to the US. This year only 5% did. IITs don’t monopolize talent, but America’s comp adv has always been attracting the best. If many no longer want or can’t come, what does that mean for the US?
3
2
12
1.9K
Wojtek (WojTech) Trocki retweeted
Dev Ittycheria
Dev Ittycheria@dittycheria·
We just launched Voyage-context-3, a new embedding model that gives AI a full-document view while preserving chunk-level precision, offering better retrieval performance than leading alternatives. When building AI that reads and reasons over documents (such as reports, contracts, or medical records), it's critical to break those documents into smaller pieces, or "chunks," while still maintaining an understanding of the big picture. Most systems today lose important context, or require complicated workarounds to stitch it back together. blog.voyageai.com/2025/07/23/voy…
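One common workaround for the context loss described above is to stitch document-level context back onto every chunk before embedding it. A minimal sketch of that pattern (generic, not the Voyage API; embed() and the prefix format are placeholders for whatever embedding call and layout you actually use):

from typing import Callable

def chunk_with_context(doc_title: str, doc_summary: str, text: str,
                       embed: Callable[[str], list[float]],
                       chunk_size: int = 800) -> list[tuple[str, list[float]]]:
    """Split a document into chunks, prefixing each with document-level context
    so the embedding still 'sees' the big picture. embed() is a placeholder."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    prefixed = [f"Document: {doc_title}\nSummary: {doc_summary}\n\n{c}" for c in chunks]
    # Store the raw chunk alongside the context-aware vector.
    return [(c, embed(p)) for c, p in zip(chunks, prefixed)]

Context-aware embedding models aim to make this manual prefixing unnecessary by conditioning each chunk's vector on the whole document.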
2
13
25
2.7K
Wojtek (WojTech) Trocki
Wojtek (WojTech) Trocki@typeapi·
It's been a wild ride through countless hardware iterations and architectural decisions. I've run everything on it, from simple hobby apps to full-blown #Kubernetes and recently AI models. It's been a decade of invaluable, hands-on learning! 🛠️ #DevOps #SRE #Infrastructure
0
0
0
53
Wojtek (WojTech) Trocki
Wojtek (WojTech) Trocki@typeapi·
My self-hosted private cloud is 10 years old today! 🎂☁️ What started as a hobby to host projects and explore cloud tech became the foundation of my entire career. #Homelab #CloudNative
2
0
3
100
Wojtek (WojTech) Trocki retweeted
ℏεsam
ℏεsam@Hesamation·
"I use AI in a separate window. I don't enjoy Cursor or Windsurf, I can literally feel competence draining out of my fingers." @dhh, the legendary programmer and creator of Ruby on Rails has the most beautiful and philosophical idea about what AI takes away from programmers.
263
1.1K
10.3K
1.2M
Wojtek (WojTech) Trocki retweeted
Bilgin Ibryam
Bilgin Ibryam@bibryam·
Life is Lived in The Arena nav.al/arena @naval : Life is lived in the arena. You only learn by doing. And if you’re not doing, then all the learning you’re picking up is too general and too abstract. Then it truly is hallmark aphorisms. You don’t know what applies where and when. 🤯
0
3
6
1.5K
Wojtek (WojTech) Trocki retweeted
elvis
elvis@omarsar0·
Context Engineering Guide I'm writing a detailed guide on context engineering for AI devs. v1 is out now! (bookmark it) I use a concrete deep research multi-agent example to show what context engineering involves.
38
298
2K
287.5K
Wojtek (WojTech) Trocki retweeted
Gergely Orosz
Gergely Orosz@GergelyOrosz·
Always the same story: Google builds an amazing *internal-only* tool that is better than anything else out there. Never turns it into an external product. Microsoft meanwhile uses the same products they build for external devs. And this is why Microsoft beats Google w dev tools
Steren@steren

GitHub PR review tool is way behind Google's internal code review tool. For performance, usability, and features

43
91
2.5K
214K
Wojtek (WojTech) Trocki retweeted
Dev Ittycheria
Dev Ittycheria@dittycheria·
I’ve lived in India, Canada, the UK & Nigeria—and traveled the world—but the USA gave my family opportunities we couldn’t find anywhere else. It’s not perfect, but it’s uniquely generous to those who dream big and work hard. So grateful. 🇺🇸 #July4th
2
3
73
3.1K
Wojtek (WojTech) Trocki retweeted
Gergely Orosz
Gergely Orosz@GergelyOrosz·
Louder for people in the back: I'm sick and tired of half-baked AI features being released that make products worse and cannot be turned off. Gemini for Google Docs / Sheets is one of many examples (it reminds me daily of what a feature that should not have been released looks like).
DHH@dhh

AI is awesome, but do you know what else is awesome? Not releasing AI-powered features until you've actually built something that's way better than what it was without. Not every feature you build or explore has to ship! (Apple used to know this).

42
87
1.2K
103.4K
Wojtek (WojTech) Trocki retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
There's a new paper circulating that looks in detail at the LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879

I first became a bit suspicious when, at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few days it was worse than what I was used to. Conversely, as an example, around the same time Claude 3.5 was a top tier model in my personal use but it ranked very low on the arena. I heard similar sentiments both online and in person. And there were a number of other relatively random models, often suspiciously small, with little to no real-world knowledge as far as I know, yet they ranked quite high too. "When the data and the anecdotes disagree, the anecdotes are usually right." (Jeff Bezos on a recent pod, though I share the same experience personally).

I think these teams have placed different amounts of internal focus and decision making around LM Arena scores specifically. And unfortunately they are not getting better models overall but better LM Arena models, whatever that is. Possibly something with a lot of nested lists, bullet points and emoji.

It's quite likely that LM Arena (and LLM providers) can continue to iterate and improve within this paradigm, but in addition I also have a new candidate in mind to potentially join the ranks of "top tier eval". It is the @openrouter LLM rankings: openrouter.ai/rankings

Basically, OpenRouter allows people/companies to quickly switch APIs between LLM providers. All of them have real use cases (not toy problems or puzzles), they have their own private evals, and all of them have an incentive to get their choices right, so by choosing one LLM over another they are directly voting for some combo of capability+cost. I don't think OpenRouter is there just yet in both the quantity and diversity of use, but something of this kind I think has great potential to grow into a very nice, very difficult to game eval.
Arena.ai@arena

Thanks for the authors' feedback, we're always looking to improve the platform!

If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn't mean the leaderboard is biased; see the clarification below.

The leaderboard reflects millions of fresh, real human preferences. One might disagree with human preferences—they're subjective—but that's exactly why they matter. Understanding subjective preference is essential to evaluating real-world performance, as these models are used by people. That's why we're working on statistical methods—like style and sentiment control—to decompose human preference into its constituent parts. We are also strengthening our user base to include more diversity.

And if pre-release testing and data helps models optimize for millions of people's preferences, that's a positive thing! Pre-release model testing is also a huge part of why people come to LMArena. Our community loves being the first to test the best and newest AIs! That's why we welcome all model providers to submit their AIs to battle and win the preferences of our community. Within our capacity, we are trying to satisfy all requests for testing we get from model providers.

We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference. If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly. Every model provider makes different choices about how to use and value human preferences.

We helped Meta with pre-release testing for Llama 4, like we have helped many other model providers in the past. We support open-source development. Our own platform and analysis tools are open source, and we have released millions of open conversations as well. This benefits the whole community.

We agree with a few of this writeup's suggestions (e.g. implementing an active sampling algorithm) and are happy to consider more. Unfortunately, there are also a number of factual errors and misleading statements in this writeup.

- The simulation of LMArena, e.g. in Figures 7/8, is flawed. It's like saying: "The average 3-point percentage in the NBA is 35%. Steph Curry has the highest 3-point percentage in the NBA at 42%. This is unfair, because he comes from the distribution of NBA players, and they all have the same latent mean."
- We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly.
- Many of the numbers in the paper do not reflect reality: see the blog below (released a few days ago) for the actual statistics on the number of models tested from different providers. See also in thread our longstanding policy on pre-release testing. We have been doing so transparently with the support of our community for over a year.

182
418
4.3K
689.9K
Wojtek (WojTech) Trocki retweeted
Gergely Orosz
Gergely Orosz@GergelyOrosz·
I'm now doing "queries" like this (that I could not do before):

"Has there been an unusual spike of signups in the last 2 months?"
"What are suspicious-looking emails signing up recently? Any patterns?"
"Which domains have the highest number of signups?"
"How many unclaimed promo codes are left?"

The revolutionary part is not that this can be done in the IDE, TBH. It's that an LLM "layer" can be so easily added in front of my database! I could answer all of those questions by writing queries or even small batched processing... but for the effort, I would not bother. Like this, it's fast! And I discovered a batch of what looks like fraudulent signups that would have gone undetected if it was not so easy to interact with my DB! (Now working on cancelling out those codes!)
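The "LLM layer in front of my database" can be as thin as: translate the question to SQL, run it read-only, and hand the rows back. A rough sketch of that shape (ask_llm is a stand-in for whichever model API or CLI you call, and the toy schema is invented for illustration; this is not Gergely's actual setup):

import sqlite3

def ask_llm(prompt: str) -> str:
    """Stand-in for a real model call (hosted API, local model, CLI agent...)."""
    raise NotImplementedError("wire up your LLM of choice here")

SCHEMA = """
signups(id INTEGER, email TEXT, domain TEXT, created_at TEXT);
promo_codes(code TEXT, claimed INTEGER);
"""

def query_db(question: str, db_path: str = "app.db") -> list[tuple]:
    sql = ask_llm(
        f"Schema:\n{SCHEMA}\n"
        f"Write one read-only SQLite SELECT statement answering: {question}\n"
        "Return only the SQL."
    )
    # Open the database read-only so the generated SQL cannot modify anything.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# query_db("Which domains have the highest number of signups?")

Opening the connection read-only is the cheap guardrail that makes running model-generated SQL against a real database less nerve-wracking.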
11
9
187
55.7K