Evi

4.8K posts

Evi banner
Evi

Evi

@geteviapp

AI

San Francisco, USA · Joined February 2025
995 Following · 421 Followers
Evi
Evi@geteviapp·
@stochasticchasm No one else has the data they do so publishing the training “secrets” is easy for them :) don’t use someone’s business interests as a measure for your favoritism :)
English
0
0
0
61
Evi
Evi@geteviapp·
@martin_casado @stuffyokodraws Always feels like that until you actually need this for something, try to use a new model, and discover the “ragged frontier”. OpenAI paused video models for a reason :) the world models they will build instead are thinking video models
English
0
0
0
52
Evi
Evi@geteviapp·
@DanKulkov It obviously does: check the number of tokens for lower case vs upper case, and if you mess up the case it is even worse (a 2-4x difference)
English
0
0
0
27
Dan Kulkov
Dan Kulkov@DanKulkov·
i am glad capslock doesn't cost more tokens otherwise my screaming at opus would require $2000/mo plan
English
12
0
34
2.3K
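The case/token-count claim above can be illustrated with a toy tokenizer. This is a minimal sketch with a made-up vocabulary, not a real BPE model: it mimics how learned vocabularies (e.g. GPT-4's cl100k_base) give frequent lowercase words single-token entries while rarer ALL-CAPS variants get split into several pieces.

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Real BPE vocabularies are learned from data, where lowercase words are
# frequent enough to earn single-token entries while their ALL-CAPS
# variants usually are not.
VOCAB = {" hello", " world", " HE", "LL", "O", " WO", "R", "LD",
         "h", "e", "l", "o", "w", "r", "d", " "}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-prefix match, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])             # unknown character: 1 token
            i += 1
    return tokens

lower = tokenize(" hello world")
upper = tokenize(" HELLO WORLD")
print(len(lower), lower)   # 2 tokens: [' hello', ' world']
print(len(upper), upper)   # 6 tokens: same text, 3x the cost
```

With this vocabulary the upper-case string costs 3x the tokens of the lower-case one, which is the same ballpark as the 2-4x difference claimed in the reply.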
Evi
Evi@geteviapp·
@Austen Google models have bad default taste. Claude is shockingly good in design of new things based on vague prompts.
English
0
0
1
113
Erik Bernhardsson
Erik Bernhardsson@bernhardsson·
The sandbox revenue for @modal is now as much as the total revenue of the company 9 months ago
English
22
9
529
51.1K
Evi
Evi@geteviapp·
@apjacob03 Compile it to 80x86 and run it on a Mac to add one more layer, and inside a container just for fun :)
English
0
0
0
94
Athul Paul Jacob
Athul Paul Jacob@apjacob03·
We compiled the transformer VM itself to WebAssembly (WASM) and paired it with a WASM-compiled C compiler running in the browser locally. This is basically 3 nested virtual machines: a WASM compiler producing bytecode, which gets tokenized and fed to a transformer that simulates WASM execution, itself running as WASM. 😅
English
5
8
97
4.8K
Evi
Evi@geteviapp·
@israelwegierski @cramforce Running a transformer locally makes no sense because you get a small batch size, and the space and cooling setup is inconvenient. There is no known economically sensible way to run a 4-10T SOTA model on premises. Small models like the one in the iPhone camera are fine of course, but not LLMs.
English
0
0
0
24
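The batch-size argument above is really an arithmetic claim: a GPU serving one local user wastes the throughput that batched serving would share across many. A back-of-envelope sketch, where the dollar rate, tokens/s, and the ~40x batched-throughput multiplier are all illustrative assumptions, not measured figures:

```python
# Why batch size dominates local-LLM economics.
# All numbers here are illustrative assumptions, not measurements.
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Amortized GPU cost to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    return gpu_dollars_per_hour * seconds / 3600

solo = cost_per_million_tokens(2.0, 30)          # batch size 1: one local user
served = cost_per_million_tokens(2.0, 30 * 40)   # batched serving: ~40x aggregate throughput
print(f"batch=1:  ${solo:.2f} per 1M tokens")
print(f"batched:  ${served:.2f} per 1M tokens")
```

Under these assumptions the per-token cost differs by the full throughput multiplier, which is why hosted inference can undercut a single-user local setup even before space and cooling are counted.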
Israel Wegierski
Israel Wegierski@israelwegierski·
Hey @cramforce — do you see a future where coding agents (OpenCode-style) run entirely on serverless primitives (Chat SDK, AI SDK, Workflows, Sandbox), or will they always need a persistent runtime layer?
English
1
0
2
1.8K
Evi
Evi@geteviapp·
@ItsBrain4Brain @twlvone @GordonWetzstein Modern LLMs produce logprobs over a ~100k-token dictionary; projecting (i.e. selecting a specific token from those using the logprobs and other params like T) is a kind of tool; you may even call it a harness
English
0
0
1
9
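The "projection" step described above (turning logprobs into a single chosen token, using parameters like temperature T) can be sketched in a few lines. This is a generic temperature-sampling sketch using only the standard library, not any specific provider's implementation:

```python
import math
import random

def sample_token(logits, temperature=1.0, seed=None):
    """Pick a token index from raw logits: the 'projection' step.
    temperature < 1 sharpens the distribution, > 1 flattens it;
    temperature <= 0 is treated as greedy argmax decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    r = rng.random()                           # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(logits) - 1                     # guard against float rounding

logits = [2.0, 1.0, 0.1]
print(sample_token(logits, temperature=0))     # greedy: index 0
```

Everything downstream of the logprobs, including this choice of sampler, sits outside the model weights, which is the sense in which it is "a kind of tool".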
Gordon Wetzstein
Gordon Wetzstein@GordonWetzstein·
High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵
English
24
109
1.1K
142.7K
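The quadratic-scaling claim in the thread above is easy to check numerically. A sketch, assuming a 16-pixel patch size (an illustrative choice, not a figure from the paper):

```python
def attention_cost(height, width, patch=16):
    """Token count and pairwise self-attention cost for a DiT-style model.
    The 16-pixel patch size is an illustrative assumption."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens ** 2                 # self-attention is O(n^2) in tokens

for res in (512, 1024, 2048):
    n, cost = attention_cost(res, res)
    print(f"{res}x{res}: {n} tokens, {cost:,} attention pairs")
```

Doubling the resolution quadruples the token count and multiplies the attention cost by 16, which is the wall the thread refers to and the motivation for spending full resolution only where it matters.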
Evi
Evi@geteviapp·
@LLMJunky If you observe the commit times the bro is clearly in EU/UK, most likely from their London office :)
English
0
0
0
7
am.will
am.will@LLMJunky·
Been digging through the Codex CLI repo and there's a new multi-agent system being built: Multi Agent v2 Here's what's changing and why it matters for agent orchestration 👇🧵
am.will tweet media
English
19
8
110
49K
Peter Gostev (SF: 29 Mar - 3 Apr)
I'm very curious about OpenAI's planned intern researcher release by September this year. Having tried using current LLMs for OpenAI's Golf Challenge, I would say that Codex & Opus are actively bad researchers (no meaningful difference between them):
- They come up with small ideas anchored to what we have
- They find it hard to step back and try another route
- They set up bad experiments with fallbacks and other cheats
- They are terrible judges of what is actually meaningful: every idea is rated 9/10 and then nothing works

Bear in mind that some of these things are not too bad in the land of software; e.g. in software you do want reasonable fallbacks, and you do want to build something that works, which might mean a smaller iteration rather than tearing the whole thing down. But for research (even in a small sense, e.g. tuning a prompt) this is really bad. I want the models to genuinely step back and assess whether they are barking up the wrong tree. I want them to design clean experiments that don't muddy the water with dumb fallbacks that make it seem like something is working.

It doesn't feel obvious to me that you could easily have a single LLM (in the short term at least) that could be both a great software engineer and a great researcher. I'm sure we'll get there at some point, but I'd bet that the research intern will feel quite different from Codex, if it is actually good at research.
English
5
3
67
8.2K
Evi
Evi@geteviapp·
@a1zhang Try gpt-4 (the original “sparks of AGI” GPT) with the modern Codex harness. You’ll see the harness doesn’t help much if the model didn’t learn lots of skills during training. It is the same way agents overtook workflows in usefulness, completion rates, and quality. The harness is temporary.
English
0
0
0
120
alex zhang
alex zhang@a1zhang·
guess we disagree
alex zhang tweet media
Mike Knoop@mikeknoop

LLM systems swallow harness progress. The most general/universal LLM innovations migrate from client-side harnesses to server-side tools.

Innovation typically happens first inside the harness. For example, AI reasoning was originally a harness around GPT-3 ("let's think step by step"). This approach worked so well that it migrated behind the API as a tool (competitive reasons were also a factor; but general utility dominated). Many wouldn't think of AI reasoning as a tool but it definitely is (it's a tool to do natural language program synthesis -- but that's another topic). The same happened with code interpreter, which started out as a client-side harness and moved server-side.

These tools are made available at inference time to the model alongside specific training to teach the model when and how to use each tool. Because of this, the line between tool and model can get quite blurry. Best to consider such tools as "internal" to the LLM system.

This is actually a good test of how general a harness feature is. If a feature remains "stuck" client-side, say inside codex or claude code, then it's likely very task- or domain-specific. Client-side harnesses typically encode a lot of human G factor for specific domains. Whereas tools, due to usage pressure of frontier LLMs, are required to be as general as possible else they wouldn't make the cut.

So if you care about measuring AGI it's a good idea to pay attention to default LLM system capabilities behind high-usage LLM APIs. And if you care about bleeding-edge research ideas, such as RLMs, it's a good idea to pay attention to harness innovation.

Ultimately, AGI will not depend on a harness in the same sense humans don't depend on a harness.

English
5
10
184
22.8K
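The "let's think step by step" harness mentioned in the quoted post can be sketched concretely. This is a hypothetical minimal client-side reasoning harness; `complete` is a stand-in for any text-completion API and is stubbed here so the example runs offline:

```python
# Minimal sketch of a client-side "reasoning harness": prompt scaffolding
# wrapped around a plain completion call, as described in the quoted post.
def complete(prompt: str) -> str:
    # Stub model: a real harness would call an LLM API here.
    return "The bat costs $1.05 and the ball costs $0.05.\nAnswer: $0.05"

def cot_harness(question: str) -> str:
    """Client-side chain-of-thought: this prompt trick lived in harnesses
    like this before it migrated behind APIs as built-in reasoning."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    transcript = complete(prompt)
    # The harness, not the model, decides how to extract the final answer.
    return transcript.rsplit("Answer:", 1)[-1].strip()

print(cot_harness("A bat and a ball cost $1.10; the bat costs $1 more..."))
# → $0.05 (from the stubbed completion)
```

Note that both the prompt scaffolding and the answer extraction live client-side, which is exactly the kind of logic the post argues gets "swallowed" by the LLM system once it proves general enough.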
Evi
Evi@geteviapp·
@intellectronica Also numbers are recent and unnatural! And electricity is dangerous! Should we mention nuclear?
English
0
0
0
8
Eleanor Berger
Eleanor Berger@intellectronica·
Wooohooo ... thinking effort in @code chat!!
Eleanor Berger tweet media
English
2
3
12
1.9K
Evi
Evi@geteviapp·
@eastdakota @Cloudflare You ok? The paper is a year old, and if you read it you’ll learn that the blog post exaggerates the positives and neglects the negatives.
English
1
0
3
274
Matthew Prince 🌥
Matthew Prince 🌥@eastdakota·
This is Google’s DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Lots of teams at @Cloudflare focused on these areas. #staytuned
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

English
24
34
635
192.8K
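To see why the quoted "at least 6x" KV-cache compression claim matters, it helps to size the cache. A rough sketch with the standard sizing formula; the model dimensions below are illustrative, roughly Llama-3-70B-like, not taken from the TurboQuant blog:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence: 2 tensors (K and V) x layers x
    KV heads x head dim x sequence length x bytes per value (2 for fp16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative dimensions (roughly Llama-3-70B-like, fp16 cache, 128k context):
full = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
compressed = full / 6                          # the claimed >= 6x reduction
print(f"fp16 KV cache:  {full / 2**30:.1f} GiB")
print(f"6x-compressed:  {compressed / 2**30:.1f} GiB")
```

At long contexts the KV cache rivals the weights themselves in memory, so a 6x reduction directly translates into more concurrent sequences per GPU, which is the multi-tenant-utilization angle in the tweet above.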
Simon Høiberg
Simon Høiberg@SimonHoiberg·
@AdolfoUsier You generally don't recognize sarcasm when you see it? Or I did a poor job at making it clear enough here? 😅
English
3
0
14
1.4K
Evi
Evi@geteviapp·
@Dorialexander Exactly! You might still want to ask Codex to run Transformers or TRL stuff to check the new implementation, but indeed: instead of adding configurability, just get your agent to rewrite the whole thing. LLM is the new compiler!
English
0
0
0
23
Alexander Doria
Alexander Doria@Dorialexander·
Spent nearly a month last year (w/ previous claude) to implement something that was nailed in 10 minutes now
English
4
1
21
1.5K
Evi
Evi@geteviapp·
@theo Local LLM people didn’t properly use Codex and thus miss out on understanding its capabilities.
English
0
0
0
107
Theo - t3.gg
Theo - t3.gg@theo·
“Everything Claude Code can do, for free” Local model people have lost touch with reality
thestreamingdev()@thestreamingdev

→ Search the web for live sports scores and stock prices
→ Find files on my desktop and run shell commands
→ Write code and solve math problems
→ Everything Claude Code does — for free

The breakthroughs that made this possible:

Apple's "LLM in a Flash" paper showed models can page from SSD using unified memory. I proved it works in practice on consumer hardware — not just in a research lab.

Google's TurboQuant research showed you can compress KV cache with zero quality loss. I applied this with two server flags and doubled my context window from 32K to 64K tokens. For free. No code changes.

The biggest surprise: the 35B model at 2.6 bits per weight was supposed to have "broken" tool calling. Every agent framework I tried failed — infinite loops, no answers. I stopped asking the model to generate JSON function calls. Instead I ask it simple questions. "Is this a search, shell, or chat?" → one word answer. Works perfectly. The tool calling wasn't broken. The protocol was wrong.

Both models. Full agent. Same $600 computer:
→ 35B MoE: 30 tok/s, 2x faster, smarter reasoning
→ 9B dense: 16 tok/s, 64K context, reads entire codebases

I benchmarked everything:
→ 212 math problems: 86.3% accuracy (3 categories at 100%)
→ 10 web search categories: 10/10 accurate
→ Shell commands: finds videos, checks disk space, reads code
→ MLX vs llama.cpp: tested both, llama.cpp wins for 35B

English
115
25
1.3K
173.5K
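The one-word routing protocol described in the quoted post (ask "is this a search, shell, or chat?" instead of demanding JSON function calls) can be sketched as follows. This is a hypothetical reconstruction, not the author's code; `complete` is a stub standing in for a local model server:

```python
# Sketch of routing via a one-word classification question, the protocol
# the quoted post describes for small quantized models whose JSON tool
# calling is unreliable. `complete` is a stub; a real setup would call a
# local model server (e.g. llama.cpp) here.
def complete(prompt: str) -> str:
    # Stub model: answers "shell" for one canned input, "chat" otherwise.
    return "shell" if "disk space" in prompt else "chat"

ROUTES = {"search", "shell", "chat"}

def route(user_message: str) -> str:
    """Ask the model a one-word question and dispatch on the answer."""
    prompt = (f"User message: {user_message!r}\n"
              "Is this a search, shell, or chat request? Answer one word.")
    word = complete(prompt).strip().lower()
    return word if word in ROUTES else "chat"  # safe default on malformed output

print(route("how much disk space is left?"))   # shell
print(route("tell me a joke"))                 # chat
```

The design point is that a one-word answer is far easier for a heavily quantized model to produce reliably than well-formed JSON, and the closed set of routes makes malformed output trivially recoverable.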