Evi

4.8K posts

Evi banner
Evi

Evi

@geteviapp

AI

San Francisco, USA Beigetreten Şubat 2025
995 Folgt421 Follower
Evi
Evi@geteviapp·
@petergostev Note Amazon Ads revenue. It is fun how folks who didn’t read their SEC filings claim Amazon is about AWS lol
English
0
0
0
68
Peter Gostev (SF: 29 Mar - 3 Apr)
Open AI's initial ad revenue in context. Note that this is a log scale, otherwise it would look ridiculous. To be fair, 1/6th of New York Times in a matter of months is not bad at all
Peter Gostev (SF: 29 Mar - 3 Apr) tweet media
Stephanie Palazzolo@steph_palazzolo

New: OpenAI has surpassed $100m in ARR from its ads pilot, which launched 6 weeks ago. It's expanded to 600+ advertisers and plans to launch self-serve advertiser access in April. theinformation.com/briefings/excl…

English
4
3
64
7.9K
Evi
Evi@geteviapp·
@paraschopra @fchollet In another thread you said that rephrasing the original post (ie agreeing in a shallow way, not sharing net new unknown hard to gain discovery grounded in experimental evidence) is AI slop! Please work harder!
English
0
0
0
2
Paras Chopra
Paras Chopra@paraschopra·
@fchollet Yes, we need more evals that test different dimensions of intelligence. ARC-AGI 3 is well thought of and step in the right direction. We need harder but solvable benchmarks for AI. The exact terms we use for them (AGI / ASI / whatever) matter much less IMO.
English
1
0
5
834
François Chollet
François Chollet@fchollet·
If you care about the rate of AGI progress, you should be excited about a new eval that focuses research efforts by pointing out important gaps & providing a way to measure progress towards fixing them If instead you only care about having your preconceptions confirmed, too bad
English
46
26
468
21K
Evi
Evi@geteviapp·
@paraschopra Response: It could only be sad to see the point. Would you like anything else?
English
0
0
0
7
Paras Chopra
Paras Chopra@paraschopra·
More than half of the replies on my tweets is from bots who simply rephrase what I just tweeted. What's the point of it? It's really sad.
English
89
2
254
16.4K
Evi
Evi@geteviapp·
@stochasticchasm No one else has the data they do so publishing the training “secrets” is easy for them :) don’t use someone’s business interests as a measure for your favoritism :)
English
0
0
0
62
Evi
Evi@geteviapp·
@martin_casado @stuffyokodraws Always feels like that until you actually need this for something and try to use a new model and discover “ragged frontier”. OpenAI paused video models for a reason :) world models they will build instead are thinking video models
English
0
0
0
53
Evi
Evi@geteviapp·
@DanKulkov It obviously does, check number of tokens for lower case and upper case and if you mess up the case it is even worse (2-4x difference)
English
0
0
0
28
Dan Kulkov
Dan Kulkov@DanKulkov·
i am glad capslock doesn't cost more tokens otherwise my screaming at opus would require $2000/mo plan
English
12
0
34
2.3K
Evi
Evi@geteviapp·
@Austen Google models have bad default taste. Claude is shockingly good in design of new things based on vague prompts.
English
0
0
1
113
Erik Bernhardsson
Erik Bernhardsson@bernhardsson·
The sandbox revenue for @modal is now as much as the total revenue of the company 9 months ago
English
23
9
531
51.3K
Evi
Evi@geteviapp·
@apjacob03 Compile it to 80x86 and run on a Mac to add a layer more and inside a container just for fun:)
English
0
0
0
94
Athul Paul Jacob
Athul Paul Jacob@apjacob03·
We compiled the transformer VM itself to WebAssembly (WASM) and paired it with a WASM-compiled C compiler running in the browser locally. This is basically 3 nested virtual machines: a WASM compiler producing bytecode, which gets tokenized and fed to a transformer that simulates WASM execution, itself running as WASM. 😅
English
5
8
98
4.8K
Evi
Evi@geteviapp·
@israelwegierski @cramforce Running transformer locally makes no sense because you get small batch size, it is also inconvenient to have space and cooling setup. There is no known economically sensible way to run 4-10T SOTA model on premises. Small models like in iPhone camera are ok of course,but not LLMs.
English
0
0
0
25
Israel Wegierski
Israel Wegierski@israelwegierski·
Hey @cramforce — do you see a future where coding agents (OpenCode-style) run entirely on serverless primitives (Chat SDK, AI SDK, Workflows, Sandbox), or will they always need a persistent runtime layer?
English
1
0
2
1.8K
Evi
Evi@geteviapp·
@ItsBrain4Brain @twlvone @GordonWetzstein modern LLMs produce logprobs over 100k dictionary, projecting (i.e. selecting specific token from those using logprobs and other params like T) is a kind of a tool, you may even call that harness
English
0
0
1
9
Gordon Wetzstein
Gordon Wetzstein@GordonWetzstein·
High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵
English
24
110
1.1K
143.2K
Evi
Evi@geteviapp·
@LLMJunky If you observe the commit times the bro is clearly in EU/UK, most likely from their London office :)
English
0
0
0
7
am.will
am.will@LLMJunky·
Been digging through the Codex CLI repo and there's a new multi-agent system being built: Multi Agent v2 Here's what's changing and why it matters for agent orchestration 👇🧵
am.will tweet media
English
19
8
111
50.6K
Peter Gostev (SF: 29 Mar - 3 Apr)
I'm very curious about OpenAI's planned intern researcher release by September this year. Having tried using current LLMs for OpenAI's Golf Challenge, I would say that Codex & Opus are actively bad researchers (not meaningful difference between them) - They come up with small ideas anchored to what we have - they find it hard to step back and try another route - They set up bad experiments with fallbacks other cheats - They are terrible judges at what is actually meaningful - every idea is rated 9/10 and then nothing works Bear in mind that some of these things are not too bad in the land of software, e.g. in software you do want reasonable fallbacks, you do want to build something that works that might mean a smaller iteration rather than tearing the whole thing down. But for research (even in a small sense, e.g. tuning a prompt) - this is really bad. I want the models to be genuinely stepping back and assessing if they are barking up the wrong tree. I want them to design clean experiments that don't muddy the water with dumb fallbacks that make it seem like something is working. It doesn't feel obvious to me that you could easily have a single LLM (in the short term at least) that could be a great software engineer and a great researcher. I'm sure we'll get there at some point, but I'd best that the research intern would feel quite different to Codex, if it would actually be good at research.
English
5
3
68
8.2K
Evi
Evi@geteviapp·
@a1zhang Try gpt-4 (original “sparks of AGI” GPT) with modern Codex harness. You’ll see harness doesn’t help much if the model didn’t learn lots of skills during training. It is the same how agents overtook workflows in usefulness and completion rates and quality. Harness is temporary.
English
0
0
0
122
alex zhang
alex zhang@a1zhang·
guess we disagree
alex zhang tweet media
Mike Knoop@mikeknoop

LLM systems swallow harness progress. The most general/universal LLM innovations migrate from client-side harnesses to server-side tools. Innovation typically happens first inside the harness. For example, AI reasoning was originally a harness around GPT-3 ("let's think step by step"). This approach worked so well that it migrated behind the API as a tool (competitive reasons were also a factor; but general utility dominated). Many wouldn't think of AI reasoning as a tool but it definitely is (it's a tool to do natural language program synthesis -- but that's another topic). The same happened with code interpreter which started out as a client-side harness and moved server-side. These tools are made available at inference time to the model alongside specific training to teach the model when and how to use each tool. Because of this, the line between tool and model can get quite blurry. Best to consider such tools as "internal" to the LLM system. This is actually a good test of how general a harness feature is. If a feature remains "stuck" client-side, say inside codex or claude code, then it's likely very task- or domain- specific. Client-side harnesses typically encode a lot of human G factor for specific domains. Whereas tools, due to usage pressure of frontier LLMs, are required to be as general as possible else they wouldn't make the cut. So if you care about measuring AGI it's a good idea to pay attention to default LLM system capabilities behind high usage LLM APIs. And if you care about bleeding edge research ideas, such as RLMs, it's a good idea to pay attention to harness innovation. Ultimately, AGI will not depend on a harness in the same sense humans don't depend on a harness.

English
5
10
185
22.9K
Evi
Evi@geteviapp·
@intellectronica Also numbers are recent and unnatural! And electricity is dangerous! Should we mention nuclear?
English
0
0
0
8
Eleanor Berger
Eleanor Berger@intellectronica·
Wooohooo ... thinking effort in @code chat!!
Eleanor Berger tweet media
English
2
3
12
1.9K
Evi
Evi@geteviapp·
@eastdakota @Cloudflare You ok? The paper is 1y old and if you read it you’ll learn that blog post exaggerates the positive sides and neglects negatives.
English
1
0
3
278
Matthew Prince 🌥
Matthew Prince 🌥@eastdakota·
This is Google’s DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Lots of teams at @Cloudflare focused on these areas. #staytuned
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

English
25
34
637
193.2K