Evi

4.8K posts

Evi banner
Evi

Evi

@geteviapp

AI

San Francisco, USA · Joined February 2025
995 Following · 421 Followers
Evi
Evi@geteviapp·
@stochasticchasm No one else has the data they do so publishing the training “secrets” is easy for them :) don’t use someone’s business interests as a measure for your favoritism :)
English
0
0
0
61
Evi
Evi@geteviapp·
@martin_casado @stuffyokodraws Always feels like that until you actually need this for something, try to use a new model, and discover the “ragged frontier”. OpenAI paused video models for a reason :) the world models they will build instead are thinking video models
English
0
0
0
52
Evi
Evi@geteviapp·
@DanKulkov It obviously does: check the number of tokens for lower case vs upper case, and if you mess up the case it is even worse (a 2-4x difference)
English
0
0
0
27
Dan Kulkov
Dan Kulkov@DanKulkov·
i am glad capslock doesn't cost more tokens otherwise my screaming at opus would require $2000/mo plan
English
12
0
34
2.3K
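The case/token-count claim above can be illustrated with a toy tokenizer. This is a minimal sketch with a made-up vocabulary, not a real BPE model: it mimics how learned vocabularies (e.g. GPT-4's cl100k_base) give frequent lowercase words single-token entries while rarer ALL-CAPS variants get split into several pieces.

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Real BPE vocabularies are learned from data, where lowercase words are
# frequent enough to earn single-token entries while their ALL-CAPS
# variants usually are not.
VOCAB = {" hello", " world", " HE", "LL", "O", " WO", "R", "LD",
         "h", "e", "l", "o", "w", "r", "d", " "}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-prefix match, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])             # unknown character: 1 token
            i += 1
    return tokens

lower = tokenize(" hello world")
upper = tokenize(" HELLO WORLD")
print(len(lower), lower)   # 2 tokens: [' hello', ' world']
print(len(upper), upper)   # 6 tokens: same text, 3x the cost
```

With this vocabulary the upper-case string costs 3x the tokens of the lower-case one, which is the same ballpark as the 2-4x difference claimed in the reply.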
Evi
Evi@geteviapp·
@Austen Google models have bad default taste. Claude is shockingly good in design of new things based on vague prompts.
English
0
0
1
113
Erik Bernhardsson
Erik Bernhardsson@bernhardsson·
The sandbox revenue for @modal is now as much as the total revenue of the company 9 months ago
English
22
9
529
51.1K
Evi
Evi@geteviapp·
@apjacob03 Compile it to 80x86 and run it on a Mac to add one more layer, and inside a container just for fun :)
English
0
0
0
94
Athul Paul Jacob
Athul Paul Jacob@apjacob03·
We compiled the transformer VM itself to WebAssembly (WASM) and paired it with a WASM-compiled C compiler running in the browser locally. This is basically 3 nested virtual machines: a WASM compiler producing bytecode, which gets tokenized and fed to a transformer that simulates WASM execution, itself running as WASM. 😅
English
5
8
97
4.8K
Evi
Evi@geteviapp·
@israelwegierski @cramforce Running a transformer locally makes no sense because you get a small batch size, and the space and cooling setup is inconvenient. There is no known economically sensible way to run a 4-10T SOTA model on premises. Small models like the one in the iPhone camera are fine of course, but not LLMs.
English
0
0
0
24
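The batch-size argument above is really an arithmetic claim: a GPU serving one local user wastes the throughput that batched serving would share across many. A back-of-envelope sketch, where the dollar rate, tokens/s, and the ~40x batched-throughput multiplier are all illustrative assumptions, not measured figures:

```python
# Why batch size dominates local-LLM economics.
# All numbers here are illustrative assumptions, not measurements.
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Amortized GPU cost to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    return gpu_dollars_per_hour * seconds / 3600

solo = cost_per_million_tokens(2.0, 30)          # batch size 1: one local user
served = cost_per_million_tokens(2.0, 30 * 40)   # batched serving: ~40x aggregate throughput
print(f"batch=1:  ${solo:.2f} per 1M tokens")
print(f"batched:  ${served:.2f} per 1M tokens")
```

Under these assumptions the per-token cost differs by the full throughput multiplier, which is why hosted inference can undercut a single-user local setup even before space and cooling are counted.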
Israel Wegierski
Israel Wegierski@israelwegierski·
Hey @cramforce — do you see a future where coding agents (OpenCode-style) run entirely on serverless primitives (Chat SDK, AI SDK, Workflows, Sandbox), or will they always need a persistent runtime layer?
English
1
0
2
1.8K
Evi
Evi@geteviapp·
@ItsBrain4Brain @twlvone @GordonWetzstein Modern LLMs produce logprobs over a ~100k-token dictionary; projecting (i.e. selecting a specific token from those using the logprobs and other params like T) is a kind of tool; you may even call it a harness
English
0
0
1
9
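The "projection" step described above (turning logprobs into a single chosen token, using parameters like temperature T) can be sketched in a few lines. This is a generic temperature-sampling sketch using only the standard library, not any specific provider's implementation:

```python
import math
import random

def sample_token(logits, temperature=1.0, seed=None):
    """Pick a token index from raw logits: the 'projection' step.
    temperature < 1 sharpens the distribution, > 1 flattens it;
    temperature <= 0 is treated as greedy argmax decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    r = rng.random()                           # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(logits) - 1                     # guard against float rounding

logits = [2.0, 1.0, 0.1]
print(sample_token(logits, temperature=0))     # greedy: index 0
```

Everything downstream of the logprobs, including this choice of sampler, sits outside the model weights, which is the sense in which it is "a kind of tool".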
Gordon Wetzstein
Gordon Wetzstein@GordonWetzstein·
High-resolution image and video generation is hitting a wall because attention in DiTs scales quadratically with token count. But does every pixel need to be in full resolution? Introducing Foveated Diffusion: a new approach for efficient diffusion-based generation that allocates compute where it matters most. 1/7🧵
English
24
109
1.1K
142.7K
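The quadratic-scaling claim in the thread above is easy to check numerically. A sketch, assuming a 16-pixel patch size (an illustrative choice, not a figure from the paper):

```python
def attention_cost(height, width, patch=16):
    """Token count and pairwise self-attention cost for a DiT-style model.
    The 16-pixel patch size is an illustrative assumption."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens ** 2                 # self-attention is O(n^2) in tokens

for res in (512, 1024, 2048):
    n, cost = attention_cost(res, res)
    print(f"{res}x{res}: {n} tokens, {cost:,} attention pairs")
```

Doubling the resolution quadruples the token count and multiplies the attention cost by 16, which is the wall the thread refers to and the motivation for spending full resolution only where it matters.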
Evi
Evi@geteviapp·
@LLMJunky If you observe the commit times the bro is clearly in EU/UK, most likely from their London office :)
English
0
0
0
7
am.will
am.will@LLMJunky·
Been digging through the Codex CLI repo and there's a new multi-agent system being built: Multi Agent v2 Here's what's changing and why it matters for agent orchestration 👇🧵
am.will tweet media
English
19
8
110
49K
Peter Gostev (SF: 29 Mar - 3 Apr)
I'm very curious about OpenAI's planned intern researcher release by September this year. Having tried using current LLMs for OpenAI's Golf Challenge, I would say that Codex & Opus are actively bad researchers (no meaningful difference between them):
- They come up with small ideas anchored to what we have
- They find it hard to step back and try another route
- They set up bad experiments with fallbacks and other cheats
- They are terrible judges of what is actually meaningful: every idea is rated 9/10 and then nothing works

Bear in mind that some of these things are not too bad in the land of software; e.g. in software you do want reasonable fallbacks, and you do want to build something that works, which might mean a smaller iteration rather than tearing the whole thing down. But for research (even in a small sense, e.g. tuning a prompt) this is really bad. I want the models to genuinely step back and assess whether they are barking up the wrong tree. I want them to design clean experiments that don't muddy the water with dumb fallbacks that make it seem like something is working.

It doesn't feel obvious to me that you could easily have a single LLM (in the short term at least) that could be both a great software engineer and a great researcher. I'm sure we'll get there at some point, but I'd bet that the research intern will feel quite different from Codex, if it is actually good at research.
English
5
3
67
8.2K
Evi
Evi@geteviapp·
@a1zhang Try gpt-4 (the original “sparks of AGI” GPT) with the modern Codex harness. You’ll see the harness doesn’t help much if the model didn’t learn lots of skills during training. It is the same way agents overtook workflows in usefulness, completion rates, and quality. The harness is temporary.
English
0
0
0
120
alex zhang
alex zhang@a1zhang·
guess we disagree
alex zhang tweet media
Mike Knoop@mikeknoop

LLM systems swallow harness progress. The most general/universal LLM innovations migrate from client-side harnesses to server-side tools.

Innovation typically happens first inside the harness. For example, AI reasoning was originally a harness around GPT-3 ("let's think step by step"). This approach worked so well that it migrated behind the API as a tool (competitive reasons were also a factor; but general utility dominated). Many wouldn't think of AI reasoning as a tool but it definitely is (it's a tool to do natural language program synthesis -- but that's another topic). The same happened with code interpreter, which started out as a client-side harness and moved server-side.

These tools are made available at inference time to the model alongside specific training to teach the model when and how to use each tool. Because of this, the line between tool and model can get quite blurry. Best to consider such tools as "internal" to the LLM system.

This is actually a good test of how general a harness feature is. If a feature remains "stuck" client-side, say inside codex or claude code, then it's likely very task- or domain-specific. Client-side harnesses typically encode a lot of human G factor for specific domains. Whereas tools, due to usage pressure of frontier LLMs, are required to be as general as possible else they wouldn't make the cut.

So if you care about measuring AGI it's a good idea to pay attention to default LLM system capabilities behind high-usage LLM APIs. And if you care about bleeding-edge research ideas, such as RLMs, it's a good idea to pay attention to harness innovation.

Ultimately, AGI will not depend on a harness in the same sense humans don't depend on a harness.

English
5
10
184
22.8K
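The "let's think step by step" harness mentioned in the quoted post can be sketched concretely. This is a hypothetical minimal client-side reasoning harness; `complete` is a stand-in for any text-completion API and is stubbed here so the example runs offline:

```python
# Minimal sketch of a client-side "reasoning harness": prompt scaffolding
# wrapped around a plain completion call, as described in the quoted post.
def complete(prompt: str) -> str:
    # Stub model: a real harness would call an LLM API here.
    return "The bat costs $1.05 and the ball costs $0.05.\nAnswer: $0.05"

def cot_harness(question: str) -> str:
    """Client-side chain-of-thought: this prompt trick lived in harnesses
    like this before it migrated behind APIs as built-in reasoning."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    transcript = complete(prompt)
    # The harness, not the model, decides how to extract the final answer.
    return transcript.rsplit("Answer:", 1)[-1].strip()

print(cot_harness("A bat and a ball cost $1.10; the bat costs $1 more..."))
# → $0.05 (from the stubbed completion)
```

Note that both the prompt scaffolding and the answer extraction live client-side, which is exactly the kind of logic the post argues gets "swallowed" by the LLM system once it proves general enough.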
Evi
Evi@geteviapp·
@intellectronica Also numbers are recent and unnatural! And electricity is dangerous! Should we mention nuclear?
English
0
0
0
8
Eleanor Berger
Eleanor Berger@intellectronica·
Wooohooo ... thinking effort in @code chat!!
Eleanor Berger tweet media
English
2
3
12
1.9K
Evi
Evi@geteviapp·
@eastdakota @Cloudflare You ok? The paper is a year old, and if you read it you’ll learn that the blog post exaggerates the positives and neglects the negatives.
English
1
0
3
274
Matthew Prince 🌥
Matthew Prince 🌥@eastdakota·
This is Google’s DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Lots of teams at @Cloudflare focused on these areas. #staytuned
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

English
24
34
635
192.8K
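To see why the quoted "at least 6x" KV-cache compression claim matters, it helps to size the cache. A rough sketch with the standard sizing formula; the model dimensions below are illustrative, roughly Llama-3-70B-like, not taken from the TurboQuant blog:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence: 2 tensors (K and V) x layers x
    KV heads x head dim x sequence length x bytes per value (2 for fp16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative dimensions (roughly Llama-3-70B-like, fp16 cache, 128k context):
full = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
compressed = full / 6                          # the claimed >= 6x reduction
print(f"fp16 KV cache:  {full / 2**30:.1f} GiB")
print(f"6x-compressed:  {compressed / 2**30:.1f} GiB")
```

At long contexts the KV cache rivals the weights themselves in memory, so a 6x reduction directly translates into more concurrent sequences per GPU, which is the multi-tenant-utilization angle in the tweet above.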
Simon Høiberg
Simon Høiberg@SimonHoiberg·
@AdolfoUsier You generally don't recognize sarcasm when you see it? Or I did a poor job at making it clear enough here? 😅
English
3
0
14
1.4K
Evi
Evi@geteviapp·
@Dorialexander Exactly! You might still want to ask Codex to run Transformers or TRL stuff to check the new implementation, but indeed: instead of adding configurability, just get your agent to rewrite the whole thing. LLM is the new compiler!
English
0
0
0
23
Alexander Doria
Alexander Doria@Dorialexander·
Spent nearly a month last year (w/ previous claude) to implement something that was nailed in 10 minutes now
English
4
1
21
1.5K
Evi
Evi@geteviapp·
@theo Local LLM people didn’t properly use Codex and thus miss out on understanding its capabilities.
English
0
0
0
107
Theo - t3.gg
Theo - t3.gg@theo·
“Everything Claude Code can do, for free” Local model people have lost touch with reality
thestreamingdev()@thestreamingdev

→ Search the web for live sports scores and stock prices
→ Find files on my desktop and run shell commands
→ Write code and solve math problems
→ Everything Claude Code does — for free

The breakthroughs that made this possible:

Apple's "LLM in a Flash" paper showed models can page from SSD using unified memory. I proved it works in practice on consumer hardware — not just in a research lab.

Google's TurboQuant research showed you can compress KV cache with zero quality loss. I applied this with two server flags and doubled my context window from 32K to 64K tokens. For free. No code changes.

The biggest surprise: the 35B model at 2.6 bits per weight was supposed to have "broken" tool calling. Every agent framework I tried failed — infinite loops, no answers. I stopped asking the model to generate JSON function calls. Instead I ask it simple questions. "Is this a search, shell, or chat?" → one word answer. Works perfectly. The tool calling wasn't broken. The protocol was wrong.

Both models. Full agent. Same $600 computer:
→ 35B MoE: 30 tok/s, 2x faster, smarter reasoning
→ 9B dense: 16 tok/s, 64K context, reads entire codebases

I benchmarked everything:
→ 212 math problems: 86.3% accuracy (3 categories at 100%)
→ 10 web search categories: 10/10 accurate
→ Shell commands: finds videos, checks disk space, reads code
→ MLX vs llama.cpp: tested both, llama.cpp wins for 35B

English
115
25
1.3K
173.5K
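The one-word routing protocol described in the quoted post (ask "is this a search, shell, or chat?" instead of demanding JSON function calls) can be sketched as follows. This is a hypothetical reconstruction, not the author's code; `complete` is a stub standing in for a local model server:

```python
# Sketch of routing via a one-word classification question, the protocol
# the quoted post describes for small quantized models whose JSON tool
# calling is unreliable. `complete` is a stub; a real setup would call a
# local model server (e.g. llama.cpp) here.
def complete(prompt: str) -> str:
    # Stub model: answers "shell" for one canned input, "chat" otherwise.
    return "shell" if "disk space" in prompt else "chat"

ROUTES = {"search", "shell", "chat"}

def route(user_message: str) -> str:
    """Ask the model a one-word question and dispatch on the answer."""
    prompt = (f"User message: {user_message!r}\n"
              "Is this a search, shell, or chat request? Answer one word.")
    word = complete(prompt).strip().lower()
    return word if word in ROUTES else "chat"  # safe default on malformed output

print(route("how much disk space is left?"))   # shell
print(route("tell me a joke"))                 # chat
```

The design point is that a one-word answer is far easier for a heavily quantized model to produce reliably than well-formed JSON, and the closed set of routes makes malformed output trivially recoverable.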