Nico

492 posts

@itmos__

Joined January 2025

2.7K Following · 116 Followers

Nico retweeted

Omar Khattab@lateinteraction·4d

Agents often externalize some context: a repository in coding agents, a corpus in RAG, and the user prompt in an RLM. New work by @astrogu_ shows that agents work better if they're allowed to manage a small buffer in their context window as a "cache" for that external context.

Joshua Gu@astrogu_

Recent agentic systems (Claude Code, Codex, RLM, etc.) push context out of the prompt and into the environment (e.g., as files). This helps them maintain long-term knowledge about their goals and functionality. 🚨 While this is a good idea, we show a surprising result: systems that use external environments like this perform much better when given a small, fixed-size, in-context, agent-managed cache that "peeks into" these environments. 🚀 Our paper, PEEK: a system for building and maintaining an orientation cache for LLM agents, introduces this idea. Compared with strong baselines, including RAG, Compaction Agents, and SOTA prompt-learning frameworks, PEEK dominates the cost–quality Pareto frontier: achieving +6.3–34.0% in quality, with fewer iterations and lower cost. Paper: arxiv.org/abs/2605.19932 GitHub: github.com/zhuohangu/peek More in the thread below! (1/N)

Nico@itmos__·4d

@JasonBotterill @Ken67547214 Ken is right, it's Lynchian weird, not anime weird. It's by Satoshi Kon, director of Perfect Blue, Millennium Actress, and Paprika. Famously ripped off by Aronofsky and Nolan. He was a top tier auteur, one of the goats

JB@JasonBotterill·4d

@Ken67547214 I don’t watch anime should I watch this it’s not weird right

Ken 無 (non-official taco bell affiliate)@Ken67547214·4d

I keep going back and forth on cameras in my house. I was pretty much fine with the tradeoffs of letting my own software watch me through a webcam at my desk, but I do not like the idea of having internet connected cameras distributed anywhere else in my home. However, I think I could come up with an affordable system that used a (mostly) airgapped mini-pc to extract coordinate, identity, and state data through a minimal data connection.

Nico@itmos__·5d

@N8Programs
- most interesting or surprising paper you've read in the last month?
- favorite film?
- the @voooooogel Claude chrome extension might be your magnum opus. How are you planning to top that?

N8 Programs@N8Programs·5d

Thank you so much to everyone for 10K followers! It's been my absolute honor to post my work here for the last few years - from Three.JS to Local LLMs to MLX to the research I do now. As is customary (and cringe), I'm doing an AMA - post any questions you have in the comments!

Nico retweeted

Niels Rogge@NielsRogge·18 May

Introducing a revival of PapersWithCode! As @ilyasut said, we're back to the "age of research". Hence, it's important to share research and build on each other's work.
> find SOTA per domain, not just LLMs
> leaderboards
> methods
> all parsed at scale using AI agents.

Nico@itmos__·5d

@holynski_ Wepa 🇵🇷

Aleksander Holynski@holynski_·6d

My whole life, I've wanted to be an elephant riding a motorcycle through my hometown. Now, it's finally possible.

Ben Poole@poolio

Real-world models are here! Stoked to share how we're bringing real-world locations to life by integrating Street View into Genie. Try it now at labs.google/fx/projectgenie and read the blog for more info: blog.google/innovation-and…

Nico retweeted

jason@jxnlco·18 May

jason from the codex team here, here's a draft on codex maxxing and the primitives i use on a daily basis jxnl.github.io/blog/writing/2… would love any feedback

Nico@itmos__·17 May

@techno_popgirl Happy birthday! 🥳 I'm listening to your wonderful music right now.

大正九年@techno_popgirl·16 May

Thank you for today's birthday wishes!

Nico retweeted

Sophie Wang@SophieLWang·12 May

"The Truth Lies Somewhere in the Middle (of the Generated Tokens)" In autoregressive language models, mean pooling hidden states across generation yields better representations than any token alone. project page: sophielwang.com/tokens w/ @phillip_isola and @thisismyhat

Nico@itmos__·11 May

@corsaren it's too taboo so most actual artists aren't touching it yet. I really like Jia Zhangke's short, mostly because I respect him as a director. It's metacommentary and doesn't stand on its own, but it's good and pretty funny:

corsaren@corsaren·11 May

Okay, but can someone with good taste please make one of these? Surely *someone* has an actual good short film idea they’ve been sitting on. I want more than mere technical impressiveness.

Just Another Pod Guy@TMTLongShort

You aren’t ready. You think you are, but you’re not.

Nico retweeted

Tomasz Limisiewicz@TomLimi·4 May

We present Compute Optimal Tokenization! 🔡 Commonly, LLM scaling works stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Nico retweeted

Ricardo Olmedo@rdolmedo_·4 May

Claude 3 Opus scored 4% on SWE-bench at release. Shockingly, a Pythia-scale model trained **only on pre-1931 data**, with a bit of fine-tuning, outperforms the April 2024 SOTA. Clearly, Opus is the better model. Why should we care about benchmarks, then? 👇🧵

Ricardo Olmedo@rdolmedo_

We fine-tuned Alec Radford’s 1930 vintage LLM to solve SWE-bench issues. After just ‼️250‼️ training examples, the model solves its first issue, a simple patch to the xarray library. 🧵👇

Nico@itmos__·4 May

@postimortem @_lopopolo by "tools and context" he's referring to integrations to your org's tools, plugins, access to your internal KB, etc. instead of your own harness with custom tool defs, session management, compaction algorithm, etc.

tim ganiev@postimortem·4 May

@_lopopolo but tools and context are the major part of harness...

Ryan Lopopolo@_lopopolo·3 May

Josh@JoshPurtell

One thing I've noticed about Harness discussion is there are some people who think it means *the cybernetics work you do to connect models to value* and others see it as a convention for terminal interaction tool shapes and there is some talking past one another btwn them

Nico@itmos__·28 Apr

@atelicinvest truth nuke

Unemployed Capital Allocator@atelicinvest·28 Apr

the actual monkey paw - you can make anything you ever wanted, but whatever you wanted to make isn't actually what you need at all and you just build up and up and up because doing it feels better than actually doing the hard work of figuring out what is useful instead of what it is you want to do it's all just a larp video game otherwise and performance productivity porn and 2 years later you look around and wonder what you did with all those billions of tokens you produced but you're too afraid to actually answer that question so you just keep on building just another billion tokens. i swear this time it'll be good.

goodalexander@goodalexander

AI monkey paw: you can make anything you ever wanted but nobody will buy it bc they can also make anything they ever wanted

Nico@itmos__·28 Apr

@max_spero_ Let me in!

Nico retweeted

wh@nrehiew_·26 Apr

Evaluated GPT 5.5 on Over Editing.
- It is less prone to over-editing code compared to GPT 5.4, but still lags behind Opus 4.6
- The difference between xhigh and non-reasoning is minimal
(Lower is better in these plots)

wh@nrehiew_

Frontier LLMs are doing too much when it comes to editing code. I'm excited to share this work on the Over-Editing problem which refers to models modifying code beyond what is asked of them. The main findings are:
- Many frontier models Over-Edit with GPT 5.4 being the biggest culprit
- Reasoning models have a higher natural tendency to Over-Edit compared to their non-reasoning counterparts
- RL is the best approach to train models to perform minimal code editing while preventing catastrophic forgetting compared to SFT, DPO and Rejection Sampling.
Blog and details below!

Nico retweeted

Max Spero@max_spero_·24 Apr

Over the last year, I've watched a rise in AI content on basically every internet platform. Seeing a viral AI-generated post used to be a rare find. Now it's a daily occurrence. Four months ago, we launched the @pangramlabs bot to help people check long posts and articles for AI slop without leaving the platform. And it blew up. We went from a niche tool used by academics to a core piece of cognitive security infrastructure. Today, we're taking it one step further. We're launching a Chrome extension that proactively scans all social content as you scroll, flagging AI content in real time so you can save your attention for what really matters: content authored by humans. At launch, the Pangram Chrome extension will proactively scan posts on X, LinkedIn, Reddit, Substack, and Medium. And we'll give you a feed health summary, so you can see exactly which accounts are putting AI slop on your feed. I'm so excited to share this with you all, and I hope you find it as useful as I do.

Pangram Labs@pangramlabs

Today we're releasing the Pangram Chrome Extension, which automatically flags AI-generated content as you scroll your feed. We're sick of having to constantly be on guard for AI slop on social media. For most of human history, if a piece of writing was grammatical, coherent, and well-structured, you were safe in assuming that somebody put some thought into producing it. That assumption no longer holds true: AI has severed the relationship between form and content, destroying the credibility signal we once relied on. The Pangram Chrome extension restores that signal. It scans your feed as you scroll, flagging AI-generated and AI-assisted content in real time and showing you how much of your feed is machine-written. Works on X, LinkedIn, Reddit, Substack, and Medium. New users get 2 weeks free. Install it here: pangram.com/solutions/chro…

Nico@itmos__·24 Apr

@Plinz @mattparlmer
> You are asking to exclude millions of people who cannot afford renting human drivers from being able to getting around
Waymos generally cost 1.5-2x as much as Uber.
> How about we build infrastructure again?
you mean public transportation? yeah I agree

Joscha Bach@Plinz·23 Apr

@mattparlmer You are asking to exclude millions of people who cannot afford renting human drivers from being able to getting around. That is an incredible social cost. Let's find other sources of income. How about we build infrastructure again?

mattparlmer 🪐 🌷@mattparlmer·23 Apr

The problem isn’t with the Waymo safety record, it’s that driverless taxis break a load bearing part of the post-2008 social settlement The ability to earn an income from driving and delivery apps has kept a lot of people afloat who would otherwise be entirely destitute

Timothy B. Lee@binarybits

Over the last ~100 million miles of driving, these are the five most serious crashes that could be plausibly blamed on Waymo, as judged by @chi_t_williams and me. If you looked at 10,000 miles of driving from 10,000 random human drivers you'd see much worse behavior.

Nico retweeted

wh@nrehiew_·22 Apr

Nico retweeted

ClaudeDevs@ClaudeDevs·17 Apr

Some of you ran into Opus 4.7 refusing normal code edits with "this might be malware" warnings. That was a bug on our side, not the model being cautious. Older builds applied a stale safety prompt that Opus 4.7 doesn't need. Run claude update or relaunch the app.

Nico@itmos__·17 Apr

@VictorTaelin they claim to have fixed "a lot of bugs" (no details) from yesterday. Can you resend the prompts? Notice any difference? x.com/alexalbert__/s…

Alex Albert@alexalbert__

A lot of bugs that folks may have hit yesterday when first trying Opus 4.7 are now fixed. Thanks for bearing with us🙏

Taelin@VictorTaelin·17 Apr

I don't think we're all hallucinating, there's something seriously wrong about 4.7. Just tried it on the same two prompts (what's the best GC approach for Bend). 4.7 simply lies a lot, ignores information right in its context, makes bad proposals. This is really weird?
