Core Francisco Park @ NeurIPS2025

@corefpark
@Harvard. Science of Intelligence

🚨New preprint! In-context learning underlies LLMs’ real-world utility, but what are its limits? Can LLMs learn completely novel representations in-context and flexibly deploy them to solve tasks? In other words, can LLMs construct an in-context world model? Let’s see! 👀




Today we present a new framework for measuring human-like general intelligence in machines (what some people call AGI). Conventional AI benchmarks assess only narrow capabilities in a limited range of human activities.

We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how, and how well, they play and learn to play all conceivable human games, what we call the "Multiverse of Human Games".

Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to automatically construct standardized and containerized variants of popular human games on digital gaming platforms.

As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games.

Check out our website to play the games, see how agents play, and build agents to solve them!

Introducing Digital Red Queen (DRQ): Adversarial Program Evolution in Core War with LLMs

Blog: sakana.ai/drq

Core War is a programming game where self-replicating assembly programs, called warriors, compete for control of a virtual machine. In this dynamic environment, where there is no distinction between code and data, warriors must crash opponents while defending themselves to survive.

In this work, we explore how LLMs can drive open-ended adversarial evolution of these programs within Core War. Our approach is inspired by the Red Queen Hypothesis from evolutionary biology: the principle that species must continually adapt and evolve simply to survive against ever-changing competitors.

We found that running our DRQ algorithm for longer durations produces warriors that become more generally robust. Most notably, we observed an emergent pressure towards convergent evolution. Independent runs, starting from completely different initial conditions, evolved toward similar general-purpose behaviors, mirroring how distinct species in nature often evolve similar traits to solve the same problems.

Simulating these adversarial dynamics in an isolated sandbox offers a glimpse into the future, where deployed LLM systems might eventually compete against one another for computational or physical resources in the real world.

This project is a collaboration between MIT and Sakana AI led by @akarshkumar0101

Full Paper (Website): pub.sakana.ai/drq/
Full Paper (arxiv): arxiv.org/abs/2601.03335
Code: github.com/SakanaAI/drq/
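To make the "no distinction between code and data" point concrete: the classic one-instruction Core War warrior, the Imp (`MOV 0, 1` in Redcode), simply copies its own instruction one cell ahead and then executes the copy, marching through memory forever. The toy VM below is a drastic simplification of a real MARS (only relative `MOV` and a fatal `DAT`), written in Python purely for illustration.

```python
# Toy illustration of Core War's key idea: code and data share one
# circular memory (the "core"), so a program can copy itself as data.
# This is NOT a real MARS; it supports only MOV (direct relative
# addressing) and DAT (executing data kills the process).
CORE_SIZE = 16

def make_core():
    # An empty core is filled with DAT cells; executing one is fatal.
    return [("DAT", 0, 0)] * CORE_SIZE

def step(core, pc):
    """Execute one instruction; return the next pc, or None if the process dies."""
    op, a, b = core[pc]
    if op == "DAT":
        return None  # tried to execute data: process dies
    if op == "MOV":
        # Copy the cell at pc+a into the cell at pc+b (wrapping around).
        core[(pc + b) % CORE_SIZE] = core[(pc + a) % CORE_SIZE]
    return (pc + 1) % CORE_SIZE

core = make_core()
core[0] = ("MOV", 0, 1)  # the Imp: "copy myself one cell ahead"
pc = 0
for _ in range(5):
    pc = step(core, pc)

# The Imp has now written copies of itself into cells 1..5 and keeps marching.
print(core[:6])
```

Because the Imp's "program" is also its "payload", it overwrites everything in its path with copies of itself, which is exactly the kind of self-replicating behavior real warriors must attack and defend against.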




Check out our new Digital Red Queen work! Core War is a programming game where assembly programs fight against each other for control of a Turing-complete virtual machine. We ask what happens when an LLM drives an evolutionary arms race in this domain. We find that as you run our DRQ algorithm for longer, the resulting programs become more generally robust, while also showing evidence of convergence across independent runs, a sign of convergent evolution!
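The Red Queen dynamic described above can be sketched in a few lines. The code below is a hypothetical, stripped-down caricature, not the paper's actual DRQ algorithm: `mutate` stands in for the LLM-driven program edits and `fight` stands in for a real MARS battle, and "warriors" are just numbers. It only illustrates the loop structure: challengers must beat the current champion, and an archive of past champions tracks how robustness accumulates over time.

```python
# Hypothetical sketch of a Red Queen-style adversarial loop.
# NOT the DRQ algorithm from the paper: mutate() and fight() are toy
# stand-ins for LLM-driven code edits and Core War battles.
import random

random.seed(0)

def fight(a, b):
    # Toy stand-in for a Core War battle: higher "strength" wins.
    return a > b

def mutate(w):
    # Toy stand-in for an LLM proposing a modified warrior.
    return w + random.choice([-1, 1, 2])

champion, archive = 0, []
for _ in range(50):
    challenger = mutate(champion)
    if fight(challenger, champion):
        archive.append(champion)  # retired champions form the archive
        champion = challenger

# The surviving champion beats every past champion in the archive,
# a crude analogue of "longer runs produce more robust warriors".
wins = sum(fight(champion, old) for old in archive)
print(champion, wins, len(archive))
```

In this toy, robustness against the archive is guaranteed by construction; the interesting empirical finding in DRQ is that something similar emerges in the far messier, non-transitive setting of real Core War battles.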
