spaceCrumbs

37 posts

@CrumbsSpace

Joined November 2022
29 Following, 2 Followers
Eduardo C. Garrido-Merchán
Eduardo C. Garrido-Merchán@edugarmer·
Destroyed by LLMs, we are currently researching Fisher information for Bayesian optimization and trying to improve Joint Entropy Search, but our feeling is that nobody will care. They are using transformers for BO and not comparing them with information theoretic approaches... sad.
3
0
4
2.2K
Andrew Gordon Wilson
Andrew Gordon Wilson@andrewgwils·
Sometimes I miss the days when people were passionately fighting about MCMC versus variational methods, or whether posterior tempering is problematic. We should have a nostalgia ICML 2010. You can submit AI slop, but expect a 2010 era reaction. What happened to our field?
11
7
277
46.2K
elie
elie@eliebakouch·
the "small" model behind this demo is a 276B total 12B active MoE (larger pretrains are cooking), sparsity ratio looks pretty standard compared to open models of the same size
Thinking Machines@thinkymachines

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…

6
6
215
43.3K
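For scale, the two parameter counts quoted in elie's post can be turned into an active fraction. This is a minimal sketch: the function name is illustrative, only the 276B and 12B figures come from the post, and "sparsity ratio" is read here as total over active parameters, which is an assumption about the intended definition.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of a mixture-of-experts model's parameters used per token.

    Both arguments are in billions of parameters. The name and signature
    are illustrative, not taken from any library.
    """
    if total_params_b <= 0 or not 0 < active_params_b <= total_params_b:
        raise ValueError("expected 0 < active <= total")
    return active_params_b / total_params_b


# Figures quoted in the post: 276B total, 12B active.
fraction = active_fraction(276, 12)
print(f"active fraction: {fraction:.1%}")           # share of weights used per token
print(f"total/active ratio: {1 / fraction:.0f}x")   # one common reading of "sparsity ratio"
```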
Linus Mixson
Linus Mixson@LinusMixson·
@aakashgupta This post is so bad that it's kind of insulting to LeCun. It's also a bit ironic that it was written by a default-settings LLM and that literally no effort was made to disguise that.
2
0
22
1.4K
Aakash Gupta
Aakash Gupta@aakashgupta·
Yann LeCun closed $1.03B for AMI Labs on March 10. Three days later, this paper dropped from his NYU collaborators. 15M parameters. Single GPU. A few hours of training.
LeWorldModel is the first JEPA that trains end-to-end from raw pixels. Two loss terms: predict the next embedding, keep the latent space Gaussian. Previous JEPAs needed exponential moving averages or pretrained encoders to avoid representation collapse. LeWM doesn't. Six hyperparameters down to one.
The numbers are the story. Foundation-model-based world models require hundreds of millions of parameters and serious compute to plan a control task. LeWM plans up to 48x faster while staying competitive on 2D and 3D benchmarks. The whole thing fits on a laptop GPU.
Look at the trajectory. Yann announced his Meta departure in November 2025 after 12 years and called founding FAIR his "proudest non-technical accomplishment." On March 10, 2026, AMI Labs closed the largest seed round in European history at a $3.5B pre-money valuation. Bezos, Nvidia, Samsung, and Toyota all wrote checks.
Three days later: a paper showing that JEPA-from-pixels is no longer fragile and no longer compute-heavy. The engineering scaffolding that made it look like an academic curiosity is gone. The authors sit at Mila, NYU, Samsung SAIL, and Brown. None at Meta.
Yann's bet was that the path to machine intelligence runs through world models, not language models. He left a public company to build it. Each JEPA paper from his network resets the assumed cost structure for that bet. This one makes world modeling laptop-cheap.
Meta still has the GPUs. The architecture left.
87
331
2.4K
234.5K
Marc Rußwurm
Marc Rußwurm@MarcCoru·
🔥 New #ICML2026 Paper accepted 🔥 by Arjun Rao with Tessa Ooms, Ruth Castro, @kklmmr @david_rolnick
Paper: openreview.net/forum?id=eWQQ0…
Code: github.com/arjunarao619/S…
TL;DR: We propose Slepian functions as localized, spatially concentrated basis functions for regional location encoding. Building on spherical harmonics, Slepians allow location encoders to allocate higher resolution where it is most needed, for example in regions with denser observations or where the underlying geospatial field varies at finer spatial scales, such as land compared to oceans. This work connects geospatial AI, implicit neural representations, and functional modeling of Earth system data fields.
1
16
151
10.2K
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@Ibelick You should've checked out mikupad on GitHub. It does the same thing but better.
0
0
0
325
Ibelick
Ibelick@Ibelick·
select any word to explore better options and see next-token probabilities
21
15
983
65.3K
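A rough idea of what "next-token probabilities" means in a tool like this: the model emits one raw score (logit) per candidate token, and a softmax turns those scores into probabilities that can be ranked. The sketch below is not Ibelick's or mikupad's code; the function name and the toy logits are made up for illustration.

```python
import math


def top_alternatives(logits: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable tokens with their softmax probabilities.

    `logits` maps candidate tokens to raw model scores (hypothetical input).
    """
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy scores for the word after "The weather is"; the values are invented.
for token, prob in top_alternatives({"nice": 2.0, "bad": 1.0, "purple": -1.0}):
    print(f"{token}: {prob:.2f}")
```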
Edoardo Ponti
Edoardo Ponti@PontiEdoardo·
Trying to win the consolation prize of the rejected paper with the highest average score at @icmlconf. Any contenders?
13
9
346
43.6K
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@FioraStarlight I thought you meant output > prompt by reverse SFT... Anyway, if you mean prompt > unwanted behavior, that's literally how preference datasets are built. The rejected field would have near-misses, misalignment, all sorts of wrong behavior.
0
0
0
638
Fiora Starlight
Fiora Starlight@FioraStarlight·
Do people ever do, like, reverse SFT? Like, constructing an example of an unwanted behavior, and having the loss signal say "you should put zero probability on every token in this sequence"?
18
0
101
12.9K
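The idea in Fiora's question can be written down as a per-token penalty of -log(1 - p), which is zero when the model puts no probability on the unwanted token and grows as p approaches 1. The sketch below is one possible reading, not a reference implementation; the function name, the `eps` clamp, and the example record's field names are assumptions (the reply above only mentions a "rejected" field).

```python
import math


def unlikelihood_loss(token_probs: list[float], eps: float = 1e-12) -> float:
    """Mean of -log(1 - p) over the tokens of an unwanted sequence.

    `token_probs` holds the probability the model assigned to each token
    of the unwanted sequence. `eps` avoids log(0) when p is exactly 1.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sum(-math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)


# A preference-style record, as in the reply above. Field names are illustrative.
record = {
    "prompt": "Summarize the report.",
    "chosen": "A faithful two-sentence summary.",
    "rejected": "A summary that invents a statistic.",
}

print(unlikelihood_loss([0.9, 0.8]))  # model still likes the unwanted tokens: high loss
print(unlikelihood_loss([0.1, 0.05]))  # model mostly avoids them: low loss
```

In a preference setup the penalty would apply to tokens of the `rejected` text, while ordinary SFT loss applies to `chosen`.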
Chahat Sharma
Chahat Sharma@Chahatxsharma·
Stop pretending LLM-as-a-judge is ground truth. It literally cannot see the private information driving real human preferences. How do you autorate when drift, multi-turn, and tie thresholds break everything? Arena just exposed it all in one whiteboard. Who else is rethinking their evals right now?
1
0
1
246
Arena.ai
Arena.ai@arena·
Where do autoraters break down? Arena researchers Li Chen and I-Hung Hsu walk through how they'd build an autorater from scratch (different kinds of autoraters, training objectives, what dimensions actually matter to rate on), then get into what makes it hard in practice: preference drift, multi-turn evaluation, tie threshold variance, and the gap between LLM-as-a-judge and real human subjectivity. Watch on YouTube to see the whiteboard details (link in 🧵 thread)
0:00 Evaluation granularity: general vs. per-category vs. per-response
2:05 Applications of autoraters as RL reward signals and test-time scaling
3:03 Output design for pairwise autorater: scores, comparison, and ties
4:03 Verbal and visual feedback autoraters
4:48 Training for pairwise autorater: Bradley-Terry loss, threshold design
9:43 Real-world challenges: preference shifts over time
10:30 Multi-turn autorating and usage simulation
11:35 Tie threshold variance across annotators
12:18 Long-context evaluation challenges
13:02 Confidence intervals and score uncertainty
14:00 Why LLM-as-a-judge fails to capture subjective human preference
15:20 The private information unobservable in human evaluation
16:14 Model evolved to be stronger makes training data harder
17:08 Signal vs. noise in human preference data
18:04 How do you autorate an autorater?
6
13
122
20.6K
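The 4:48 chapter mentions a Bradley-Terry loss and threshold design for a pairwise autorater. A minimal sketch of the standard Bradley-Terry objective follows, where the probability that response A beats response B is sigmoid(score_a - score_b). The tie rule (call it a tie when the score gap is below a threshold) is an assumption made for illustration, not Arena's method, and all names are hypothetical.

```python
import math


def bradley_terry_loss(score_winner: float, score_loser: float) -> float:
    """Negative log-likelihood that the winner beats the loser.

    Equals -log(sigmoid(score_winner - score_loser)), computed stably.
    """
    margin = score_winner - score_loser
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def predict(score_a: float, score_b: float, tie_threshold: float = 0.1) -> str:
    """Return "a", "b", or "tie". The threshold rule is an illustrative assumption."""
    diff = score_a - score_b
    if abs(diff) < tie_threshold:
        return "tie"
    return "a" if diff > 0 else "b"


print(bradley_terry_loss(2.0, 0.0))  # rater agrees with the human label: small loss
print(bradley_terry_loss(0.0, 2.0))  # rater disagrees: large loss
print(predict(1.0, 1.05))            # gap below the threshold
```

With a rule like this, the "tie threshold variance across annotators" item at 11:35 would correspond to different people implicitly using different values of `tie_threshold`.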
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@demisama_ Most probably because enough papers had already been accepted.
0
0
3
2.8K
Demi Wang
Demi Wang@demisama_·
all positive scores still got rejected by #ICML2026 😢
33
9
339
77.2K
Mike Bespalov
Mike Bespalov@bbssppllvv·
Agents make ugly UIs because they've never seen good design. We've been fixing that, 2,000 DESIGN.md files from the world's best products, structured for a model to read and learn. Colors, type, spacing, layouts and more. Free. styles.refero.design
207
887
10.5K
1.5M
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@iraszl But you can always fine-tune them. And fine-tuned small models always beat large frontier ones.
0
0
0
461
Ivan Raszl
Ivan Raszl@iraszl·
Thinking of running a local LLM on a new MBP? Here is the level of intelligence you can get with various memory configurations on open models:
🐹 16–24GB RAM → ≈ GPT-3.5
🐕 32–48GB RAM → ≈ higher-end GPT-3.5
🐅 64GB RAM → ≈ lower-end GPT-4
🐉 96–128GB RAM → ≈ mid-tier GPT-4
All still below newer GPT or Claude models.
51
5
171
38.7K
Nic Wienandt
Nic Wienandt@NicW_AI·
@LottoLabs I have the agent for business use :) wanna be my salesman ?
1
0
1
1K
Lotto
Lotto@LottoLabs·
If you optimized a suite just to finetune and serve qwen 27b with all sota inference tricks, and sold that as the brain with an OS agent to companies, you’d literally print money in the next 6-12 months.
11
6
233
19.3K
Danny Shmueli
Danny Shmueli@dannyshmueli·
@Teknium This saves so much time on manual setup. Glad it's a built-in part of Hermes. I'll run it every night, like a modern-day disk defrag, but for something that will actually speed up my life.
3
0
32
3.6K
Teknium 🪽
Teknium 🪽@Teknium·
Introducing Hermes Curator! The new system built in to Hermes Agent now helps you keep the skills that the self-improvement loop creates in check, by consolidating and pruning automatically. The curator does multiple things:
- Keeps track of how often you use each skill, when it was last updated/created, etc.
- Runs automatically once a week (configurable).
- Uses the analytics plus its own scanning of your skills, and consolidates or prunes them if necessary.
- Skips externally installed skills, built-in skills, and skills you "pin" that you don't want touched. It will only attempt curation over agent-created/updated skills or user-written skills.
- Determines whether skills can be consolidated, pruned, or otherwise made more manageable. It will convert some skills that are too specific into references, templates or scripts for larger/broader skills, or integrate them directly into a consolidation of an existing skill.
You can also disable it entirely in the config.yaml and/or run it manually with `hermes curator run`
Learn more on the docs here: hermes-agent.nousresearch.com/docs/user-guid…
133
160
2.2K
475.3K
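The skip rules in the announcement (leave external, built-in, and pinned skills alone; only touch agent-created or user-written ones) can be sketched as a simple filter. This is not Hermes Agent's code or data model: the `Skill` class, its fields, and the source labels are hypothetical and exist only to illustrate the rule as described in the post.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill record; not the Hermes Agent schema."""
    name: str
    source: str          # assumed labels: "agent", "user", "external", "builtin"
    pinned: bool = False


def eligible_for_curation(skill: Skill) -> bool:
    """Apply the skip rules described in the post to one skill."""
    return skill.source in {"agent", "user"} and not skill.pinned


skills = [
    Skill("summarize-logs", "agent"),
    Skill("deploy-checklist", "user", pinned=True),
    Skill("web-search", "builtin"),
    Skill("pdf-tools", "external"),
]
print([s.name for s in skills if eligible_for_curation(s)])
```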
Vivo
Vivo@vivoplt·
USA has ChatGPT
USA has Grok
USA has Claude
USA has Gemini
USA has Copilot
China has DeepSeek
China has Qwen
China has GLM
China has Kimi
China has MiniMax
What is the rest of the world even doing??
1.2K
329
4.7K
719.3K
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@cOfDirac @eliebakouch True, but "dumb" intelligence is also useful enough to start capitalizing on AI. Similar to how you don't need AlphaZero to beat most humans at chess: Stockfish running on a potato CPU will do. Agency > intelligence in the real world.
1
0
1
30
cOfDirac
cOfDirac@cOfDirac·
@eliebakouch It's undeniable that LLMs currently produce impressive results, but there are tiny cracks on the surface that reveal that they're the furthest thing from any sort of general intelligence. I think this is a fault of how we train them and the data we use.
1
0
0
18
elie
elie@eliebakouch·
i might be very wrong here, but i don't think "no human data, no pre-training" is the right approach to get frontier models or scientific breakthroughs any time soon
Ineffable Intelligence@IneffableLabs

Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us. ineffable.ai

38
11
299
72.5K
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@dbreunig Probably cavemen. But I've yet to find my favorite skill (which would be something that can approximate what ML intern does)
0
0
0
266
Drew Breunig
Drew Breunig@dbreunig·
Drop your favorite Skill below. The one you're most thankful for (could be one you wrote or one you found).
19
0
15
8.4K
spaceCrumbs
spaceCrumbs@CrumbsSpace·
@CV_novel_plume You can get lower loss (~0.2-0.3 ppl) just by setting dropout=0 with Muon. Last time I checked, "most" of the Muon benchmarks used a dropout of 0.01. Would be interested to know if you can replicate this in that benchmark.
1
0
0
245
Yuxin Fang
Yuxin Fang@CV_novel_plume·
I’ve run a lot of experiments on Muon and its variants, and I’d bet that in this setting, the Muon baseline will be very hard to beat.
Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark
Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.
To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: Just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.
The rules are simple: The optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient.
Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast. Beyond just improving the step count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.
This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.
To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.

10
4
62
18.2K
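The benchmark's scoring rule, as described in the quoted post, is the number of training steps needed to reach 3.28 validation loss. A minimal sketch of that metric is below. It is not code from the benchmark repository: the function name, the log format, and the choice of "less than or equal to" as the meaning of "reach" are assumptions, and the example numbers are invented.

```python
from typing import Optional


def steps_to_target(val_log: list[tuple[int, float]], target: float = 3.28) -> Optional[int]:
    """Return the first logged step whose validation loss is at or below target.

    `val_log` is a list of (step, val_loss) pairs in training order.
    Returns None if the run never reaches the target.
    """
    for step, loss in val_log:
        if loss <= target:
            return step
    return None


# Two invented runs, compared by step count rather than wallclock time.
run_a = [(500, 3.90), (1000, 3.45), (1500, 3.27), (2000, 3.21)]
run_b = [(500, 3.95), (1000, 3.60), (1500, 3.40), (2000, 3.30)]
print(steps_to_target(run_a))  # reaches the target at its third checkpoint
print(steps_to_target(run_b))  # never reaches the target in the logged range
```

Because only the step count is scored, how often validation is evaluated limits how finely two optimizers can be told apart under a metric like this.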