Eric Alcaide

1K posts

@eric_alcaide

common prosperity

LLMaxxing Joined September 2016

1K Following 1.2K Followers

Pinned Tweet

Eric Alcaide@eric_alcaide·10 Apr

Wake up honey, new RWKV paper just dropped 🧵⤵️ Paper: arxiv.org/abs/2404.05892 Code: github.com/BlinkDL/RWKV-LM Models: huggingface.co/RWKV (Apache 2.0 license) (1/6)

Eric Alcaide@eric_alcaide·2d

@_weidai Git with CI. What he's describing is Git with CI

Wei Dai@_weidai·2d

Andrej Karpathy on autoresearch with an untrusted pool of workers: "My designs that incorporate an untrusted pool of workers (into autoresearch) actually look a little bit like a blockchain. Instead of blocks, you have commits, and these commits can build on each other and contain changes to the code as you're improving it. The proof of work is basically doing tons of experimentation to find the commits that work." The idea that distributed & permissionless autoresearch ~= proof-of-useful-work remains a high-level intuition for now, but it is extremely intriguing to say the least. Someone needs to take this further. See QT for more on what's missing.

Wei Dai@_weidai

Is it possible to build "proof-of-useful-work" on top of autoresearch? There's already great compute-versus-verification asymmetry that is tunable. Would need a reliable way to generate fresh & independent puzzles (that are still useful). Maybe a dead end, but someone should look into if decentralized consensus with useful work is possible on top of autoresearch. Let me know if you solve this.

Eric Alcaide@eric_alcaide·2d

@teortaxesTex @zephyr_z9 Mimo v2 Pro?

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·2d

CritPt update. Grok 4.20 scores 6.0%. 2x better than DeepSeek V3.2 and almost on par with Speciale. This is massive progress for xAI. Here you can see the best result from ≈every relevant lab. What a beautiful, depressing power law.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media

Eric Alcaide@eric_alcaide·2d

@Kimi_Moonshot @crystalsssup @cursor_ai Never deleting this app 🍿

Kimi.ai@Kimi_Moonshot·2d

Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support. Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ's hosted RL and inference platform as part of an authorized commercial partnership.

Eric Alcaide@eric_alcaide·3d

"China just copies" bros are going to have a hard day huh

Fynn@fynnso

was messing with the OpenAI base URL in Cursor and caught this accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast so composer 2 is just Kimi K2.5 with RL at least rename the model ID

Eric Alcaide@eric_alcaide·3d

@poolsideai @nvidia blackwells are so overpowered it's a shame they're so scarce

poolside@poolsideai·3d

@nvidia's super chips make it possible to move that data off the GPU and pull it back when needed, without the GPU ever having to wait. Our team built this into our training infrastructure and tested it at scale. What used to be the only viable option no longer is.

poolside@poolsideai·3d

Training AI models requires storing temporary data mid-process. That data sits in GPU memory taking up space until it's needed. The standard fix has always been to delete it and redo the work later. It works, but it's wasteful.

Eric Alcaide@eric_alcaide·4d

Europoor is a state of mind !!

Eric Alcaide@eric_alcaide·4d

Faster than activation checkpointing 🔥

SzymonOzog@SzymonOzog_

Poolside blogposts are back! Read all about our recent work on C2C Activation Offloading

Eric Alcaide@eric_alcaide·4d

@oost_marcel No they didn't introduce it. They introduced the DISCUSSION of it. It's about time to make it real.

Marcel van Oost@oost_marcel·4d

🚨𝘽𝙍𝙀𝘼𝙆𝙄𝙉𝙂: European Commission President Ursula von der Leyen unveiled EU–INC, a new framework that lets you launch a company in 48 hours for under €100 Starting a company across the EU today = 27 legal systems, 60+ company structures 🤯 That might be about to change… The European Commission just introduced 𝗘𝗨 𝗜𝗻𝗰., a new optional corporate framework designed to make Europe actually function like one market. Here’s what stands out: → Set up a company in 48 hours → Cost: < €100 → Fully online, no minimum capital → One single framework across all EU countries → Easier share transfers & fundraising → EU-wide employee stock options (huge for talent) Especially the EU-wide stock option plans, taxed only when employees actually sell (instead of when granted) is huge. This makes it far easier for startups to attract and retain top talent, finally putting Europe closer to the US playbook. Source/More info: ec.europa.eu/commission/pre… In short: This is Europe trying to compete with the simplicity of a Delaware C-Corp 🇺🇸 And honestly… it’s long overdue. For years, European founders had 2 choices: 1. Stay local and deal with fragmentation 2. Move to the US to scale 𝗘𝗨 𝗜𝗻𝗰. is trying to remove that trade-off. If executed well, this could be one of the most important structural changes for European startups in decades. What do you think?

Eric Alcaide@eric_alcaide·4d

@EU_Commission Only need to make it real now 👍

European Commission@EU_Commission·5d

We are introducing EU Inc. To make building and growing a business across the EU faster, simpler, and smarter. 🔸 Start a company in less than 48 hours 🔸 No minimum capital requirement 🔸 Fully online and borderless

Eric Alcaide@eric_alcaide·6d

@SzymonOzog_ Kernel magician 🪄

SzymonOzog@SzymonOzog_·6d

Mom you need to see this my kernel got cited by LLaDA!

SzymonOzog@SzymonOzog_

Releasing Alpha-MoE: Megakernel for fast Tensor Parallel Inference! Up to 200% faster execution of MoE layer in SGLang, with 17% higher average throughput on Qwen3-Next-80B, and 10% higher average throughput on DeepSeek Proud to showcase my recent work at @Aleph__Alpha🧵

Eric Alcaide@eric_alcaide·15 Mar

JEPA?

PicoCreator - AI builder @ SF 🌉@picocreator

So we get rid of softmax? Keep the model stable We get AGI?

Eric Alcaide@eric_alcaide·14 Mar

Only the Paranoid Survive amzn.to/40wJVqg

Eric Alcaide@eric_alcaide·14 Mar

The most relaxed prompting day rn

Eric Alcaide retweeted

Pope Leo XIV@Pontifex·5 Mar

Would you imagine what a world without wars would be like? #PrayTogether

Eric Alcaide@eric_alcaide·13 Mar

@yule_gan This is the reason why ES works (eggroll etc)

Yulu Gan@yule_gan·13 Mar

Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu
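To make the mechanism in the tweet above concrete, here is a toy sketch of "perturb once with Gaussian noise, then ensemble". It is not the RandOpt implementation from the linked paper or repository. The two-class linear model, the majority-vote rule, and the names `predict`, `perturb`, `noisy_ensemble_predict`, `n_members`, and `sigma` are all assumptions made for illustration.

```python
import random
from collections import Counter


def predict(weights, x):
    """Toy linear classifier: return the index of the highest-scoring class."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def perturb(weights, sigma, rng):
    """One-step Gaussian perturbation of every weight (no gradients, no learning rate)."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


def noisy_ensemble_predict(weights, x, n_members=16, sigma=0.05, seed=0):
    """Majority vote over independently perturbed copies of the base weights.

    The voting rule is an assumption; the tweet only says the noisy copies are ensembled.
    """
    rng = random.Random(seed)
    votes = Counter(
        predict(perturb(weights, sigma, rng), x) for _ in range(n_members)
    )
    return votes.most_common(1)[0][0]


# Hypothetical "pretrained" weights: class 0 keys on feature 0, class 1 on feature 1.
base = [[1.0, 0.0], [0.0, 1.0]]
label = noisy_ensemble_predict(base, [0.9, 0.1])
```

With a small `sigma` the ensemble agrees with the base model on easy inputs. The tweet's claim is about what happens at LLM scale, which this toy does not test.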

Eric Alcaide@eric_alcaide·12 Mar

@YanagizawaD it's empiricism, not science But it's a start. More to come 🚀

D. Yanagizawa-Drott@YanagizawaD·12 Mar

Beautiful science.

Christine Yip@christinetyip

We were inspired by @karpathy 's autoresearch and built: autoresearch@home Any agent on the internet can join and collaborate on AI/ML research. What one agent can do alone is impressive. Now hundreds, or thousands, can explore the search space together. Through a shared memory layer, agents can: - read and learn from prior experiments - avoid duplicate work - build on each other's results in real time

Eric Alcaide@eric_alcaide·7 Mar

@eliebakouch @SonglinYang4 Sink is data independent. Gate is data dependent

elie@eliebakouch·7 Mar

attention sink and qwen's gated attention are very similar. here's a visual explanation of why and a recap of different attention sink variant
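One way to read the "data independent vs. data dependent" reply above, as a minimal sketch: a sink adds a fixed learned logit to the softmax denominator, while a gate rescales the attention output by a sigmoid computed from the current input. The functions `softmax_with_sink` and `sigmoid_gate`, the scalar gate, and all numbers are illustrative assumptions, not code from Qwen or any other model.

```python
import math


def softmax_with_sink(logits, sink_logit):
    """Softmax over logits plus one extra sink logit; returns only the real-token weights.

    sink_logit is a learned constant, so it does not depend on the input.
    """
    m = max(max(logits), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


def sigmoid_gate(x, gate_weights):
    """Scalar gate computed from the current input x, so it changes per token."""
    z = sum(w * xi for w, xi in zip(gate_weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Sink: the real tokens' weights sum to less than 1; the sink absorbs the rest.
weights = softmax_with_sink([0.0, 0.0], sink_logit=0.0)
leftover = 1.0 - sum(weights)

# Gate: same gate parameters, different inputs, different scaling.
g_a = sigmoid_gate([1.0, 0.0], [2.0, -2.0])
g_b = sigmoid_gate([0.0, 1.0], [2.0, -2.0])
```

Both mechanisms let a head output less than a full weighted average of the values. In this sketch the sink parameter is the same for every input (although the mass it absorbs still varies with the other logits), whereas the gate value is recomputed from each input.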

Eric Alcaide@eric_alcaide·3 Mar

@JustinLin610 thx for everything

Junyang Lin@JustinLin610·3 Mar

me stepping down. bye my beloved qwen.

Eric Alcaide retweeted

Alexander Doria@Dorialexander·1 Mar

For me the biggest limitation is that diffusion models don't batch well: each request is a different denoising step and you can't reuse kv cache.

Kawin Ethayarajh@ethayarajh

Autoregressive LLMs will likely remain dominant for three reasons: 1) As @ducx_du has pointed out, left-to-right and right-to-left orderings of language have a much lower loss floor than all other orderings. This suggests that language is (for the most part) locally dependent. The additional capacity and compute needed to model all possible orderings would be more effectively spent in a traditional AR setup. 2) When people say models should be able to generate text in any order, what they really want is to generate *concepts* in any order, not tokens. But we can already do this! If your model has sufficient depth, it can generate some concepts in latent space before others. The rise of reasoning models means that concepts can both be explored in an arbitrary order and in a way that is interpretable. If you take this to the limit, you get Reinforcement Learning Pretraining. 3) AR models won the hardware lottery / software lottery / other lotteries wherein everything in the ecosystem have bent around them. Unless there are several OOMs of benefits to be gained from switching to another paradigm, it is unlikely that there will be any switch. And because language is the universal glue around multiple modalities, it is likely to make generation in other modalities AR to enable end-to-end learning even if those other modalities would benefit from a non-AR model.

Eric Alcaide@eric_alcaide·26 Feb

@giffmana almost nothing changes. hypermaxxed inits can be found for virtually any arch.

Lucas Beyer (bl16)@giffmana·26 Feb

soooo... how many papers do we think are invalidated by this? And now think about how many other bugs there must be in any re-implementations of... basically anything.

Mayank Mishra@MayankMish98

We identified an issue with the Mamba-2 🐍 initialization in HuggingFace and FlashLinearAttention repository (dt_bias being incorrectly initialized). This bug is related to 2 main issues: 1. init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has been already fixed: github.com/fla-org/flash-…). 2. Skipping initialization due to meta device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging). The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma… Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping in merging the PR.
