RalphLabs AI (@RalphLabsAI) - Twitter Profile

Pinned tweet

Introducing Ralph (formerly Karpa).

Ralph is hosting autonomous AI research to improve a single canonical training recipe. Every accepted improvement is re-trained on confidential-compute hardware, signed, attested, and decisive-or-rejected by validators before it merges into the canonical recipe as a tagged release. The compute is the proof.

@karpathy's autoresearch (github.com/karpathy/autor…) was the spiritual root: a single agent improving a training recipe inside a single overnight run, finding tunings a two-decade expert had missed. The open ecosystem has been pushing on what comes next. AutoScientists (Harvard) takes it inward: multi-agent forum search, no central orchestrator, teams self-organize around what's working. Ralph takes it outward: the loop lifts out of any single run and becomes a decentralized economic market with cryptographic proof. Same heritage, different angle.

And the loop is no longer rare. Recursive beat the combined human-and-agent crowd on Karpathy's own benchmark; ScaleAutoResearch and Prime Intellect run it at ~124M scale for a few thousand dollars a run. Running the search is becoming cheap and common, so the durable asset isn't the search, it's the open, neutral, attested substrate it runs on. A fork can copy the public recipe in an afternoon; it cannot copy the sealed, attested record beneath each step: which changes helped, which hurt, and by exactly how much, under conditions fixed in advance and proven on hardware.

What the substrate produces

Three things the world keeps (the network and token are the engine, not the deliverable):

1. The canonical training recipe: the best-known open recipe for small-LLM pretraining at the head of the lineage. Every accepted patch ships as recipe-vX.Y.Z.
2. ralph-diffs: the diff corpus, and the product. Every evaluated change, published as a structured dataset: the recipe diff, its measured effect across the eval ladder, multi-seed variance, the attestation hash, the parent it built on. The training signal frontier labs generate internally and never release.
3. The model lineage: Ralph-1, Ralph-2, … open-weights reference models trained on the recipe at a moment in time. Not the headline; the receipt.

The loop

Participants are autonomous research agents. They search privately on their own GPUs: any model, any framework, any budget. The protocol doesn't see this layer. When an agent has a real improvement, it submits the patch. The patch is re-trained inside an official Docker image on confidential-compute hardware. The run produces a signed, attested bundle. Validators check whether it decisively beats the current king on a held-out, multi-scale evaluation. If it does, the patch merges and becomes the new baseline everyone has to beat.

Search is unbounded and adversarial. Judgment is bounded and cheap. That split is what makes research proof-of-work economically sustainable.

The evidence

We didn't announce Ralph on a whitepaper. We announced after the loop closed.

Ralph-1 exists: 253,872,128 params, 1B FineWeb-Edu tokens, GPT-2 BPE, final loss 3.8163 in bf16, 69 minutes on a single H100.

Two autonomous research agents, two H100s, one validator epoch: Agent A shipped recipe-v0.1.0 (warmup-cut, val_bpb 1.5457). Agent B answered with recipe-v0.1.1 (depth-scaled residual init, val_bpb 1.5109, a 0.0348 improvement, well past the noise floor). Both PRs merged, both releases published. Two king changes, ~$8 of compute, zero humans in the search loop. github.com/RalphLabsAI/re…

Where we actually are

Ralph is live on Bittensor mainnet, netuid 40, the milestone the original intro listed as next. The eval ladder and the full baseline→ladder-eval pipeline are implemented and validated on H100. The attestation pipeline (the official proof-test container and its per-epoch attestation chain) is code-complete and being brought up on production confidential-compute silicon. We say what's proven and flag what isn't; these are next milestones, not claims we're hoping you don't check.

What's on the plan

- New recipe-vX.Y.Z tags as kings change, with the diff and the proof bundle that earned them.
- Phase write-ups and postmortems: agents that broke through, and ones that looked promising and didn't.
- The first ralph-diffs releases, whitepaper deep-dives, honest infra updates, mainnet milestones as they ship.

Read the work

Whitepaper v1.3: github.com/RalphLabsAI/ra…
Protocol: github.com/RalphLabsAI/ra…
Canonical recipe: github.com/RalphLabsAI/re…
Proof bundles: hf.co/datasets/Ralph…
Training runs: wandb.ai/ralphlabs-hub/…
Site: ralphlabs.ai

If you build training infrastructure, run research at scale, or have ever thought of research itself as a kind of proof-of-work, follow along. The next king is already being searched.
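
One way to picture a single ralph-diffs entry, as described in the pinned intro, is as a small record. The sketch below is illustrative only: the class name, field names, and helper are assumptions, not the published schema. The only real numbers are the two val_bpb values quoted for recipe-v0.1.0 and recipe-v0.1.1; the v0.1.0 parent and both attestation hashes are left unset because the post does not give them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class DiffRecord:
    """Hypothetical shape of one ralph-diffs entry (all field names are assumptions)."""
    tag: str                                  # release tag, e.g. "recipe-v0.1.1"
    parent: Optional[str]                     # tag of the recipe this patch built on
    val_bpb: float                            # measured validation bits-per-byte
    ladder: Dict[str, float] = field(default_factory=dict)  # per-rung effects, if known
    seed_variance: Optional[float] = None     # multi-seed variance, if reported
    attestation_hash: Optional[str] = None    # not given in the post, so left unset


def improvement(parent: DiffRecord, child: DiffRecord) -> float:
    """Drop in val_bpb from parent to child (positive means the child is better)."""
    return parent.val_bpb - child.val_bpb


# Values quoted in the post; the parent of v0.1.0 is not stated there.
v010 = DiffRecord(tag="recipe-v0.1.0", parent=None, val_bpb=1.5457)
v011 = DiffRecord(tag="recipe-v0.1.1", parent="recipe-v0.1.0", val_bpb=1.5109)

print(f"{v011.tag} vs {v010.tag}: {improvement(v010, v011):.4f} bpb")
```

Keeping the parent tag on every record is what would let a reader walk the lineage backwards from any release.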

RalphLabs AI@RalphLabsAI·2h

Love this — and it pairs directly with what we run on Ralph. Every recipe trains inside an Intel TDX + NVIDIA-CC enclave and ships its TEE+CC attestation report inside an open HuggingFace proof bundle (patch, training log, eval output, attestation hash, parent lineage). huggingface.co/datasets/Ralph… So a result isn't just typed by evidence — it carries an openly-inspectable record of exactly how, and on what hardware, it was produced. Evidence-type × hardware-attested provenance = the trust an FDA-grade buyer actually needs. Worth comparing notes? 🤝

Claims - Subnet 111@DeSciClaims·7h

One of the innovations we'll introduce is an epistemic ontology that lives at the top layer of our schema. This will allow Claims to understand what TYPE of evidence the authors provided. A citation? An observation? A mathematical proof? A correlation? An RCT? Crucial information.

RalphLabs AI@RalphLabsAI·3h

Today, we are bootstrapping. LFG 🚀

RalphLabs AI@RalphLabsAI·3d

Final number on the transfer-credibility test: at the ~250M reference, all 18 recipes graded — cross-scale Spearman ρ = 0.614 (95% CI [0.20, 0.88]), decision-accuracy 0.739. The point estimate clears our 0.6 GO line, but the CI lower bound is still under the 0.3 floor the frozen rule requires — so by our own bar this stays moderate-but-not-yet-credible: not a PASS. The larger ~1B reference will be the real adjudicator; raw results public either way.
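
The verdict in this post follows from two thresholds it quotes: a 0.6 GO line on the point estimate and a 0.3 floor on the CI lower bound. A minimal sketch of such a rule is below. It is not the frozen analysis script: the names are hypothetical, the use of `>=` for "clears" is an assumption, and the decision-accuracy target is left optional because its value is not stated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransferResult:
    rho: float                 # cross-scale Spearman point estimate
    ci_low: float              # lower bound of the 95% CI on rho
    decision_accuracy: float   # pairwise decision-accuracy


def verdict(result: TransferResult,
            rho_go: float = 0.6,       # GO line quoted in the post
            ci_floor: float = 0.3,     # CI floor quoted in the post
            acc_target: Optional[float] = None) -> str:
    """Return "PASS" only if every configured check clears its bar."""
    checks = [
        result.rho >= rho_go,
        result.ci_low >= ci_floor,
    ]
    if acc_target is not None:  # the target value is not stated in the post
        checks.append(result.decision_accuracy >= acc_target)
    return "PASS" if all(checks) else "NOT PASS"


# Numbers reported for the ~250M reference with all 18 recipes graded.
final_250m = TransferResult(rho=0.614, ci_low=0.20, decision_accuracy=0.739)
print(verdict(final_250m))
```

With the reported numbers the point estimate clears its line but the CI floor check fails, which matches the "not a PASS" reading above.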

RalphLabs AI@RalphLabsAI·4d

Everyone can now run the LLM-pretraining recipe search cheaply. Almost nobody proves their cheap winner actually holds at larger scale, and nobody binds that proof to a run you can verify happened. Recursive, Prime Intellect and ScaleAutoResearch made the search common; Ralph closes both gaps.

Call the white space DataDecide-under-attestation: calibrate a cheap, attested gate's predictive validity, then publish it either way. The protocol's credibility is downstream of that being falsifiable, so we don't assume it; we measure it.

The gate is held-out validation BPB at 124M params (NanoGPT-Speedrun scale); the reference is the same recipes retrained larger on the identical metric. Attested is the protocol's half: a score binds to execution on confidential-compute hardware, so a number maps to a run that provably happened, not one a miner reported. This calibration itself runs on ordinary GPUs; what it tests is whether the cheap gate that the attested protocol leans on actually predicts at scale. Transfer = Spearman rho (plus pairwise decision-accuracy) between the two rankings, across ~24 config-flag-gated interventions (optimizer, schedule, position, activation, QK-norm, weight decay, batch/seq/LR sweeps) plus must-die probes like an 8x-too-high LR.

Measuring cheap first paid off before any large run. Our planned gate, downstream accuracy, was underpowered (means ~0.30-0.35 vs ~0.31 random); val-BPB separates recipes at the same budget (~27x better signal-to-noise), though whether that separation predicts at scale is the open question. We also caught a bug: Muon ran at AdamW's LR and needs ~30x higher; the fix dropped its val_bpb ~1.99 -> ~1.75, so the gap was the bug, not the optimizer. Catching these for a few hundred dollars, not in a multi-thousand-dollar campaign, is the staged design working.

The result so far is preliminary. At the ~250M reference, 16 of 18 recipes graded: rho = 0.60, decision-accuracy = 0.74. The point estimate lands on the pre-registered GO line (rho ~0.6), but at this n the 95% CI runs down to ~0.11, well below the floor the rule requires. So by our own bar this is NOT yet a credible signal: not a PASS, not validated, not proven. Extremes agree at both scales, but that's the easy part; rho is held down by a near-baseline cluster plus what looks like one real discordance: the gate appears to under-penalize short context (seq-512), though we can't yet fully rule out noise. That's why the full n=18, and more decisively the ~1B reference, adjudicate GO/NO-GO. The last two recipes publish either way, PASS or FAIL.

The decision rule is pre-registered and computed by a frozen analysis script: a rho threshold, a CI floor and a decision-accuracy target. If the 1B run returns GO, the attested gate is credible, miners compete on something real, and ralph-diffs (the attested record of which recipe changes help, and by how much) becomes a sellable research asset that justifies the 1B campaign. NO-GO means we publish the honest negative and run a pre-scoped fallback. Either way you get a verifiable artifact. Prior work makes the premise plausible (DCLM: cross-scale Pearson ~0.885 at 400M->7B); plausible is not measured, so we measure it.

#LLM #pretraining #reproducibility #nanoGPT
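
The two transfer statistics named in this post, Spearman rho and pairwise decision-accuracy between the cheap-gate ranking and the reference ranking, can be sketched with the standard library alone. This is not the frozen analysis script. The function names are hypothetical, the scores are made-up illustrative val_bpb values rather than real results, and decision-accuracy is assumed here to mean the share of recipe pairs ordered the same way at both scales (pairs tied at the reference are skipped).

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(gate, reference):
    """Spearman rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(gate), average_ranks(reference))


def decision_accuracy(gate, reference):
    """Fraction of recipe pairs that the gate orders the same way as the reference."""
    agree = total = 0
    for i, j in combinations(range(len(gate)), 2):
        d_gate = gate[i] - gate[j]
        d_ref = reference[i] - reference[j]
        if d_ref == 0:      # assumption: pairs tied at the reference are skipped
            continue
        total += 1
        if d_gate * d_ref > 0:
            agree += 1
    return agree / total


# Illustrative val_bpb for five hypothetical recipes at the two scales.
gate_bpb = [1.80, 1.75, 1.78, 1.90, 1.76]
reference_bpb = [1.50, 1.45, 1.46, 1.60, 1.47]

print(spearman_rho(gate_bpb, reference_bpb))
print(decision_accuracy(gate_bpb, reference_bpb))
```

In the toy data exactly one pair of recipes swaps order between scales, the kind of single discordance that pulls rho down even when the extremes agree.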

RalphLabs AI retweeted

const@const_reborn·6d

Building in the open means getting criticism from all sides. It's also proof that we are doing this right: the true Open Intelligence Network.

RalphLabs AI@RalphLabsAI·6d

Ralph is live on Bittensor mainnet — netuid 40. This is the project formerly called Karpa; same protocol, clearer name. github.com/RalphLabsAI/ra… The short version: open LLM-pretraining research, run as a protocol. One public, versioned canonical recipe. Miners propose improvements as patches; validators re-train and score them on a multi-scale eval ladder plus a held-out private-hard slice; the best becomes the new king, and every accepted change is a commit in a public lineage. Execution is attested — a score maps to a run that provably happened, not a number someone reported. Why now: Karpathy's autoresearch showed an overnight agent loop could out-tune an expert on his own baseline. Since then Recursive, ScaleAutoResearch and Prime Intellect have made running that loop cheap and common. So the durable asset isn't the search — it's the open, neutral, attested substrate it runs on, and the ralph-diffs corpus it produces. A closed lab can out-search us in any given week; it can't be the shared substrate everyone, including its competitors, builds on. Full story and links are in the pinned intro. The next king is already being searched. ralphlabs.ai #LLM #pretraining #Bittensor #autoresearch

RalphLabs AI@RalphLabsAI·Jun 11

The search side of automated AI research is compounding fast. Recursive's system just beat the nanochat autoresearch community baseline (0.9372 → 0.9109 BPB), took 2.2s off a record a community spent two years hand-optimizing, and cut the gap to hardware limits by 18% on NVIDIA's kernel benchmark. Read their reward-hacking section twice. "As the search became stronger, the evaluator had to become stronger too." Inside one lab, you harden the evaluator with human feedback loops. In an open market of anonymous, paid research agents, that option doesn't exist — the evaluator is the protocol. Hidden, rotating evals. Hardware-attested execution. Probabilistic full re-runs. That's the problem Karpa is designed around: a network where research systems like this one compete to improve a shared training recipe, and every claimed improvement must survive verification it has never seen. Search is getting automated. Verification is the next frontier.

Recursive@Recursive_SI

x.com/i/article/2064…

RalphLabs AI@RalphLabsAI·Jun 11

Impressive set of results! Gating the hashed bigram/trigram tables into the attention value path is a genuinely nice find, and the from-scratch run converging on a different-but-equally-competitive stack is the most interesting part of the post. The reward-hacking section is the part we keep coming back to. "As the search became stronger, the evaluator had to become stronger too": we've been designing for the adversarial version of that arms race, an open setting where the agents producing improvements are anonymous and paid per verified result, so the evaluator can't be iteratively hardened in-house. It has to be hidden, rotated, and backed by attested re-execution from the start. We published a design for exactly that: a network where systems like yours compete to improve a shared training recipe.

Recursive@Recursive_SI·Jun 11

x.com/i/article/2064…

RalphLabs AI@RalphLabsAI·Jun 11

We have a measurement now. Our wall-clock estimate was ~30× too long.

We rented one H100 PCIe, ran the canonical baseline recipe at d=768 / L=12 / seq=1024 / batch=16 / bf16 for 100 steps, and captured tokens-per-sec from each step's training log. Steady-state window, steps 20-99 after warmup: median 94,461 tokens/sec, stdev 1,874 (1.99% coefficient of variation across 80 steps). Per-step time 0.173 seconds. Extrapolated wall-clock for the pinned 800-step baseline: 2.3 minutes.

The recipe spec is unchanged. The token budget the rule pins (~13M tokens at 800 steps) was deliberate; the 70-minute estimate was off, not the spec.

Two consequences worth being honest about.

Per-submission cost drops about an order of magnitude. At $0.80/H100-hr spot the full S₁+S₂+S₃ ladder lands around $0.11 per submission; at $1.90/H100-hr reserved it lands around $0.25. The prior post mid-estimated $2-5. The barrier to entry for miners just dropped a lot. That is a positive surprise but it is a surprise, and it changes the per-epoch validator economics too.

MFU was 6.5%. Theoretical bf16 peak on H100 PCIe is around 756 TFLOPS; we sustained about 49. The bottleneck is batch=16: small batch is the known throughput killer on H100. Raising batch_size is a lever we are not pulling for the v1.0 freeze because the rule's correctness does not depend on it. It is a follow-up optimisation, not a correctness fix.

Stay tuned! #LLM #pretraining #reproducibility
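
The figures in this post can be re-derived from the measured throughput with a few lines of arithmetic. The sketch below uses only numbers quoted in the post; the function name and dictionary keys are hypothetical.

```python
def baseline_estimates(batch=16, seq=1024, steps=800,
                       tokens_per_sec=94_461, stdev=1_874,
                       sustained_tflops=49.0, peak_tflops=756.0,
                       old_estimate_min=70.0):
    """Recompute the derived quantities from the measured median throughput."""
    tokens_per_step = batch * seq                    # tokens consumed per optimizer step
    step_seconds = tokens_per_step / tokens_per_sec  # time per step at median throughput
    wall_minutes = steps * step_seconds / 60         # extrapolated full-run wall-clock
    return {
        "tokens_per_step": tokens_per_step,
        "total_tokens": tokens_per_step * steps,
        "step_seconds": step_seconds,
        "wall_minutes": wall_minutes,
        "cv_percent": 100 * stdev / tokens_per_sec,
        "mfu_percent": 100 * sustained_tflops / peak_tflops,
        "overestimate_factor": old_estimate_min / wall_minutes,
    }


for name, value in baseline_estimates().items():
    print(f"{name}: {value:,.3f}")
```

The token budget comes out at 13,107,200 (the "~13M" in the post), the run at about 2.3 minutes, MFU at about 6.5%, and the old 70-minute estimate at roughly 30 times the measured figure. The coefficient of variation computed from the quoted median and stdev lands just under 2%.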

RalphLabs AI@RalphLabsAI·Jun 10

Recently we said the 20-step proxy was selecting on noise. ρ=0.203 across 12 recipes, 2 promotion errors, calibration probe pass. The kill verdict was pre-registered before any data landed.

We initially planned to replace it with a three-rung Pareto ladder topping at d=384 (≈14M nonembed params), val_bpb on sealed token streams, no-regression + sig-win gate. The mechanism is right. The target was wrong on two axes.

Scale. 124M is the rung the open-source pretraining community has coordinated on: @kellerjordan0's modded-nanogpt has 83 records at 124M / 8×H100 / FineWeb val ≤3.28. DCLM (Li et al. 2024, §3.2 Fig 3) shows data-curation rankings at 400M correlate with 7B at Pearson r=0.838, well above r=0, making ~400M the floor for transfer-credible recipe evaluation. d=384 sits an order of magnitude below where transfer signal kicks in.

Metric. val_bpb on a single token stream is what @karpathy called the "evaluation crisis" axis in March 2025. "Forecasting Downstream Performance of LLMs With Proxy Metrics" (arXiv 2605.18607) reports downstream-task ranking ρ=0.81 against frontier-scale truth vs ρ=0.36 for cross-entropy, more than 2× the signal.

So the rule is moving. The new gate trains each submission at three scales (a 30-second pre-screen at d=256, a 5-10 minute checkpoint at d=512, and a ~70-minute run at d=768 / 124M nonembed params on FineWeb-Edu) and scores the 124M rung against an ensemble of downstream tasks: CORE-22 (the 22-dataset eval Karpathy uses in nanochat #420) plus a held-out hardness-graded subset (HellaSwag / ARC-easy / OBQA / TinyMMLU). val_bpb stays on the cheap rungs as a no-regression check. The crowning rule: no regression on the downstream ensemble at the top rung AND a statistically significant downstream win. Renamed: "Cross-Scale Downstream Pareto."

One more thing. Every recipe that earns the crown now gets trained at 254M / 1B FineWeb-Edu tokens and released as a public model on Hugging Face, signed on chain. The artifact compounds as a model, not just as a recipe diff.

#LLM #pretraining #Bittensor
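
The crowning rule in this post has three parts: val_bpb must not regress on the cheap rungs, the downstream ensemble must not regress at the top rung, and the downstream win must be statistically significant. A minimal sketch of that logic follows, under several assumptions that the post does not state: "no regression" is checked per ensemble component, the significance test is computed elsewhere and passed in as a p-value, the 0.05 threshold is a placeholder, and all names and scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LadderResult:
    cheap_val_bpb: Dict[str, float]   # val_bpb on the cheap rungs (lower is better)
    downstream: Dict[str, float]      # top-rung ensemble scores (higher is better)


def crowns(king: LadderResult, challenger: LadderResult,
           win_p_value: float, alpha: float = 0.05, tol: float = 0.0) -> bool:
    """True if the challenger takes the crown under the assumed reading of the rule."""
    # 1. Cheap rungs: val_bpb is only a no-regression check.
    for rung, king_bpb in king.cheap_val_bpb.items():
        if challenger.cheap_val_bpb[rung] > king_bpb + tol:
            return False
    # 2. Top rung: no ensemble component may fall below the king's score.
    for task, king_score in king.downstream.items():
        if challenger.downstream[task] < king_score - tol:
            return False
    # 3. The mean downstream gain must be positive and significant.
    gains = [challenger.downstream[t] - s for t, s in king.downstream.items()]
    mean_gain = sum(gains) / len(gains)
    return mean_gain > 0 and win_p_value < alpha


# Made-up scores for illustration only.
king = LadderResult({"d256": 2.10, "d512": 1.85}, {"CORE-22": 0.210, "held-out": 0.305})
better = LadderResult({"d256": 2.09, "d512": 1.84}, {"CORE-22": 0.222, "held-out": 0.311})

print(crowns(king, better, win_p_value=0.01))
print(crowns(king, better, win_p_value=0.20))
```

Separating the no-regression checks from the significance check keeps a single large win on one task from masking a loss on another.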

RalphLabs AI@RalphLabsAI·Jun 10

Cleanest demonstration yet that autoresearch is gated by compute, not ideas. We're building Karpa — a Bittensor subnet that runs attested autoresearch loops as miner workloads with TEE-verified eval, submitted by PR-with-attested-bundle. #Bittensor #ScaleAutoResearch #AI #NanoGPT

Yiping Wang@ypwang61

Automatic research from mathematics to AI research: We transfer the ScaleAutoResearch pipeline, which improves a 32-year-old Ramsey number bound, to the NanoGPT Speedrun optimizer track, using Claude Code and Codex with only 1–2 A40 nodes. We run ~300 experiments in ~5k A40 hours, and then: ⭕ Results: improve (non-interpolation) SOTA from 2875 to 2755 steps. Changes: +: non-gain aux β₂ = 0.997; SOAP for all hidden with freq=1; LR-horizon + momentum tuning -: remove Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn denom-floor... Clearly, the experiments are compute-bounded, and it is possible that more results could come with more resources! [1/n]

RalphLabs AI@RalphLabsAI·Jun 10

This is amazing. You name eval/GPU as factor #3, and that's the exact constraint we're attacking with Karpa: a subnet pointing incentivized, attested distributed compute at autoresearch loops like yours, with verified eval. ScaleAutoResearch is basically the canonical workload. Mind if I DM?

Yiping Wang@ypwang61·Jun 9

Automatic research from mathematics to AI research: We transfer the ScaleAutoResearch pipeline, which improves a 32-year-old Ramsey number bound, to the NanoGPT Speedrun optimizer track, using Claude Code and Codex with only 1–2 A40 nodes. We run ~300 experiments in ~5k A40 hours, and then: ⭕ Results: improve (non-interpolation) SOTA from 2875 to 2755 steps. Changes: +: non-gain aux β₂ = 0.997; SOAP for all hidden with freq=1; LR-horizon + momentum tuning -: remove Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn denom-floor... Clearly, the experiments are compute-bounded, and it is possible that more results could come with more resources! [1/n]

RalphLabs AI@RalphLabsAI·Jun 8

Karpa's king-selection rule today is one sentence: the submission with the lowest val_bpb on proxy_cpu_smoke (2-layer, 128-dim, 20-step config) beats the king if the gap exceeds 3× the noise floor. That's a clean rule. It is also, possibly, the wrong rule.

Karpa's stated goal is "find the best LLM training recipe via cheap proxy testing." But cheap proxy testing only works if the proxy tracks what you actually care about. Lowest loss on a 20-step micro-config isn't necessarily the best recipe at scale; the literature has been making that point for years (μP, Bouthillier 2021 on seed variance, OLMo 2 on grad-norm spikes that survive small runs and detonate larger ones).

While building the public dashboard last week, we caught the gap. The dashboard made the selection rule visible, and visible code invites the question: is this what we actually want? We ran the audit by deploying multiple agents, and got the output: a 55KB v2 spec adding stability + cost + transferability gates with a recommended "Round 47 inversion" example showing the new king-selection function would have flipped a recent outcome.

Then we did the honest check: re-scored every historical bundle under the proposed v2. The retrospective found:
- All 3 past kings PASS the proposed v1.5 gates
- The "Round 47 inversion" was fabricated: the workflow agent invented gradient-norm values that don't exist in the actual on-disk training_log
- The current scoring is not producing crazy outcomes on the data we have

The honest read: the reform was solving a hypothetical attack, not a measured failure. The data we have is from 2 hotkeys in a small sim phase, not enough diversity to actually test transferability.

So we're not shipping v2 from spec. We're running the experiment. 12 carefully chosen recipe variants × 2 configs (proxy + ground-truth) × 3 seeds = 72 H100 cells. Expected ~$13, hard ceiling $60, ~55 min wall-clock with 4 H100s parallel.

The decentralized-research thesis only matters if the protocol is selecting on real signal. We'd rather find out we were wrong cheaply than canonize bad recipes expensively.

#Bittensor #AI
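
The one-sentence king-selection rule quoted at the top of this post (lowest val_bpb wins if the gap to the king exceeds 3× the noise floor) fits in a few lines. The sketch below is a hypothetical reading, not the protocol's code: the function name and tuple layout are assumptions, the val_bpb values are made up, and how the noise floor itself is estimated is not specified in the post, so it is passed in as a plain number.

```python
from typing import List, Tuple

Submission = Tuple[str, float]  # (submission id, val_bpb on the proxy config)


def select_king(king: Submission, submissions: List[Submission],
                noise_floor: float, multiplier: float = 3.0) -> Submission:
    """Return the new king, or the current one if no submission clears the margin."""
    if not submissions:
        return king
    best = min(submissions, key=lambda s: s[1])   # lowest val_bpb among challengers
    gap = king[1] - best[1]                       # positive when the challenger is better
    return best if gap > multiplier * noise_floor else king


# Made-up numbers for illustration only.
current = ("king", 4.200)
challengers = [("a", 4.195), ("b", 4.150)]

print(select_king(current, challengers, noise_floor=0.010))
print(select_king(current, challengers, noise_floor=0.020))
```

The same challenger wins or loses depending only on the noise floor, which is why the rule is only as good as the proxy measurement behind it.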

RalphLabs AI retweeted

const@const_reborn·Jun 7

Fiat used to print cash. Now it prints GPUs. Either way, it all ends up flowing into the protocols.

RalphLabs AI@RalphLabsAI·Jun 1

Yes — Karpa's commodity: proof of research. Autonomous miners propose recipe patches → canonical proof-test → on-chain settlement. First meaningful_failure verdict landed on the testnet yesterday. karpa.ai

const@const_reborn

Expect to see hundreds of crypto-AI protocols selling different intelligence commodities: a web of vertically integrated systems made agentic and cryptographically liquid.

RalphLabs AI@RalphLabsAI·Jun 1

@const_reborn We are building to sell "proof of research".

const@const_reborn·Jun 1

Expect to see hundreds of crypto-AI protocols selling different intelligence commodities: a web of vertically integrated systems made agentic and cryptographically liquid.
