Guide Labs

40 posts

@guidelabsai

Engineering interpretable AI systems that are easy to understand, trust, and debug.

San Francisco, CA · Joined January 2023
3 Following · 853 Followers
Pinned Tweet
Guide Labs @guidelabsai
It’s official: the first large-scale inherently interpretable language model is here. Steerling-8B from @guidelabsai is the first and largest model that can trace every token it generates back to:
→ Input context
→ Training data
→ Human-understandable concepts
In other words, we’ve trained Steerling-8B to trace its outputs and explain what influenced each decision, enabling more reliable control.
This isn’t post-hoc explainability. Interpretability is built directly into the model.
🔓 Steerling-8B can self-monitor for memorized content and suppress it at inference time without retraining. That makes interpretability a first-class design principle, not an afterthought.
This is a major step toward models we can actually understand, debug, and trust. Over the coming days, we’ll share investigations into what Steerling-8B’s interpretability enables in practice. Stay tuned as we dive deeper into our research and how we are building LLMs we can trust.
🚨 Try it LIVE and help improve it:
Guide Labs: guidelabs.ai/post/steerling…
GitHub: github.com/guidelabs/stee…
Hugging Face: huggingface.co/guidelabs/stee…
Huge thank you to @TimFernholz and @TechCrunch for featuring this breakthrough. techcrunch.com/2026/02/23/gui…
#Steerling8B #GuideLabs #AI #MachineLearning
Guide Labs @guidelabsai
Why this matters: Models that expose their internal reasoning are models we can actually debug.
→ Alignment becomes less like reshaping a black box and more like debugging software.
💡 With Steerling-8B, Guide Labs is driving a fundamental shift in how we build safe AI.
Full post: guidelabs.ai/post/steerling…
#Steerling8B #GuideLabs #AI
Guide Labs @guidelabsai
Together these capabilities point toward a different paradigm for AI alignment. Instead of repeatedly retraining and fine-tuning opaque models, we now have the ability to:
→ Trace behaviors to their source
→ Inspect the concepts involved
→ Intervene directly at inference time
Guide Labs @guidelabsai
Aligning AI today works like this:
Observe an undesirable output → Collect examples → Retrain/fine-tune → Evaluate → Repeat
It’s slow, often expensive, and gives very little insight into why the model behaved that way.
Today, with Steerling-8B, we’re unveiling a new two-stage approach: alignment without retraining.
Guide Labs @guidelabsai
Together, these artifacts let anyone explore pretraining data at the concept level: querying, filtering, and analyzing a 10B-token corpus through human-understandable concepts. For the first time, you'll be able to explore pretraining data by meaning, not just by domain:
☑️ Query and filter data by semantic concepts
☑️ Analyze co-occurring topics and tones
☑️ Train and steer models using concept mixtures
☑️ Audit what models are actually exposed to
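To make the idea of concept-level querying concrete, here is a minimal toy sketch. The chunk schema and function names are hypothetical illustrations, not the released Concept Atlas format or API:

```python
# Toy sketch of querying a concept-annotated corpus (in the spirit of
# a concept atlas over pretraining data). The schema below is a
# hypothetical illustration, not the actual released format.
from collections import Counter

chunks = [
    {"text": "The mitochondria is the powerhouse of the cell.",
     "concepts": {"biology", "education"}},
    {"text": "def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)",
     "concepts": {"programming", "recursion"}},
    {"text": "Photosynthesis converts light into chemical energy.",
     "concepts": {"biology", "chemistry"}},
]

def filter_by_concept(chunks, concept):
    """Return only the chunks annotated with the given concept."""
    return [c for c in chunks if concept in c["concepts"]]

def cooccurring(chunks, concept):
    """Count which other concepts co-occur with `concept`."""
    counts = Counter()
    for c in filter_by_concept(chunks, concept):
        counts.update(c["concepts"] - {concept})
    return counts

print(len(filter_by_concept(chunks, "biology")))  # 2
print(dict(cooccurring(chunks, "biology")))
```

The same pattern scales naturally from three in-memory chunks to an indexed 95M-chunk corpus: filtering is a selection over annotations, and co-occurrence analysis is an aggregation over the selected set.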
Guide Labs @guidelabsai
We’re releasing FineWeb Concept Atlas: a concept-annotated version of FineWeb-Edu built by Guide Labs.
→ 10B tokens
→ 95M chunks
→ 16,790 human-readable concepts
Guide Labs Retweeted
Initialized Capital @Initialized
If you haven’t tried Steerling-8B yet, it’s worth a look. @guidelabsai built the first large-scale inherently interpretable language model, where you can see why a model produced an answer, not just the answer itself.
Give it a try:
Guide Labs: lnkd.in/gbUBq5qA
GitHub: lnkd.in/gCq8X6BB
Hugging Face: lnkd.in/ggU_DdB4
Guide Labs @guidelabsai
At @guidelabsai we’ve proven we don’t need to sacrifice intelligence to gain transparency.
Standard AI systems learn representations that are entangled by default, but Steerling-8B takes a different path: it learns disentangled representations by construction, through architectural and training-time constraints.
Steerling-8B has already learned thousands of novel concepts that it was never explicitly constrained to represent, spanning domains from linguistics and programming to geography, culture, and abstract reasoning.
Consequently, we shift the question from "Can we reverse-engineer what this model knows?" to "What did this model learn?"
Here we show that we can easily discover thousands of novel concepts from the model; concepts it was never explicitly trained to learn. The model discovered:
- British English spelling as a coherent system
- a unified "you" across six languages
- spelled-out numbers as separate from digits
- typographic errors
- a dedicated concept for broken Unicode
These first ~100K concepts Steerling-8B discovered are evidence of what becomes possible when interpretability is a design choice rather than a retrofit. We’re unlocking an entirely new category of AI.
Test the model LIVE:
Guide Labs: guidelabs.ai/post/concept-d…
GitHub: github.com/guidelabs/stee…
Hugging Face: huggingface.co/guidelabs/stee…
#Steerling8B #GuideLabs #AI #MachineLearning
Guide Labs @guidelabsai
Current methods for controlling model behavior modify it globally and often unpredictably. What if you could directly amplify or suppress specific concepts in the model's reasoning, with predictable effects on the output?
📣 Enter Steerling-8B, the world’s first large-scale inherently interpretable LLM.
Its concept-bottleneck architecture lets users perform ‘algebra’ in concept space to control the model’s output. For the first time, you can add, remove, and compose human-understandable concepts at inference time to directly control what the model generates, without retraining or prompt engineering.
This means we can not only explain what the model is doing, but actually control it natively, by directly modifying concept activations at inference time.
Steerling-8B enables:
⚙️ Concept injection: steering a generic prompt toward any target domain
↩️ Concept suppression: unlearning a concept the model would otherwise express
🔗 Multi-concept composition: combining and opposing multiple concepts simultaneously
Learn more: guidelabs.ai/post/steerling…
Test and play with Steerling-8B:
GitHub: github.com/guidelabs/stee…
Hugging Face: huggingface.co/guidelabs/stee…
#Steerling8B #GuideLabs #AI #MachineLearning
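A minimal sketch of what "algebra in concept space" could look like, assuming concept activations are exposed as a name-to-strength mapping at the bottleneck. This is an illustrative toy, not Steerling-8B's actual API:

```python
# Toy model of concept algebra at a concept bottleneck. Activations are
# a plain name -> strength dict sitting between encoder and decoder.
# All names and the dict representation are hypothetical.

def inject(activations, concept, strength=1.0):
    """Amplify a concept before the decoder reads the bottleneck."""
    out = dict(activations)
    out[concept] = out.get(concept, 0.0) + strength
    return out

def suppress(activations, concept):
    """Zero out a concept so the decoder cannot express it."""
    out = dict(activations)
    out[concept] = 0.0
    return out

def compose(activations, edits):
    """Apply a signed mixture of concept edits in one pass."""
    out = dict(activations)
    for concept, delta in edits.items():
        out[concept] = out.get(concept, 0.0) + delta
    return out

acts = {"formal_tone": 0.2, "medicine": 0.9}
acts = inject(acts, "legal_language", 0.8)     # add a concept
acts = suppress(acts, "medicine")              # remove a concept
acts = compose(acts, {"formal_tone": 0.5,      # amplify one concept
                      "humor": -0.3})          # while opposing another
print(acts)
```

The three operations are closed over the same representation, which is what makes them composable: any sequence of injections, suppressions, and mixtures yields another valid set of concept activations for the decoder.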
Guide Labs @guidelabsai
A PRISM-1.6B model performs within 3 percent of an equivalent unconstrained GPT model on downstream benchmarks.
Guide Labs @guidelabsai
We trained PRISM, a family of interpretable language models that trace their predictions to training data in a single forward pass. When a language model predicts the next token, which training samples is it relying on? PRISM answers this by design.
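To illustrate the shape of prediction-to-training-data attribution, here is a toy sketch: training samples are ranked by a simple relevance score against the state used for the current prediction. The scoring function, vectors, and data are all illustrative assumptions, not PRISM's actual mechanism:

```python
# Toy sketch of attributing a next-token prediction to training samples.
# Relevance here is a plain dot product over hypothetical embeddings;
# the real attribution mechanism is unspecified in this sketch.

def attribute(query_vec, training_set, top_k=2):
    """Rank training samples by dot-product relevance to the query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = [(dot(query_vec, vec), text) for text, vec in training_set]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# Hypothetical training samples with toy embedding vectors.
training_set = [
    ("Paris is the capital of France.", [1.0, 0.1, 0.0]),
    ("The Eiffel Tower is in Paris.",   [0.9, 0.2, 0.1]),
    ("Bananas are rich in potassium.",  [0.0, 0.0, 1.0]),
]

# Toy query state for the token currently being predicted.
query = [1.0, 0.0, 0.0]
print(attribute(query, training_set))
# The two Paris facts outrank the unrelated sample.
```

The key property the tweet describes is that this lookup happens by design in the same forward pass that produces the token, rather than via a separate post-hoc influence analysis.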