Joseph Pollack #Ï 🎗️
3.8K posts

@josephpollack
🤖AI❤️Data enjoyer, building robots to help folks learn things quicker.



What if you could take three completely different model families… and distill them into one tiny model? 🤯
📜 Paper: arxiv.org/pdf/2605.21699
MOPD (Multi-Teacher On-Policy Distillation) has become a standard procedure in post-training. We already distill multiple specialized variants of the same model into a single set of weights. But what if we could go further - and distill models from entirely different families? Turns out, it is possible.
Today we’re releasing a paper on cross-tokenizer distillation - our first steps in this exciting direction. 📄
We distilled Qwen3-4B, Phi-4-Mini, and Llama-3B into Llama-3.2-1B. MMLU jumped from 32.05 → 46.32 when using multiple teachers. 📈
The team is now working on Nemo-RL integration so the community can try this method in their own settings. Plus, we are scaling experiments up. 🚀

Today, among the goods that are universally intended for everyone, we must also include new forms of property, such as patents, algorithms, digital platforms, technological infrastructure and data. In a context where the wealth of nations depends increasingly on knowledge and technology, when these goods remain concentrated in the hands of a few, without adequate forms of sharing and access, a new imbalance is created that contradicts the universal destination of goods. In turn, it widens the gap between the included and the excluded, between those who can participate in the digital revolution and those who remain on the margins. #MagnificaHumanitas


WARNING: GRAPHIC CONTENT Russia pounded Kyiv and surrounding areas with hundreds of drones and missiles in one of the heaviest bombardments of the city since the start of the four-year war reut.rs/4v1YG1u

omg it's totally going to work, folks! this will be amazing!




Releasing my first kernel on @huggingface: MaxSim. Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids that by using tiled scoring with simdgroup_matrix (Metal) and WMMA. The result is a 3–5× speedup compared to naive PyTorch. Try it out 👇
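To make the MaxSim bottleneck in the post above concrete, here is a minimal NumPy sketch. It is not the released kernel and uses neither Metal nor WMMA. The function names, shapes, and the `tile` parameter are assumptions made for illustration. MaxSim scores a document by taking, for each query token, the maximum similarity over the document's tokens, then summing over query tokens. The naive version builds the full (docs × query tokens × doc tokens) similarity tensor; the tiled version only ever holds one tile of it plus a running maximum.

```python
import numpy as np

def maxsim_naive(q, docs):
    """q: (nq, dim) query token embeddings; docs: (ndocs, nd, dim)."""
    # Materializes the full similarity tensor of shape (ndocs, nq, nd).
    sim = np.einsum("qd,ntd->nqt", q, docs)
    return sim.max(axis=2).sum(axis=1)

def maxsim_tiled(q, docs, tile=32):
    """Same scores, but only one (ndocs, nq, tile) block exists at a time."""
    ndocs, nd, _ = docs.shape
    running_max = np.full((ndocs, q.shape[0]), -np.inf)
    for start in range(0, nd, tile):
        block = docs[:, start:start + tile, :]           # one tile of doc tokens
        sim_tile = np.einsum("qd,ntd->nqt", q, block)    # (ndocs, nq, <=tile)
        np.maximum(running_max, sim_tile.max(axis=2), out=running_max)
    return running_max.sum(axis=1)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))          # 8 query tokens (hypothetical sizes)
docs = rng.standard_normal((5, 100, 16))  # 5 documents, 100 tokens each
print(np.allclose(maxsim_naive(q, docs), maxsim_tiled(q, docs)))
```

Because a maximum can be accumulated tile by tile, the two functions return the same scores. A GPU kernel would presumably keep the running maximum in registers or shared memory, which is where the memory savings come from.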

Limited time offer: 90% off Ring-2.6-1T and Ling-flash-2.6 on @OpenRouter with @novita_labs! Ring-2.6-1T: An extreme thinking model to help you with complex planning. Ling-flash-2.6: Helps you save $$$ by offering extreme token efficiency. Dive into the details below 👇



A satellite image tells you what the Earth looked like at one moment. COP-GEN tells you what it could look like, and why that distinction matters more than it sounds.
Most Earth observation models are deterministic. You feed in a DEM and a land-cover map, and they produce one output: the most likely optical image. One question, one answer.
The problem is that the real world doesn't work that way. The same terrain at the same coordinates can look completely different depending on cloud cover, season, soil moisture, atmospheric scattering, and a dozen other variables that aren't in your input. There's no single "correct" image. There's a distribution of plausible images. Deterministic models collapse that distribution to its mean, and they call it a prediction.
COP-GEN, from researchers at Edinburgh and ESA, is built around this problem. It's a multimodal latent diffusion transformer trained on Copernicus data: Sentinel-2 optical, Sentinel-1 SAR, elevation, land cover, timestamps, and geolocation. Rather than predicting the most likely output, it samples from a learned distribution of physically plausible outputs. Ask it the same question sixteen times, and you get sixteen different but coherent answers.
The benchmark numbers make this concrete. Against TerraMind, the existing benchmark model, COP-GEN achieves a spectral recall of 0.900. TerraMind achieves 0.028. That means COP-GEN's generated samples cover 90% of the real observation manifold, while TerraMind's cover just 2.8%. TerraMind's sixteen outputs are nearly identical to each other, clustered near the conditional mean, and cover almost none of the real data distribution. TerraMind wins on precision (each individual sample is close to a plausible real image) but fails almost entirely on recall (it can't reproduce the range of valid observations). The authors call this diversity collapse, and it's not a minor flaw. It's a structural consequence of deterministic training objectives.
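The precision/recall split described above can be illustrated with a toy example. The post does not say how the paper computes spectral precision and recall, so the sketch below assumes a generic k-nearest-neighbour formulation: a point counts as "covered" if it falls inside the k-NN ball of at least one sample from the other set. The 2-D Gaussian data, the function names, and `k=8` are all hypothetical stand-ins.

```python
import numpy as np

def knn_radii(x, k):
    """Distance from each row of x to its k-th nearest neighbour in x."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # index 0 is the point itself

def coverage(points, support, k):
    """Fraction of `points` inside the k-NN ball of at least one support sample."""
    radii = knn_radii(support, k)
    d = np.linalg.norm(points[:, None, :] - support[None, :, :], axis=-1)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real, generated, k=8):
    precision = coverage(generated, real, k)  # are generated samples realistic?
    recall = coverage(real, generated, k)     # is the real range reproduced?
    return precision, recall

rng = np.random.default_rng(0)
real = rng.standard_normal((1000, 2))                            # toy "real" data
collapsed = real.mean(axis=0) + 0.02 * rng.standard_normal((16, 2))  # near the mean
diverse = rng.standard_normal((16, 2))                           # spread like real data

print("collapsed:", precision_recall(real, collapsed))
print("diverse:  ", precision_recall(real, diverse))
```

In this toy setup, sixteen samples clustered at the mean look realistic one by one but cover almost none of the real points, which is the same pattern the post attributes to deterministic models.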
When you optimise for "produce the most accurate single output", you end up with a model that produces almost the same output every time. That's fine if you want a point estimate. It's a problem if you're trying to model uncertainty, simulate counterfactuals, or generate training data for downstream tasks.
COP-GEN trades some of that per-sample precision for real coverage. Its intra-set diversity is 9.1 times higher than TerraMind's in spectral space. Its MMD (maximum mean discrepancy from the real distribution) is roughly half of TerraMind's. It covers 63% of the real per-band reflectance range; TerraMind covers 18%.
The practical implications aren't subtle. Cloud gap-filling is the obvious one: when optical imagery is missing, you can't just impute a mean. You want a sample from the distribution of what the surface probably looked like, not a blurred average. Change detection across seasons has the same problem. Uncertainty quantification for downstream land-use models, water stress mapping, and disaster monitoring all require knowing not just what's most likely, but what range of outcomes is physically plausible.
Band infilling is another demonstration of what the architecture can do. Feed COP-GEN only the four 10 m Sentinel-2 bands (B2, B3, B4 in the visible, plus B8 in the near-infrared) and it reconstructs the remaining Sentinel-2 spectral bands, the Sentinel-1 SAR, elevation, land cover, timestamp, and geolocation. It's inferring the full observational signature of a location from a narrow slice of it.
The architecture treats each sensor and each spectral group as an independent modality with its own latent encoder. Resolution-aware tokenisation means Sentinel-2's 10m, 20m, and 60m bands are handled separately, preserving native sensor characteristics instead of resampling everything to a common grid. The diffusion process runs independent timesteps across modalities, which is what enables zero-shot any-to-any conditional generation without task-specific retraining.
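The post does not spell out how independent timesteps lead to any-to-any conditioning. One common pattern in multimodal diffusion, assumed here and not confirmed for COP-GEN, is to draw a separate timestep per modality during training, then at inference hold the given modalities at timestep 0 (clean) while the others start from noise. The sketch below shows only the forward noising step. The modality names, latent shapes, and the linear beta schedule are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
# alpha_bar[0] = 1.0 means "no noise"; alpha_bar[T] is almost pure noise.
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def noise_modalities(latents, timesteps, rng):
    """Apply forward diffusion with a separate timestep per modality."""
    noisy = {}
    for name, x0 in latents.items():
        ab = alpha_bar[timesteps[name]]
        eps = rng.standard_normal(x0.shape)
        noisy[name] = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return noisy

def training_timesteps(names, rng):
    """Training: every modality gets its own independent timestep."""
    return {n: int(rng.integers(0, T + 1)) for n in names}

def conditional_timesteps(names, given, t):
    """Inference: given modalities stay clean (t=0), the rest sit at t."""
    return {n: (0 if n in given else t) for n in names}

rng = np.random.default_rng(0)
latents = {  # hypothetical latent token grids
    "s2_10m": rng.standard_normal((64, 8)),
    "s1_sar": rng.standard_normal((64, 8)),
    "dem": rng.standard_normal((16, 8)),
}
steps = conditional_timesteps(latents, given={"dem"}, t=T)
noisy = noise_modalities(latents, steps, rng)
print(steps)
print(np.allclose(noisy["dem"], latents["dem"]))  # conditioning input is untouched
```

Under this assumption, a denoiser trained on random per-modality timesteps has already seen every "some clean, some noisy" combination, so choosing which modalities to hold at 0 at inference selects the conditioning direction without retraining.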
The paper is honest about where it falls short. Geolocation and timestamp conditioning have limited influence on outputs. Snow appears near the equator. The spatial modalities dominate the diffusion loss because they're represented by far more tokens than a latitude-longitude pair or a date. That's a training imbalance problem, and the authors flag it as a clear direction for future work.
What COP-GEN establishes, beyond the model itself, is an argument about evaluation. Standard pointwise metrics like MAE and PSNR reward deterministic solutions. A model that always produces the conditional mean will score well on those metrics and will have near-zero recall. The stochastic benchmark in this paper, comparing the full distribution of outputs rather than the best single sample, is closer to the right question. The EO community will need to adopt that framing if it wants to properly evaluate generative models.
The architecture is available. The Major Tom dataset it trained on is public. The gap between "what the Earth looks like" and "what the Earth could look like" has a model now.
Link to the full paper: arxiv.org/pdf/2603.03239
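The evaluation argument in the COP-GEN post above can be checked with a toy calculation. For a target x with variance σ², always predicting the mean gives an expected squared error of σ², while an ideal sampler that draws an independent x' from the true distribution gives E[(x − x')²] = 2σ². The sketch below verifies this numerically for MSE and PSNR (PSNR is a function of MSE). The bimodal "reflectance" distribution is a made-up stand-in, not data from the paper.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=1.0):
    return float(10.0 * np.log10(max_val ** 2 / mse(a, b)))

def sample_truth(n, rng):
    """Toy bimodal target: a pixel is either dark (~0.2) or bright (~0.8)."""
    modes = rng.choice([0.2, 0.8], size=n)
    return modes + 0.02 * rng.standard_normal(n)

rng = np.random.default_rng(0)
truth = sample_truth(100_000, rng)

mean_pred = np.full_like(truth, truth.mean())  # deterministic "blurred average"
sampler_pred = sample_truth(100_000, rng)      # ideal sampler: independent draws

print("mean predictor: MSE", mse(truth, mean_pred), "PSNR", psnr(truth, mean_pred))
print("ideal sampler:  MSE", mse(truth, sampler_pred), "PSNR", psnr(truth, sampler_pred))
```

In this toy case the mean predictor outputs a value near 0.5, which the true distribution almost never produces, yet it scores about 3 dB higher on PSNR than a sampler that matches the distribution exactly. That is the sense in which pointwise metrics favour deterministic solutions.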















