


Shayne Longpre

@ShayneRedford
Lead the Data Provenance Initiative. PhD @MIT. 🇨🇦 Prev: @Google Brain, Apple, Stanford. AI/ML/NLP




Given the increasingly closed-source nature of the U.S. AI ecosystem, it is now more important than ever to push for the proliferation of open model and dataset releases. Datamule (@johngfriedman), @TeraflopAI, and @daftengine collaborated to release 43 Billion Tokens of SEC EDGAR data.






I'm not brave enough to watch myself on camera🫣, but @shannonzshen is a great interviewer and I remember us having really interesting discussions! Annnd we made sure to feature CMU’s Scotty in the scene so don’t miss it!...🐶



Beginnings are very special. Today is an important day for @adaptionlabs. Today, a handful of one-size-fits-all models are optimized for the average use case. Averages erase the exceptional. Everything intelligent adapts. So should AI.

📢Thrilled to introduce ATLAS 🗺️: scaling laws beyond English, for pretraining, finetuning, and the curse of multilinguality.

The largest public, multilingual scaling study to date: we ran 774 exps (10M-8B params, 400+ languages) to answer:

🌍Are scaling laws different by language?
🧙‍♂️Can we model the curse of multilinguality?
⚖️Pretrain from scratch or finetune from multilingual checkpoint?
🔀Cross-lingual transfer scores for 1444 lang pairs?

1/🧵





New research presents the most compelling evidence yet that generative AI directly stores and reproduces material used to train it—a finding that could have massive legal consequences for the tech industry, Alex Reisner reports. theatlantic.com/technology/202…




