Kaleb Dubin

45 posts

@KalebDubin

Commercial Data Strategy @withprotegeai

Joined March 2023
385 Following · 21 Followers
Kaleb Dubin reposted
Angel Diaz @ADiaz456
Lastly, to my friends Carmelo Anthony and Amare Stoudemire, rest now, brothers. We have the watch, and we’ll see you in Valhalla
21
1.7K
16.8K
340.2K
Kaleb Dubin reposted
finals bee @bethaninic_
there was a time in history when i had to put my faith in kevin knox
79
536
3.8K
84.2K
Kaleb Dubin reposted
Legion Hoops @LegionHoops
There’s nothing like Madison Square Garden. Oh. My. Goodness. (via @jschwartz115)
64
644
7.6K
192.1K
Mark Gurman @markgurman
Also in Power On: Apple is preparing an enhancement to Genmoji in iOS 27: Suggested Genmojis created from your photos and your commonly typed phrases. bloomberg.com/news/newslette…
16
22
406
55.6K
Kaleb Dubin reposted
Karan Singhal @thekaransinghal
Today we’re introducing two big steps for health at OpenAI: - ChatGPT for Clinicians, a free version of ChatGPT designed for clinical work - HealthBench Professional, a new benchmark to evaluate real clinician chat tasks We’re excited about what this can unlock for care. ❤️
263
562
4.8K
1.6M
Kaleb Dubin reposted
Bobby Samuels @BobbySamuels
Today, I’m excited to announce our newest vertical: Spatial & Physical Intelligence. We have been investing in an entirely new category focused on supporting world models and robotics labs. From working closely with the labs in both domains, we've noticed a consistent pattern: they end up needing the same underlying training data.
Our thesis is anchored to four fundamental data types that are important for this development stage:
1) Ego- and Exo-centric Video: First- and third-person footage of humans performing real-world tasks, plus vehicle-based captures of dynamic environments. Depth data, LiDAR, hand tracking, descriptive annotations, overlapping camera views, and time-synced data all increase spatial understanding.
2) Motion Capture: Mapping the "physics of the mundane," moving beyond entertainment and gaming to capture tactile object manipulation, locomotion, and human-to-human interactions.
3) Video Gameplay Data: Studio-grade simulated environments paired with precise player telemetry.
4) 3D Assets: 3D scans of objects & scenes, including raw input files before construction.
The Core Challenge: data is siloed. High-quality training data is trapped in fragmented datasets instead of being accessible in a unified data layer. Robotics and world models are developing more quickly than ever, and the data layer should move just as fast.
We’re here to help build high-quality, content-rich datasets that create the AI data supply chain for Spatial & Physical Intelligence.
This is a key area where we're scaling at Protege, and we’re actively looking for builders and feedback. Come build with us and tell us what we’re missing! (see 🧵 below for more details)
2
3
15
372
Kaleb Dubin reposted
Protege @withprotegeai
Our CEO @BobbySamuels spoke with AI-Tech Park about one of AI's biggest challenges: access to high-quality, real-world training data. Three key takeaways from the conversation:
🔍 The Data Bottleneck Is Real. The internet still accounts for only a tiny fraction of the world's data. The most valuable information sources (e.g., clinical records, film libraries, proprietary databases) have never been structured for model training, which is why data isn't "AI-ready" out of the box!
⚖️ Ethical Licensing Is the Future. The industry is shifting to collaboration and engagement. Our Chief Content Officer Dave Davis spoke about this in Brussels at the European Broadcasting Union (EBU)'s AI Forum last week: transparent data licensing gives data providers revenue and control, while giving AI developers cleaner datasets without the legal risk.
🧪 Synthetic Data Isn't Enough! Models trained exclusively on AI-created content (or even content created by people but in a vacuum) can perform well in a lab but fail in real-world applications. We need real-world data for grounded, unbiased training and evaluation.
Full interview in thread!
2
2
2
158
Kaleb Dubin reposted
Protege @withprotegeai
OpenAI cancelled Sora last week... and then their billion-dollar deal with Disney.
💡 Protege's Chief Content Officer Dave Davis explained in @verge coverage how Disney's response is indicative of the licensing-first world we've entered for any AI content:
"While some online chatter suggested it demonstrates a larger failure of AI in entertainment, Dave Davis, chief content officer at Protege, said that it was clear that Disney is very much still open to licensing agreements with other companies working on video-generation AI. That could eventually mean partnering with companies like Google, Runway, Luma, Moonvalley, Kling, or Seedance. “We see a lot of momentum in licensing,” Davis said, adding that for well-known companies like Disney, it’s typically a strategic play to inspire fan interaction in new ways. “The Disney-OpenAI deal was one sign of that. I think it’s great that in the exit announcement, Disney made it clear that they are wide open for business to continue character licensing with other parties.”"
Not even a day after @openai announced the Sora news, Elon Musk's @xai stepped into the gap to announce that the company was doubling down on consumer AI videos - @bloomberg coverage in thread.
Dave and the rest of the Protege media team continue to unlock new revenue streams for content providers - giving access to all video and multimodal model builders rather than just one.
2
2
5
196
Perplexity @perplexity_ai
Perplexity Computer now connects to your health apps, wearable devices, lab results, and medical records. Build personalized tools and applications with your health data, or track everything in your health dashboard.
203
323
3.3K
1.6M
Engy Ziedan @engyziedan
We are excited for all the questions we will investigate and all the datasets we will unlock. Our team is small but has already contributed measurable value add to several foundation models.
matt turk @TurkMatthew

Excited to share that I’ve joined @withprotegeai as a Senior Machine Learning Researcher on the DataLab team.
My 2 years working with the team at @CleanlabAI were an incredibly formative experience. I’m deeply grateful for the chance I had to learn from them, to work on data-centric AI with such thoughtful researchers and builders, and to contribute during a period that ultimately led to Cleanlab being acquired into @joinHandshake AI. I learned a tremendous amount about the importance of data quality, evaluation, and trustworthiness in making modern AI systems more accurate and reliable. Throughout my time there, my conviction only grew that the next major advances in AI will come not just from better models or more compute, but from better data.
At DataLab, our goal is to treat the data layer of AI with the same scientific rigor that model labs apply to algorithms by building a dedicated research institution for AI data: designing high-fidelity datasets and multimodal benchmarks grounded in real-world scenarios, working closely with frontier labs on their hardest data challenges, and developing standardized ways, including “FICO scores for AI data”, to measure dataset quality, contamination, and benchmark reliability.
Another important piece of this work is understanding how different kinds of data support different parts of the AI training stack. Reinforcement learning (RL) environments are a powerful form of training data that generate structured training tuples like (state, action, reward, next state) and are extremely useful for post-training optimization when the world can be simulated. But many of the highest-value domains for AI, including healthcare, enterprise workflows, and complex multimodal reasoning, cannot be faithfully simulated. Advancing models in these areas requires real-world datasets, carefully designed benchmarks, and domain-specific data for pre-training and mid-training adaptation.
The idea behind DataLab is simple but important: every major leap in AI capability has historically followed a breakthrough in data (from ImageNet to large-scale web corpora). As models and compute continue to advance rapidly, closing the data gap, the gap between the data that AI systems need and the data that actually exists in usable form, may be one of the most important challenges for the field. Here is more info on some of the work the team has done so far: datalab.withprotege.ai
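For readers unfamiliar with the (state, action, reward, next state) tuples mentioned in the post above, here is a minimal Python sketch of how a simulated environment can emit them. Everything in it is an illustrative assumption rather than anything from DataLab or Protege: `CorridorEnv`, `Transition`, and `collect_episode` are hypothetical names, the environment is a toy one-dimensional corridor, and the policy is uniformly random.

```python
import random
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    """One training tuple: (state, action, reward, next_state)."""
    state: int
    action: int
    reward: float
    next_state: int


class CorridorEnv:
    """Hypothetical toy environment: an agent walks along positions 0..goal."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action is -1 (left) or +1 (right); the position is clamped to [0, goal].
        next_state = min(max(self.state + action, 0), self.goal)
        done = next_state == self.goal
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        self.state = next_state
        return next_state, reward, done


def collect_episode(env: CorridorEnv, rng: random.Random,
                    max_steps: int = 50) -> List[Transition]:
    """Roll out a random policy and record every transition it produces."""
    state = env.reset()
    transitions: List[Transition] = []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])
        next_state, reward, done = env.step(action)
        transitions.append(Transition(state, action, reward, next_state))
        state = next_state
        if done:
            break
    return transitions


episode = collect_episode(CorridorEnv(goal=3), random.Random(0))
for t in episode:
    print(t)
```

The sketch only works because the corridor can be simulated exactly, which is the condition the post attaches to RL environments; domains that cannot be simulated faithfully have no equivalent of `step` to call.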

1
0
3
209
Mistral AI @MistralAI
🤝 This will be the first project we build together with Nvidia as we become a founding member of the Nemotron Coalition. Details: mistral.ai/news/mistral-a…
6
18
216
22.9K
Mistral AI @MistralAI
🚀Announcing a strategic partnership with NVIDIA to co-develop frontier open-source AI models, combining Mistral AI’s frontier model architecture and full-stack AI offering with NVIDIA’s leading compute infrastructure and development tools.
107
398
4.1K
259.1K
Zitong Yang @ZitongYang0
This is only possible with @tyler_griggs_'s tool use library github.com/thinking-machi… I am unfortunately late to the party, but I only recently realized how much of a paradigm shift multi-turn+tool-use is. I even wonder if it makes sense to rewrite the entire pretraining corpus into an agentic trajectory? This solves two problems: (1) removing the gap between pretraining and test distribution; (2) agentic turn change can function as a natural "glue" that puts related internet documents together in context -- agent browsing one document at turn 7 influences its action/generation at turn 107 -- encoding the internet in a natural long-context format. Also, a great time to share that I have joined @thinkymachines. Thanks @miramurati for teaching me the value of focus, @lilianweng for instilling in me the power of responsibility, and @johnschulman2 for showing me by example the free spirit of scientific exploration! We are hiring job-boards.greenhouse.io/thinkingmachin…
clare ❤️‍🔥 @clarejtbirch

kind of a big deal but actual legend @ZitongYang0 has integrated @tinkerapi with @harborframework, so you can use Harbor on Tinker w ~no code change now 🤠🧡

5
7
117
31.6K