Kaleb Dubin

45 posts

@KalebDubin

Commercial Data Strategy @withprotegeai

Joined March 2023

385 Following · 21 Followers

Kaleb Dubin reposted

Angel Diaz@ADiaz456·3d

Lastly, to my friends Carmelo Anthony and Amare Stoudemire, rest now, brothers. We have the watch, and we’ll see you in Valhalla

1.7K

16.8K

340.2K

Kaleb Dubin reposted

Josh Billinson@jbillinson·4d

Jalen Brunson is going to trap them in the tunnels underneath the city like Bane

Jennifer X. Williams@JenXperience

HUGE police presence outside of Radio City Music Hall for the Knicks Game 4 Watch Party, the pep talk included: “If they win, it’s gonna get rowdy.” A cop told me that they are going to be dispersed throughout the area. Tonight’s event is sold out / at capacity at 6K fans… #AlwaysKnicks

471

43.4K

Kaleb Dubin@KalebDubin·4d

@KnicksMuse Lined up for bane.

677

KnicksMuse@KnicksMuse·4d

They’ve got ‘em lined up like an alien invasion is coming 😭😭😭

Jennifer X. Williams@JenXperience

296

5.1K

978.2K

Kaleb Dubin reposted

finals bee@bethaninic_·5d

there was a time in history when i had to put my faith in kevin knox

536

3.8K

84.2K

Kaleb Dubin@KalebDubin·20 May

Yup!! @nyknicks

Knicks Memes@KnicksMemes

"How many times are you going to watch the replay of that Landry Shamet 3 bouncing in to tie it for the Knicks?"

Kaleb Dubin reposted

Legion Hoops@LegionHoops·20 May

There’s nothing like Madison Square Garden. Oh. My. Goodness. (via @jschwartz115)

644

7.6K

192.1K

Kaleb Dubin@KalebDubin·17 May

@markgurman What’s a “genmoji” ???

117

Mark Gurman@markgurman·17 May

Also in Power On: Apple is preparing an enhancement to Genmoji in iOS 27: Suggested Genmojis created from your photos and your commonly typed phrases. bloomberg.com/news/newslette…

406

55.6K

Kaleb Dubin@KalebDubin·16 May

@markgurman You’re not just a random guy to me, you’re THE random guy @markgurman

1.4K

Mark Gurman@markgurman·16 May

“Random guy”

Proton Mail@ProtonMail

865

113.8K

Kaleb Dubin@KalebDubin·5 May

Looks like they’re prepared for Bane

Big Knick Energy@BigKnickEnergy_

NYPD is ready for game one.

Kaleb Dubin reposted

Karan Singhal@thekaransinghal·23 Apr

Today we’re introducing two big steps for health at OpenAI: - ChatGPT for Clinicians, a free version of ChatGPT designed for clinical work - HealthBench Professional, a new benchmark to evaluate real clinician chat tasks We’re excited about what this can unlock for care. ❤️

263

562

4.8K

1.6M

Kaleb Dubin reposted

Bobby Samuels@BobbySamuels·15 Apr

Today, I’m excited to announce our newest vertical: Spatial & Physical Intelligence. We have been investing in an entirely new category focused on supporting world models and robotics labs. From working closely with the labs in both domains we've noticed a consistent pattern: they end up needing the same underlying training data. Our thesis is anchored to four fundamental data types that are important for this development stage: 1) Ego- and Exo-centric Video: First and third-person footage of humans performing real-world tasks and vehicle-based captures of dynamic environments. Depth data, LiDAR, hand tracking, descriptive annotations, overlapping camera views, and time-synced data all increase spatial understanding. 2) Motion Capture: Mapping the "physics of the mundane," moving beyond entertainment and gaming to capture tactile object manipulation, locomotion, and human to human interactions. 3) Video Gameplay Data: Studio-grade simulated environments paired with precise player telemetry. 4) 3D Assets: 3D scans of objects & scenes including raw input files before construction. The Core Challenge: data is siloed. High quality training data is trapped in fragmented datasets instead of accessible in a unified data layer. Robotics and world models are developing more quickly than ever – and the data layer should move just as fast. We’re here to help build high-quality, content-rich datasets that create the data supply chain for Spatial & Physical Intelligence data for AI. This is a key area where we're scaling at Protege – we’re actively looking for builders and feedback. Come build with us and tell us what we’re missing! (see 🧵 below for more details)

372

Kaleb Dubin@KalebDubin·12 Apr

Very much a “why didn’t I think of this” project.

Lucas Gordon@lucasgordon

Weekend project. NYC lines.

Kaleb Dubin reposted

Protege@withprotegeai·31 Mar

Our CEO @BobbySamuels spoke with AI-Tech Park about one of AI's biggest challenges: access to high-quality, real-world training data. Three key takeaways from the conversation: 🔍 The Data Bottleneck Is Real: The internet still accounts for a tiny fraction of the data in the world. The most valuable information sources (e.g., clinical records, film libraries, proprietary databases) have never been structured for model training, which is why data isn't just "AI-ready" out of the box! ⚖️ Ethical Licensing Is the Future: The industry is shifting to collaboration and engagement. Our Chief Content Officer Dave Davis spoke about this in Brussels at the European Broadcasting Union (EBU)'s AI Forum last week: transparent data licensing gives data providers revenue and control, while giving AI developers cleaner datasets without the legal risk. 🧪 Synthetic Data Isn't Enough: Models trained exclusively on AI-created content (or even content created by people but in a vacuum) can perform well in a lab but fail in real-world applications. We need real-world data for grounded, unbiased training and evaluation. Full interview in thread!

158

Kaleb Dubin reposted

Protege@withprotegeai·30 Mar

OpenAI cancelled Sora last week... and then their billion-dollar deal with Disney. 💡 Protege's Chief Content Officer Dave Davis explained in @verge coverage how Disney's response is indicative of the licensing-first world we've entered for any AI content: "While some online chatter suggested it demonstrates a larger failure of AI in entertainment, Dave Davis, chief content officer at Protege, said that it was clear that Disney is very much still open to licensing agreements with other companies working on video-generation AI. That could eventually mean partnering with companies like Google, Runway, Luma, Moonvalley, Kling, or Seedance. ‘We see a lot of momentum in licensing,’ Davis said, adding that for well-known companies like Disney, it’s typically a strategic play to inspire fan interaction in new ways. ‘The Disney-OpenAI deal was one sign of that. I think it’s great that in the exit announcement, Disney made it clear that they are wide open for business to continue character licensing with other parties.’" Not even a day after @openai announced the Sora news, Elon Musk's @xai stepped into the gap to announce that the company was doubling down on consumer AI videos - @bloomberg coverage in thread. Dave and the rest of the Protege media team continue to unlock new revenue streams for content providers - giving access to all video and multimodal model builders rather than just one.

196

Kaleb Dubin@KalebDubin·20 Mar

@perplexity_ai 👀

Perplexity@perplexity_ai·19 Mar

Perplexity Computer now connects to your health apps, wearable devices, lab results, and medical records. Build personalized tools and applications with your health data, or track everything in your health dashboard.

203

323

3.3K

1.6M

Kaleb Dubin@KalebDubin·18 Mar

@RealChalamet 🔥🔥🔥

Timothée Chalamet@RealChalamet·17 Mar

DUNE PART THREE

2.6K

26.9K

174.9K

10.3M

Kaleb Dubin@KalebDubin·17 Mar

@engyziedan DataLab to the moon!!!

Engy Ziedan@engyziedan·17 Mar

We are excited for all the questions we will investigate and all the datasets we will unlock. Our team is small but has already contributed measurable value add to several foundation models.

matt turk@TurkMatthew

Excited to share that I’ve joined @withprotegeai as a Senior Machine Learning Researcher on the DataLab team. After 2 years at @CleanlabAI, working with the team was an incredibly formative experience. I’m deeply grateful for the chance I had to learn from them to work on data-centric AI with such thoughtful researchers and builders, and to contribute during a period that ultimately led to Cleanlab being acquired into @joinHandshake AI. I learned a tremendous amount about the importance of data quality, evaluation, and trustworthiness in modern AI systems to make them more accurate and reliable. Throughout my time there, my conviction only grew that the next major advances in AI will come not just from better models or more compute, but from better data. At DataLab, our goal is to treat the data layer of AI with the same scientific rigor that model labs apply to algorithms by building a dedicated research institution for AI data: designing high-fidelity datasets and multimodal benchmarks grounded in real-world scenarios, working closely with frontier labs on their hardest data challenges, and developing standardized ways, including “FICO scores for AI data”, to measure dataset quality, contamination, and benchmark reliability. Another important piece of this work is understanding how different kinds of data support different parts of the AI training stack. Reinforcement learning (RL) environments are a powerful form of training data that generate structured training tuples like (state, action, reward, next state) and are extremely useful for post-training optimization when the world can be simulated. But many of the highest-value domains for AI, including healthcare, enterprise workflows, and complex multimodal reasoning, cannot be faithfully simulated. Advancing models in these areas requires real-world datasets, carefully designed benchmarks, and domain-specific data for pre-training and mid-training adaptation. 
The idea behind DataLab is simple but important: every major leap in AI capability has historically followed a breakthrough in data (from ImageNet to large-scale web corpora). As models and compute continue to advance rapidly, closing the data gap, the gap between the data that AI systems need and the data that actually exists in usable form, may be one of the most important challenges for the field. Here is more info on some of the work the team has done so far: datalab.withprotege.ai

209

Kaleb Dubin@KalebDubin·17 Mar

@MistralAI Very exciting!! 🚀

296

Mistral AI@MistralAI·16 Mar

🤝 This will be the first project we build together with Nvidia as we become a founding member of the Nemotron Coalition. Details: mistral.ai/news/mistral-a…

216

22.9K

Mistral AI@MistralAI·16 Mar

🚀Announcing a strategic partnership with NVIDIA to co-develop frontier open-source AI models, combining Mistral AI’s frontier model architecture and full-stack AI offering with NVIDIA’s leading compute infrastructure and development tools.

107

398

4.1K

259.1K

Kaleb Dubin reposted

Engy Ziedan@engyziedan·15 Mar

It is great to see the persistent interest in evals from the researchers building these tools. I wrote about 5 RCTs in social science that examined health as an outcome and could have lessons for AI evaluations. Link below: engyziedan.substack.com/p/evaluating-m…

Karan Singhal@thekaransinghal

x.com/i/article/2032…

270

Kaleb Dubin@KalebDubin·14 Mar

@miramurati @ZitongYang0 @tyler_griggs_ Congrats! Excited to see how we can help on the training data front!

141

Mira Murati@miramurati·14 Mar

@ZitongYang0 @tyler_griggs_ Welcome to the team Zitong. It’s great working with you

7.8K

Zitong Yang@ZitongYang0·13 Mar

This is only possible with @tyler_griggs_'s tool use library github.com/thinking-machi… I am unfortunately late to the party, but I only recently realized how much of a paradigm shift multi-turn+tool-use is. I even wonder if it makes sense to rewrite the entire pretraining corpus into an agentic trajectory? This solves two problems: (1) removing the gap between pretraining and test distribution; (2) agentic turn change can function as a natural "glue" that puts related internet documents together in context -- agent browsing one document at turn 7 influences its action/generation at turn 107 -- encoding the internet in a natural long-context format. Also, a great time to share that I have joined @thinkymachines. Thanks @miramurati for teaching me the value of focus, @lilianweng for instilling in me the power of responsibility, and @johnschulman2 for showing me by example the free spirit of scientific exploration! We are hiring job-boards.greenhouse.io/thinkingmachin…

clare ❤️‍🔥@clarejtbirch

kind of a big deal but actual legend @ZitongYang0 has integrated @tinkerapi with @harborframework, so you can use Harbor on Tinker w ~no code change now 🤠🧡

117

31.6K
