Kaleb Dubin
45 posts

@KalebDubin
Commercial Data Strategy @withprotegeai

HUGE police presence outside of Radio City Music Hall for the Knicks Game 4 Watch Party. The pep talk included: “If they win, it’s gonna get rowdy.” A cop told me that they are going to be dispersed throughout the area. Tonight’s event is sold out / at capacity at 6K fans… #AlwaysKnicks






NYPD is ready for game one.



Weekend project. NYC lines.





Excited to share that I’ve joined @withprotegeai as a Senior Machine Learning Researcher on the DataLab team.
My 2 years at @CleanlabAI were an incredibly formative experience. I’m deeply grateful for the chance to learn from the team, to work on data-centric AI with such thoughtful researchers and builders, and to contribute during a period that ultimately led to Cleanlab being acquired into @joinHandshake AI. I learned a tremendous amount about how data quality, evaluation, and trustworthiness make modern AI systems more accurate and reliable. Throughout my time there, my conviction only grew that the next major advances in AI will come not just from better models or more compute, but from better data.
At DataLab, our goal is to treat the data layer of AI with the same scientific rigor that model labs apply to algorithms by building a dedicated research institution for AI data: designing high-fidelity datasets and multimodal benchmarks grounded in real-world scenarios, working closely with frontier labs on their hardest data challenges, and developing standardized ways, including “FICO scores for AI data”, to measure dataset quality, contamination, and benchmark reliability.
Another important piece of this work is understanding how different kinds of data support different parts of the AI training stack. Reinforcement learning (RL) environments are a powerful source of training data: they generate structured training tuples like (state, action, reward, next state) and are extremely useful for post-training optimization when the world can be simulated. But many of the highest-value domains for AI, including healthcare, enterprise workflows, and complex multimodal reasoning, cannot be faithfully simulated. Advancing models in these areas requires real-world datasets, carefully designed benchmarks, and domain-specific data for pre-training and mid-training adaptation.
The idea behind DataLab is simple but important: every major leap in AI capability has historically followed a breakthrough in data (from ImageNet to large-scale web corpora). As models and compute continue to advance rapidly, closing the data gap, the gap between the data that AI systems need and the data that actually exists in usable form, may be one of the most important challenges for the field. Here is more info on some of the work the team has done so far: datalab.withprotege.ai





kind of a big deal but actual legend @ZitongYang0 has integrated @tinkerapi with @harborframework, so you can use Harbor on Tinker w ~no code change now 🤠🧡






