

Secludy AI

@Secludy
Privacy-Guaranteed Synthetic Data for Training AI Models



1/7 High-quality data is a major bottleneck to AI progress. But while recent LLMs were trained on ~hundreds of TBs of data, the world has digitized 180 zettabytes of it. A billion times more. The problem is access. In a new essay for The Launch Sequence, @iamtrask and @lace31692 lay out a possible solution…

LLMs leak up to 27.5% of the sensitive PII (Personally Identifiable Information such as emails, SSNs, VINs, and Bitcoin wallets) in their training data. @Secludy makes it easy to generate privacy-guaranteed synthetic data that is a near replica of the original unstructured dataset, but better. How? They use privacy-protected LLMs, adding carefully controlled noise to the model weights before generating synthetic data for AI model fine-tuning and evaluation. They just released a technical report that demonstrates their approach.

🧵1/n LLMs are known to memorize and expose sensitive information. Even when trained on masked unstructured datasets, they can still retain and regurgitate personal data, which is a major privacy risk.
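To make "carefully controlled noise on the model weights" a bit more concrete, here is a minimal Python sketch of one common pattern: clip a weight vector to a bounded L2 norm, then add Gaussian noise scaled to that bound. This is an illustration only, not Secludy's actual method, which the post does not describe. The names `clip_by_l2_norm`, `add_gaussian_noise`, `max_norm`, and `noise_multiplier` are hypothetical, and this step alone does not amount to a formal privacy guarantee.

```python
import math
import random


def clip_by_l2_norm(weights, max_norm):
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm or norm == 0.0:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


def add_gaussian_noise(weights, max_norm, noise_multiplier, rng):
    """Clip the weights, then add zero-mean Gaussian noise to each entry.

    The noise standard deviation is noise_multiplier * max_norm, so the
    noise is calibrated to the largest norm a clipped vector can have.
    (Assumed scheme for illustration; not taken from the post.)
    """
    clipped = clip_by_l2_norm(weights, max_norm)
    sigma = noise_multiplier * max_norm
    return [w + rng.gauss(0.0, sigma) for w in clipped]


# Toy usage: a two-element "weight vector" and a seeded generator.
rng = random.Random(0)
weights = [3.0, 4.0]
noisy = add_gaussian_noise(weights, max_norm=1.0, noise_multiplier=0.5, rng=rng)
```

A larger `noise_multiplier` hides more about any individual training example but also degrades the model more, which is why the noise has to be controlled rather than arbitrary.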

