Jared Sulzdorf
1.8K posts

Jared Sulzdorf
@j_sulz
I like pretty things, functional things, funny things, food things, and computer things. Not necessarily in that order.
Seattle, WA · Joined September 2008
287 Following · 379 Followers
Jared Sulzdorf reposted

The Hugging Face Hub team is on a tear recently:
> You can create custom apps with domains on Spaces
> Edit GGUF metadata on the fly
> 100% of the Hub is powered by Xet - faster, more efficient
> Responses API support for ALL Inference Providers
> MCP-UI support for the HF MCP Server
> Search papers by org
> Showcase repository size in the UI
and a lot more - excited for the coming weeks/months as we continue to improve the overall UX! 🤗

Jared Sulzdorf reposted

Today, we've finalized this first phase of migrating the Hub to a new, modern storage system. One that's built to scale with AI builders of today and tomorrow. huggingface.co/blog/from-file…
There's still a lot of work to do, but we're excited for what's next. 💪

The Hub is 100% on Xet. 🚀
A little over a year ago, @huggingface acquired @xetdata to unlock the next phase of growth in models and datasets. huggingface.co/blog/xethub-jo…
In April, there were 1,000 Hugging Face repos on Xet. Now every repo (over 6M) on the Hub is on Xet.

Jared Sulzdorf reposted

@SIGKITTEN @pcuenq @julien_c Thanks @pcuenq appreciate you helping out here! @SIGKITTEN safe to say you're downloading on a Mac? We've run afoul of the small default file descriptor limit there 😅 Like Pedro notes, upping it with `ulimit -n [BIG NUMBER]` will do the trick.
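For context on the fix above: macOS ships with a low default soft limit on open file descriptors (often 256 per shell), which parallel downloaders can exhaust. A quick sketch of inspecting and raising it for the current shell session — the value 4096 is just an example, not a recommendation from the thread:

```shell
# Show the current soft and hard open-file limits for this shell
ulimit -Sn
ulimit -Hn

# Raise the soft limit for this session only; the new value must not
# exceed the hard limit printed above (4096 is an arbitrary example)
ulimit -n 4096
ulimit -Sn
```

The change lasts only for the current shell; add it to your shell profile to make it stick across sessions.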
Jared Sulzdorf reposted

New blog post 🚨 Every data engineer should read it
@kszucs_ (@ApacheArrow PMC) explains how to drastically speed up Parquet file uploads and downloads.
Yes, it can easily outpace S3.
Best part: the feature enabling this is open source
Link in 🧵

Jared Sulzdorf reposted

A new Pandas feature landed 3 days ago and no one noticed.
Upload ONLY THE NEW DATA to dedupe-based storage like @huggingface (Xet). Data that already exists in other files doesn't need to be uploaded.
Possible thanks to the recent addition of Content Defined Chunking for Parquet.
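The idea behind Content Defined Chunking: boundaries are chosen from the bytes themselves rather than at fixed offsets, so inserting data early in a file shifts everything yet the downstream chunks realign and keep their hashes — only the few chunks touching the edit need uploading. A minimal toy sketch (tiny window and chunk sizes for illustration; this is not the actual Parquet or Xet implementation):

```python
import hashlib

W = 8       # rolling window size in bytes (toy-sized; real systems use more)
MASK = 0xF  # cut when the low 4 bits of the window hash are zero (~16-byte chunks)

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries. A cut depends only on
    the last W bytes, so boundaries resynchronize after an edit."""
    chunks, start = [], 0
    for i in range(W - 1, len(data)):
        window = data[i - W + 1 : i + 1]
        h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if h & MASK == 0:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_ids(data: bytes) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}

v1 = bytes(range(256)) * 8   # the file version already stored remotely
v2 = b"NEW ROWS" + v1        # an edit that prepends new data
shared = chunk_ids(v1) & chunk_ids(v2)
# Nearly every chunk of v2 already exists in v1's chunk set, so a
# dedupe-aware client would upload only the few chunks near the edit.
```

With fixed-size chunking, the same prepend would shift every boundary and force a full re-upload; content-defined cuts are what make the "only the new data" claim possible.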


We've moved the first 20PB from Git LFS to Xet on @huggingface without any interruptions; now we're migrating the rest of the Hub. We got this far by focusing on the community first.
Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrating…

These are hard numbers to put into context, but let's try.
The latest run of Common Crawl from @CommonCrawl was 471 TB.
We now have ~32 crawls stored in Xet. At peak upload speed we could move the latest crawl into Xet in about two hours.
🤯🤯🤯
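A quick back-of-the-envelope check on that claim (assuming decimal terabytes; the implied transfer rate is my arithmetic, not a figure from the thread):

```python
crawl_bytes = 471e12        # latest Common Crawl run, ~471 TB
seconds = 2 * 3600          # "about two hours"

# Implied sustained throughput, in gigabytes per second
rate_gb_s = crawl_bytes / seconds / 1e9
# works out to roughly 65 GB/s at peak
```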

It's been a while since I took a step back and looked at our progress migrating @huggingface from Git LFS to Xet, but every time I do, it's mind-boggling.
A month ago there were 5,500 users/orgs on Xet with 150K repos and 4PB. Today?
🤗 700,000 users/orgs
📈 350,000 repos
🚀 15PB

Jared Sulzdorf reposted

Xet is now the default storage for new builders on @huggingface!
What it means for 🤗Datasets:
- Deduplicated downloads and uploads for speed⚡
- Works with the new Parquet CDC writer, robust to insert/delete/edits 💪
@ApacheParquet has a bright future on HF :)


@yukiarimo Faster uploads and downloads and storage that will let the Hub continue to scale!
Xet uses a chunk-based versioning approach instead of a file-based one. That, along with supporting infra and a Rust client, makes for snappier transfers.
More details here: huggingface.co/blog/from-file…

New users and organizations can say goodbye to LFS on @huggingface; Xet is now the default storage for new builders on the Hub 🚀🚀🚀
Just sign up for an account, create a new repo, pip install huggingface_hub and you're off!
huggingface.co/changelog/xet-…

To migrate your existing repos to Xet, sign up here huggingface.co/join/xet
And we'll take care of the rest 🤗
