8K posts

Ben Schmidt / @benmschmidt@sigmoid.social

@benmschmidt

VP of Information Design @nomic_ai, building new ways to interpret and shape embedding models. Onetime history/digital humanities prof. @bschmidt.bsky.social

Montclair/Manhattan Joined December 2010

1.1K Following 8.8K Followers

Pinned Tweet

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 13 Apr

Read and explore this rich interactive of 20 *million* research articles from PubMed, a project we're releasing today with @ritagonmar and @hippopedoid. It's a *beautiful* embedding structure, a fascinating, complete corpus. Some highlights (thread) static.nomic.ai/pubmed.html

212

737

135.8K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Nomic @nomic_ai · 7 Nov

AI systems excel in domains that have abundant coverage in internet data. Large sectors of the economy are not digital-native. Their data, processes, and workflows are governed by signals that are out of distribution of foundation models. Introducing the new Nomic Platform

10.1K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Andriy Mulyar @andriy_mulyar · 13 Aug

Nomic has a new X account. Stay tuned for some exciting updates over the next few months.

Nomic @nomic_ai

We're re-branding! This is now the new official Nomic X account! Follow us for updates on new open-source AI models and platform developments!

1.6K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Nomic @nomic_ai · 13 Aug

We're re-branding! This is now the new official Nomic X account! Follow us for updates on new open-source AI models and platform developments!

7.7K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Andriy Mulyar @andriy_mulyar · 15 Jun

hiring an ml intern to work on vlm postraining for a special project, reports directly to me. must be exceptional. apply via dms.

219

39.9K

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 28 Mar

In general I try not to post high-quality original content to this account anymore, and I feel pretty confident that the above post doesn't violate that practice.

406

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 28 Mar

it works

[Image]

674

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

CalCo @calco_io · 5 Mar

Introducing Atlas Analyst: The Data Agent for Data Analytics Ask questions, get answers with references to your data, and immediately take action based on those insights.

5.1K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Alexander Doria @Dorialexander · 11 Feb

Announcing the release of Common Corpus 2. The largest fully open corpus for pretraining comes back better than ever: 2 trillion tokens with document-level licensing, provenance and language information. huggingface.co/datasets/PleIA…

388

41.5K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Andriy Mulyar @andriy_mulyar · 24 Jan

Hugging Face is the hub for AI datasets and today we bring every dataset to life with Nomic's first-class Hugging Face data connector. With a few clicks, you can now vector search, curate, and collaborate on any dataset in @huggingface huggingface.co/blog/MaxNomic/…

1.5K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Daniel van Strien @vanstriendaniel · 24 Jan

I created a map for Hub dataset cards using this new connector in less than 5 minutes.

CalCo @calco_io

Vector Search Any Hugging Face Dataset 🤗 Introducing the @huggingface Datasets Connector in Nomic Atlas huggingface.co/blog/MaxNomic/…

992

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

CalCo @calco_io · 24 Jan

Vector Search Any Hugging Face Dataset 🤗 Introducing the @huggingface Datasets Connector in Nomic Atlas huggingface.co/blog/MaxNomic/…

16.5K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

CalCo @calco_io · 27 Dec

Introducing Open-Source, On-Device Inference-Time Compute in GPT4All
- New: GPT4All Reasoner v1
- Support for Code Interpreter, Tool Calling and Code Sandboxing
Inference-time compute is now available to every laptop in the world.

GIF

354

33.8K

Ben Schmidt / @benmschmidt@sigmoid.social retweeted

Wilson Marcílio Jr @EstecioJunior · 23 Dec

Comparing ModernBERT and BERT embeddings reveals some nice properties. The embeddings from the two base architectures show different features for this dataset in terms of class cohesion. atlas.nomic.ai/data/wilson/mo…

1.7K

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander which is kinda weird actually given that they did the Bodleian but not BNF -- do you have any sense what library those scans would be from?

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander A bit more. Here's the French counts in the second half of the 17C (columns are `year, words, pages, books`) from storage.googleapis.com/books/ngrams/b…

117

Alexander Doria @Dorialexander · 8 Dec

Google books studies like this still fail to address the significant corpus effects: *Language: Latin is discounted but was the international language of science. *Format: big turning point in the 18th century is the rise of newspapers and periodicals.

Whyvert @whyvert

Interesting new paper! 1520-1720 elite human capital became obsessed with religion (and likely high in religiosity) then 200 years later suddenly changed to be less religious. As shown by density of the words God, Jesus, and Christ (vernacular and Latin) in books.

1.7K

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander Also OTOH nobody in DH uses google books/ngrams, so they probably don't view it as a relevant field.

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander My guess would be that they occasionally chat with Bob Darnton or something, but they're not interested in the DH people because they figure they have all the computer expertise so they just need to check that against book expertise.

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander Still though after those changes what I'm seeing is that GB has high dozens to low hundreds of books annually in the english corpus for the 17C. EEBO is like 10x that, although maybe a lot of EEBO is Latin?

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander I don't think they really care? Not sure. I'm kind of amazed on reflection that in the last 15 years I don't think I've ever met a single person actually working on Google Books, even though they funded my postdoc.

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander And hah they didn't even bother to add it to the download page! storage.googleapis.com/books/ngrams/b…

Ben Schmidt / @benmschmidt@sigmoid.social @benmschmidt · 8 Dec

@Dorialexander Oh shit and there was a 2024 update too. I haven't seen a word about that.
