Scott Enderle
5.7K posts

@scottenderle
DH at Penn Libraries. Increasingly stealthy. He/him, opinions mine, everything's a bookmark.
Philadelphia, PA · Joined July 2009
499 Following · 614 Followers
Scott Enderle Retweeted

I've updated little-mallet-wrapper to output the MALLET diagnostics file (includes coherence) and the full word weight distributions for each topic. You can load the word weights and also compare pairs of topics using Jensen-Shannon divergence.
github.com/maria-antoniak…
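For readers unfamiliar with the comparison mentioned above, here is a minimal sketch of Jensen-Shannon divergence between two topics' word-weight distributions. It is not little-mallet-wrapper's code: the function names and the toy four-word vocabulary are assumptions made for illustration, and the weights are assumed to be normalized to sum to 1.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence: average KL of p and q against their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word weights over a shared four-word vocabulary (each sums to 1).
topic_a = [0.5, 0.3, 0.2, 0.0]
topic_b = [0.0, 0.2, 0.3, 0.5]

print(jensen_shannon_divergence(topic_a, topic_a))  # identical topics
print(jensen_shannon_divergence(topic_a, topic_b))  # partially overlapping topics
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared vocabulary mass), which makes pairs of topics easy to rank by similarity.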

@srchvrs @IgorBrigadir @Nils_Reimers @clured @AlexReibman You can see this even in the way UMAP preserves geometric artifacts of the underlying model. I'm intrigued by the comet-like forms you see on the edges of the big blob. See also: twitter.com/scottenderle/s…
Scott Enderle @scottenderle
When you throw vectors of LDA topics haphazardly at UMAP and get these triangle looking things — is it somehow recovering the shape of the Dirichlet prior?

@IgorBrigadir @Nils_Reimers @clured @AlexReibman I would think a lot. UMAP reduces dimensionality, but if you don't get close things to be close in the original space routinely, you won't see them close in the projection (not as often as you would expect).

Playing with more temporally-keyed BERT embeddings of big corpora. Here are all 3.5M unique "story" titles posted to Hacker News from 2006-present, colored by year. Embedded with the `stsb-distilbert-base` model from SBERT (@Nils_Reimers), then UMAP to 2d.


@elotroalex @Ted_Underwood @matthewdlincoln @hathitrust And libraries may not ever be able to put a digital-only newspaper in a data capsule! That's where this line of thinking gets more worrying.

@elotroalex @Ted_Underwood @matthewdlincoln @hathitrust Via HTRC data capsules, they can also immediately start offering the same features that PQ is offering for those physical materials. Only advantage PQ has is time to access, e.g. they can put yesterday's Time in a Jupyter notebook; libraries probably can't.

Huh, more Fourier transforms. Overlaps in interesting ways with our HathiTrust ACS project. syncedreview.com/2021/05/14/dee…
wiki.htrc.illinois.edu/display/COM/Se…

@murchgator @Ted_Underwood For some reason, the main detail that I remember is that at some point, the Stainless Steel Rat evades capture by spending a year hiding out in an automated fast food franchise

@murchgator @Ted_Underwood Yeah, the Stainless Steel Rat regularly skittered through my larval-stage sci fi reading material

@apjanco @Ted_Underwood Oh, this one is still too advanced for me!

@quadrismegistus Ugh you know I think they might punt and rely on author bio data as a good enough proxy... Which, sigh.

@scottenderle Looks great! Does the metadata include original publication date? Drives me crazy that normal Gutenberg doesn't seem to store that crucial piece of info!

If you have not already discovered Gutenberg, dammit, have a look, it's great! Really excellent for students and anybody who wants to play around with Gutenberg texts in a low-bar-to-entry way. github.com/aparrish/guten…
Scott Enderle Retweeted

This thread is a good reminder that stopword lists are a form of feature selection. But "stopword list creation" sounds way less important and serious and frowny than "feature selection," doesn't it?
Melanie Walsh @mellymeldubs
Here are some words that scikit-learn, the popular Python machine learning library, gets rid of by default (stopwords): - fire - cry - system - serious - empty - thick - thin - whole - describe - detail What the heck these are good words!
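To make the "stopword lists are feature selection" point concrete, here is a small sketch in plain Python. The stop list below is only the ten words quoted in the tweet, used as a stand-in; it is not scikit-learn's actual list, and the `bag_of_words` helper is a hypothetical name for illustration.

```python
# Stand-in stop list: only the ten words quoted in the tweet above,
# NOT the full list that scikit-learn ships.
STOPWORDS = frozenset({
    "fire", "cry", "system", "serious", "empty",
    "thick", "thin", "whole", "describe", "detail",
})

def bag_of_words(text, stopwords=frozenset()):
    """Return the sorted set of lowercase tokens, minus any stopwords."""
    tokens = set(text.lower().split())
    return sorted(tokens - stopwords)

sentence = "a serious fire spread through the whole system"

# Without a stop list, every token is a candidate feature.
print(bag_of_words(sentence))
# With the stop list, the most content-bearing words are silently dropped.
print(bag_of_words(sentence, STOPWORDS))
```

Any model trained on the second representation can never learn anything from "fire" or "serious", which is exactly the kind of decision that the term "feature selection" usually describes.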
Scott Enderle Retweeted

Playing with the C4 corpus from @ai2_allennlp. Here are 1M occurrences of the words "red" and "blue" (500k each), embedded via DistilBERT, where the words are [MASK]'ed in the input sequences, and then the mask embedding is sliced out of the top layer. Then UMAP to 2d.

