Scott Enderle

5.7K posts

@scottenderle

DH at Penn Libraries. Increasingly stealthy. He/him, opinions mine, everything's a bookmark.

Philadelphia, PA · Joined July 2009
499 Following · 614 Followers
Scott Enderle @scottenderle
You have a dimension reduction problem and two solutions. One is simpler mathematically, but harder to explain. The other is more complex mathematically, but easier to explain. They work equally well. Which do you go with?
Scott Enderle retweeted
Maria Antoniak @maria_antoniak
I've updated little-mallet-wrapper to output the MALLET diagnostics file (includes coherence) and the full word weight distributions for each topic. You can load the word weights and also compare pairs of topics using Jensen-Shannon divergence. github.com/maria-antoniak…
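For readers who want to see what comparing two topics with Jensen-Shannon divergence involves, here is a minimal standard-library sketch. It is not little-mallet-wrapper's API: the function names and the three toy word-weight vectors are hypothetical, and the weights are assumed to be aligned to one shared vocabulary.

```python
import math

def normalize(weights):
    """Scale non-negative word weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence in [0, 1] when measured in bits."""
    p, q = normalize(p), normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word weights for three topics over a 4-word vocabulary.
topic_a = [70, 20, 5, 5]
topic_b = [5, 5, 20, 70]
topic_c = [65, 25, 5, 5]

print(jensen_shannon_divergence(topic_a, topic_b))  # dissimilar topics
print(jensen_shannon_divergence(topic_a, topic_c))  # similar topics
```

A lower value means the two topics spread their weight over the vocabulary in a more similar way; identical distributions score 0.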
English
0
12
58
0
Scott Enderle @scottenderle
@srchvrs @IgorBrigadir @Nils_Reimers @clured @AlexReibman You can see this even in the way UMAP preserves geometric artifacts of the underlying model. I'm intrigued by the comet-like forms you see on the edges of the big blob. See also: twitter.com/scottenderle/s…
Scott Enderle @scottenderle

When you throw vectors of LDA topics haphazardly at UMAP and get these triangle looking things — is it somehow recovering the shape of the Dirichlet prior?
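As background to the question, a small sketch of why triangles are a plausible shape: topic-proportion vectors are non-negative and sum to 1, so with three topics they live on a triangle (the 2-simplex), and a sparse Dirichlet prior pushes them toward its corners. This does not run UMAP or LDA and does not settle the question; the helper name, the alpha values, and the 0.9 threshold are all illustrative assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def fraction_near_vertex(samples, threshold=0.9):
    """Share of samples where one topic holds more than `threshold` of the mass."""
    return sum(max(s) > threshold for s in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Small alpha: sparse documents dominated by a single topic.
sparse = [sample_dirichlet([0.1, 0.1, 0.1], rng) for _ in range(1000)]
# Large alpha: documents that mix all three topics fairly evenly.
dense = [sample_dirichlet([10.0, 10.0, 10.0], rng) for _ in range(1000)]

print(fraction_near_vertex(sparse))  # most points hug a corner of the triangle
print(fraction_near_vertex(dense))   # points cluster near the center instead
```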

Leo Boytsov @srchvrs
@IgorBrigadir @Nils_Reimers @clured @AlexReibman I would think a lot. UMAP reduces dimensionality, but if you don't get close things to be close in the original space routinely, you won't see them close in the projection (not as often as you would expect).
David McClure @clured
Playing with more temporally-keyed BERT embeddings of big corpora. Here are all 3.5M unique "story" titles posted to Hacker News from 2006-present, colored by year. Embedded with the `stsb-distilbert-base` model from SBERT (@Nils_Reimers), then UMAP to 2d.
Scott Enderle @scottenderle
@elotroalex @Ted_Underwood @matthewdlincoln @hathitrust Via HTRC data capsules, they can also immediately start offering the same features that PQ is offering for those physical materials. Only advantage PQ has is time to access, e.g. they can put yesterday's Time in a Jupyter notebook; libraries probably can't.
Scott Enderle @scottenderle
"As the black hole expanded along Spruce street, swallowing streetcars and Amazon delivery trucks whole, the Administrators realized the depth of their folly."
Scott Enderle @scottenderle
@murchgator @Ted_Underwood For some reason, the main detail that I remember is that at some point, the Stainless Steel Rat evades capture by spending a year hiding out in an automated fast food franchise
Scott Enderle @scottenderle
@quadrismegistus Ugh you know I think they might punt and rely on author bio data as a good enough proxy... Which, sigh.
Ryan Heuser / @heuser.bsky @quadrismegistus
@scottenderle Looks great! Does the metadata include original publication date? Drives me crazy that normal Gutenberg doesn't seem to store that crucial piece of info!
Scott Enderle @scottenderle
If you have not already discovered Gutenberg, dammit, have a look, it's great! Really excellent for students and anybody who wants to play around with Gutenberg texts in a low-bar-to-entry way. github.com/aparrish/guten…
Scott Enderle retweeted
sarah jeong @sarahjeong
something I didn't know until I went to law school(!!!!!!!) was that universal daycare was a popular — sometimes mainstream — feminist demand in the 1960s and 1970s. for all we talk about women empowerment, the arc of history, and so on, there was a giant leap back in the culture
Scott Enderle @scottenderle
This thread is a good reminder that stopword lists are a form of feature selection. But "stopword list creation" sounds way less important and serious and frowny than "feature selection," doesn't it?
Melanie Walsh @mellymeldubs

Here are some words that scikit-learn, the popular Python machine learning library, gets rid of by default (stopwords): - fire - cry - system - serious - empty - thick - thin - whole - describe - detail What the heck these are good words!
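To make the "stopword lists are feature selection" point concrete, here is a small sketch in plain Python. It does not call scikit-learn: the ten-word list below is copied from the quoted tweet as a stand-in for a real stopword list, and the tokenizer, function name, and example sentence are hypothetical.

```python
from collections import Counter

# Stand-in stopword list: the ten words named in the quoted tweet.
STOPWORDS = frozenset({
    "fire", "cry", "system", "serious", "empty",
    "thick", "thin", "whole", "describe", "detail",
})

def bag_of_words(text, stopwords=frozenset()):
    """Count whitespace-separated lowercase tokens, skipping stopwords."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stopwords)

doc = "the fire alarm system failed and the whole building was empty"

full = bag_of_words(doc)                 # every token is a feature
filtered = bag_of_words(doc, STOPWORDS)  # stopwords removed before counting

# Features that the stopword list silently removed from the model's view.
dropped = sorted(set(full) - set(filtered))
print(dropped)
```

Whatever lands in `dropped` never reaches a downstream model, which is exactly what a feature-selection step does.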

Scott Enderle retweeted
David McClure @clured
Playing with the C4 corpus from @ai2_allennlp. Here are 1M occurrences of the words "red" and "blue" (500k each), embedded via DistilBERT, where the words are [MASK]'ed in the input sequences, and then the mask embedding is sliced out of the top layer. Then UMAP to 2d.