David Smith

2K posts

@dasmiq

Associate professor of Computer Science at Northeastern University, researching NLP, ML, IR, and digital humanities. https://t.co/rSvC7ikLhe

Joined October 2011
181 Following · 1.1K Followers
David Smith@dasmiq·
Now that the semester’s over I wanted to share the readings @giulia_taurino and I and the students in our seminar on Artificial Intelligence as an Archival Science put together. We had a great time; they wrote good papers; hopefully this will be useful. github.com/dasmiq/cs7180-…
Alexander Doria@Dorialexander·
@dasmiq Well it would be mostly fine tuning for now: I have a gpt-4 generated dataset for quoting texts with sources which is far from perfect. I’m not that convinced by controlled generation vs. simply rejecting generations since default vllm is already so well optimized
Alexander Doria@Dorialexander·
Small question: for text alignment/reuse detection with Python, is there any simple alternative to passim? Basically I only need to extract one short excerpt from a longer text, not billions of text reuses.
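For the narrow case described here (one short excerpt, one longer text), a minimal sketch using only Python's standard-library `difflib` might look like the following. This is not passim's algorithm and not an answer given in the thread; the function name `find_excerpt` and the sample strings are hypothetical, and the seed-and-extend heuristic assumes the excerpt shares at least one reasonably long exact substring with the text.

```python
from difflib import SequenceMatcher


def find_excerpt(excerpt, text):
    """Locate the span of `text` that best matches `excerpt`.

    Hypothetical helper: returns (start, end, score), or None if the
    two strings share no characters at all.
    """
    matcher = SequenceMatcher(None, excerpt, text, autojunk=False)
    # Seed: the longest exact substring shared by the excerpt and the text.
    seed = matcher.find_longest_match(0, len(excerpt), 0, len(text))
    if seed.size == 0:
        return None
    # Extend: project the excerpt's boundaries around the seed onto the text.
    offset = seed.b - seed.a
    start = max(0, offset)
    end = min(len(text), offset + len(excerpt))
    # Score the candidate span with difflib's similarity ratio (0.0 to 1.0).
    score = SequenceMatcher(None, excerpt, text[start:end], autojunk=False).ratio()
    return start, end, score


# Illustrative data only: a slightly misquoted excerpt and its source text.
text = "The quick brown fox jumps over the lazy dog near the river bank."
start, end, score = find_excerpt("fox jumped over the lazy dog", text)
print(text[start:end], round(score, 2))
```

Because the candidate span is fixed to the excerpt's length, this sketch tolerates substitutions better than long insertions or deletions; a low score can be treated as "no match".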
David Smith@dasmiq·
@Dorialexander Nice! I'd be interested to see what kind of classifier would work to guide inference with a large enough source corpus. Are you just expecting this to work contrastively in the beam? Anyway, I'll have some free time on Friday to sketch the passim solution.
Alexander Doria@Dorialexander·
@dasmiq Actually, if you have some pointers on how to proceed with Passim, I would be really interested. The idea is to get LLMs better at quoting texts (both through better fine-tuning data prep and through evaluation at inference), and I would really like to put this in an integrated pipeline.
David Smith retweeted
Rahul B (@rahulbot@vis.social)
Work with data in newsrooms, libraries, CSOs, museums, govt, or community? Excited to share I'm working on a book for *you* about creative data literacy and storytelling in pro-social settings. Tentatively titled "Community Data". Coming fall '24 from @OxUniPress 💡+🧑🏾‍💻=📗
Giovanni Colavizza@giovanni1085·
Personal update: I have started a new position as an Associate Professor of Computer Science at the University of Bologna, Department of Classical and Italian Philology. @Unibo @BoldhUnibo @UniboDHARC
David Smith@dasmiq·
@nyhabash I am so sorry, Nizar. Peace be with you and your family.
David Smith retweeted
David Smith@dasmiq·
Last but not least, in EMNLP Findings, Liwen Hou continues her brilliant line of work on diachronic syntax by investigating how we can probe language models trained on different time periods. khoury.northeastern.edu/home/dasmith/h…
David Smith retweeted
David Smith@dasmiq·
Next in CHR, @muther22 and Mathew Barber use language models to probe modern and mediaeval citation practices. A citation is a query in a noisy channel model that the author of a target text thinks might help you find the source. khoury.northeastern.edu/home/dasmith/m…
David Smith retweeted
kartik goyal@kartik_goyal_·
New CHR paper with an amazing set of collaborators: we find that high-recall bitext mining and sentence alignment is actually kinda tricky for messy historical literary text. Multilingual embeddings like LaBSE and friends work surprisingly well for literary ancient Greek though!
David Smith@dasmiq

Caroline Craig, @kartik_goyal_, @farnooshamsian, and @PhilologistGRC have a CHR paper on getting document-level sentence alignment to work for the ancient Greek and Latin corpus to track multiple translations into English, French, German, Persian, etc. khoury.northeastern.edu/home/dasmith/c…

David Smith@dasmiq·
A lot of past work on historical syntax involved treebanking text from different time periods. Instead, Liwen compares language models trained on different time periods on modern tagging and parsing tasks to detect language change.
David Smith@dasmiq·
I don't post over here much anymore, but I want to point out the great work of my talented coauthors in some recent papers on #NLProc, #DH, #HTR, and historical linguistics.