Thomas Sounack
@tsounack
AI/ML Engineer @ Dana-Farber Cancer Institute | Stanford alum
Joined May 2024
52 Following · 87 Followers
37 posts

Pinned Tweet
Thomas Sounack @tsounack
Very excited to share the release of BioClinical ModernBERT! Highlights:
- biggest and most diverse biomedical and clinical dataset for an encoder
- 8192 context
- fastest throughput with a variety of inputs
- SOTA results across several tasks
- base and large sizes
(1/8)
4 replies · 14 reposts · 66 likes · 16.7K views
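[Editor's note: for readers who want to try the released encoder, a minimal sketch using the transformers fill-mask pipeline. The Hugging Face model ID is an assumption, not confirmed in this thread; check the Hub for the actual repo name.]

```python
# Minimal sketch: querying the encoder through the fill-mask pipeline.
# The model ID below is an assumption, not confirmed in the thread.
from transformers import pipeline

fill = pipeline("fill-mask", model="thomas-sounack/BioClinical-ModernBERT-base")

# The tokenizer's [MASK] token marks the position to predict; the pipeline
# returns the top candidate tokens with their scores.
for pred in fill("The patient was started on [MASK] for hypertension."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```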
Thomas Sounack @tsounack
Another interesting finding was that simple fine-tuning allowed these small models to consistently return parsable JSON outputs. This may be worth exploring if you plan to use small LLMs for a structured output generation task.
0 replies · 0 reposts · 0 likes · 20 views
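[Editor's note: a minimal sketch of how one might measure that property, i.e. how often a model's raw outputs parse as JSON. `generate_batch` is a hypothetical stand-in for the fine-tuned model's generation call; nothing here is from the paper itself.]

```python
# Minimal sketch: measuring the fraction of model outputs that are valid JSON.
import json

def parsable_json_rate(outputs: list[str]) -> float:
    """Fraction of outputs that json.loads accepts without error."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs) if outputs else 0.0

# outputs = generate_batch(prompts)  # hypothetical fine-tuned model call
# print(f"parsable: {parsable_json_rate(outputs):.1%}")
```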
Thomas Sounack @tsounack
These small open-source LLMs can run on laptops (and even good smartphones), meaning that any institution can run them securely behind their firewall. This is significant since HIPAA-compliant LLM access is still rare for medical institutions.
1 reply · 0 reposts · 0 likes · 25 views
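[Editor's note: a minimal sketch of fully local inference in that setting, where no text leaves the machine. The thread does not name the exact models, so Qwen/Qwen2.5-0.5B-Instruct is used as a small-model stand-in; requires a recent transformers version with chat-format pipeline support.]

```python
# Minimal sketch: running a small open model entirely on-device, so no
# patient text ever leaves the machine. Model ID is a stand-in.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

messages = [{"role": "user", "content": "List the vitals in: BP 140/90, HR 72."}]
out = generator(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
```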
Thomas Sounack @tsounack
Our MedSlice paper was just accepted at @JAMIAOpen! We provide a pipeline to extract clinically relevant sections of medical notes (HPI, Interval Hx, Assessment and Plan) using fine-tuned language models.
1 reply · 1 repost · 3 likes · 46 views
Maziyar PANAHI @MaziyarPanahi
@tsounack Thanks for sharing. Have you done any evals on downstream tasks, especially medical token classification, to see the gain over the original model?
1 reply · 0 reposts · 0 likes · 78 views
Thomas Sounack @tsounack
Want to continue training an encoder on your own data, but not sure where to start? Our step-by-step guide for reproducing the BioClinical ModernBERT training was just released! 1/5
2 replies · 3 reposts · 14 likes · 2.4K views
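[Editor's note: the released guide and config files are the authoritative recipe; as a rough picture of what continued MLM pretraining looks like with the Hugging Face Trainer, here is a minimal sketch. The starting checkpoint, masking rate, hyperparameters, and the `notes.txt` data file are all placeholders.]

```python
# Minimal sketch of continued MLM pretraining with the HF Trainer.
# All hyperparameters and data wiring are placeholders; follow the
# released guide and config files for the real recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "answerdotai/ModernBERT-base"  # or a BioClinical variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Your own corpus, one document per line (placeholder file name).
ds = load_dataset("text", data_files={"train": "notes.txt"})["train"]
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
    remove_columns=["text"],
)

# Random token masking for the MLM objective (rate is a placeholder).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-mlm", num_train_epochs=1),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```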
Thomas Sounack @tsounack
If you are working with a lot of biomedical and/or clinical text, consider continuing MLM training of BioClinical ModernBERT on your own data! The resulting encoder will be much easier to fine-tune on your various downstream tasks (embedding model for RAG, classifier...) 4/5
1 reply · 0 reposts · 1 like · 152 views
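[Editor's note: for the downstream step described above, a minimal sketch of fine-tuning the continued-pretrained encoder as a text classifier. The checkpoint path and the `labels.csv` file (with `text` and `label` columns) are placeholders.]

```python
# Minimal sketch: fine-tuning the encoder on a downstream classification
# task. Checkpoint and data file are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "continued-mlm"  # your continued-pretrained encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

ds = load_dataset("csv", data_files={"train": "labels.csv"})["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3),
    train_dataset=ds,
    processing_class=tokenizer,  # enables dynamic padding per batch
)
trainer.train()
```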
Thomas Sounack @tsounack
Exciting to see BioClinical ModernBERT (base) ranked #2 among trending fill-mask models - right after BERT! The large version is currently at #4. Grateful for the interest, and can’t wait to see what projects people apply it to!
[image: trending fill-mask models ranking]
0 replies · 7 reposts · 12 likes · 942 views
Thomas Sounack @tsounack
The BioClinical ModernBERT GitHub repo is online! It contains:
- Our continued pretraining config files
- Performance eval code
- Inference speed eval code
Step-by-step guide on how to continue ModernBERT or BioClinical ModernBERT pretraining coming in the next few days!
1 reply · 3 reposts · 17 likes · 804 views
Thomas Sounack retweeted
Mike Dupont @introsp3ctor
codepen.io/jmikedupont2/p… colab.research.google.com/drive/1uSx8yYZ… next demo: visualizing BioClinical-ModernBERT-base embeddings on a sphere
3 replies · 1 repost · 6 likes · 474 views
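[Editor's note: a minimal sketch of what such a demo computes before plotting: mean-pool the encoder's last hidden state per sentence, then L2-normalize so every embedding lies on the unit sphere. The model ID is assumed.]

```python
# Minimal sketch: sentence embeddings projected onto the unit sphere.
# Model ID is an assumption, not confirmed in the thread.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "thomas-sounack/BioClinical-ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

texts = ["Chest pain radiating to the left arm.", "Fasting glucose of 180 mg/dL."]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state    # (batch, seq, dim)

mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
emb = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling
emb = torch.nn.functional.normalize(emb, dim=-1) # project onto unit sphere
print(emb.shape, emb.norm(dim=-1))               # norms are all 1.0
```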
Antoine Chaffin @antoine_chaffin
You can just continue pre-training things ✨ Happy to announce the release of BioClinical ModernBERT, a ModernBERT model whose pre-training has been continued on medical data. The result: SOTA performance on various medical tasks, with long context support and ModernBERT efficiency.
[image attached]
[quoting @tsounack's pinned BioClinical ModernBERT release announcement above]
4 replies · 33 reposts · 212 likes · 69.5K views
Jacques Sun @SunJacques_
@tsounack Nice work, Thomas! 👏 FYI the GitHub link seems to be broken. Could you verify the URL? Would love to explore the implementation details.
1 reply · 0 reposts · 1 like · 70 views