Tom Jobbins
@TheBlokeAI
336 posts

My Hugging Face repos: https://t.co/yh7J4DFGTc Discord server: https://t.co/5h6rGsGfBx Patreon: https://t.co/yfQwFggGtx

UK · Joined July 2010
226 Following · 15.4K Followers
Nicole Zhu 👋@freelerobot·
Our holiday gift 🎁 to the open source AI community: 3 new models currently topping @huggingface LLM leaderboards! Run them on jan.ai (beta) 👀 We'll be launching a few more projects at @janhq_ this December, so stay tuned! 🙏 Credit: @TheBlokeAI @greennode23
👋 Jan@jandotai

👋 Meet Trinity, our experimental LLM that's #1 and #2 on the @huggingface OpenLLM Leaderboard. Trinity was created by merging LLMs with different strengths and weaknesses using SLERP. Here's how we did it: 🧵 Credit: @HaHoang411, @pokachi2023, @vuonghoainam

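The thread above names SLERP as the merge method but doesn't include code. As a rough illustration (not Jan's actual pipeline), spherical linear interpolation blends two weight tensors along the arc between them rather than the straight line, which preserves weight magnitudes better than plain averaging. A minimal sketch for one pair of tensors; real merge tools such as mergekit apply this layer by layer with tunable interpolation factors:

```python
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherically interpolate between two weight tensors at fraction t in [0, 1]."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    # Angle between the two weight vectors on the unit hypersphere.
    cos_omega = torch.clamp((a_f @ b_f) / (a_f.norm() * b_f.norm() + eps), -1.0, 1.0)
    omega = torch.arccos(cos_omega)
    if omega.abs() < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    merged = (torch.sin((1 - t) * omega) / so) * a_f + (torch.sin(t * omega) / so) * b_f
    return merged.reshape(a.shape).to(a.dtype)
```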
Nicolas Patry@narsilou·
This week was a good week:
- Speculation runs on TGI (Medusa, ngram). Up to 3x speedup for all LLMs. (@TheBlokeAI we should get more Medusa out.)
- Mixtral released on day 1: fastest way to run it (quantized, speculation out of the box).
- Metal flash integration in candle.
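For context on the ngram speculation mentioned above: the idea (often called prompt lookup decoding) is to draft tokens for free by matching the end of the current sequence against earlier text, then have the model verify the whole draft in one forward pass, keeping the longest agreeing prefix. A toy sketch of the drafting step, not TGI's implementation:

```python
def ngram_propose(tokens: list[int], n: int = 3, k: int = 8) -> list[int]:
    """Draft up to k tokens by matching the trailing n-gram earlier in the context."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan right to left so the most recent match wins.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no match: fall back to ordinary one-token decoding
```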
Tom Jobbins retweeted
emozilla@theemozilla·
FYI to anyone using @MistralAI's Mixtral for long context tasks -- you can get even better performance by disabling sliding window attention (setting it to your max context length) config.sliding_window = 32768
[image attached]
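In Transformers terms, emozilla's tip looks roughly like this (the Mixtral repo ID is the official one; whether it helps will depend on your workload):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-v0.1"
config = AutoConfig.from_pretrained(model_id)
# Widen the sliding window to the full context length,
# effectively disabling windowed attention up to 32k tokens.
config.sliding_window = 32768
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, device_map="auto")
```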
Tom Jobbins@TheBlokeAI·
Transformers now supports Mixtral GPTQs and I've updated my READMEs accordingly. It was awesome working with @_marcsun and @younesbelkada of @huggingface on this! Credit to LaaZa for coding the AutoGPTQ quant and inference implementation which enabled me to get GPTQs out fast!
Marc Sun@_marcsun

Announcing 4-bit Mixtral 8x7B on 🤗Transformers! Run the new Mistral MoE with minimal performance degradation on your local computer (24GB) 🔥 Stay tuned as more quants are coming soon using AWQ. We are also looking into sparsification with @Tim_Dettmers huggingface.co/TheBloke/Mixtr…

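Loading one of these GPTQ uploads looks roughly like the following (the repo ID is one of TheBloke's Mixtral GPTQ repos; requires optimum and auto-gptq alongside transformers):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The GPTQ quantization config is read from the repo automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Mixtral is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```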
Tom Jobbins@TheBlokeAI·
@MTrofficus You're much too kind - I've merely played a small part in pushing forward the wave. Remember that without the model creators, I'd have nothing to quantise! :) And without the model training code, they'd not be able to train. And so on. We're all doing our bit in our own ways 🚀
Miguel Otero Pedrido@moteropedrido·
Many say that the explosion of LLMs has been thanks to the Transformer architecture. ⚠️ You are terribly wrong ⚠️ It was thanks to @TheBlokeAI
Tom Jobbins retweeted
younes@yb2698·
Blazing fast text generation using AWQ and fused modules! 🚀 Up to 3x speedup compared to native fp16 that you can use right now on any models supported by @TheBlokeAI Simply pass an `AwqConfig` with `do_fuse=True` to the `from_pretrained` method! huggingface.co/docs/transform…
[GIF attached]
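The flag in question, as a minimal sketch (the repo ID is illustrative; `fuse_max_seq_len` must be set when fusing, and the value here is an assumption):

```python
from transformers import AutoModelForCausalLM, AwqConfig

quant_config = AwqConfig(do_fuse=True, fuse_max_seq_len=512)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",  # any AWQ repo should work the same way
    quantization_config=quant_config,
    device_map="auto",
)
```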
Tom Jobbins@TheBlokeAI·
It's been awesome to see Transformers getting support for more and more quantisation methods. And I've loved collaborating with @younesbelkada and @huggingface again! All my AWQ uploads now support Transformers. READMEs will update soon to show a Transformers Python example.
younes@yb2698

A few months ago, researchers from MIT-HAN Lab released AWQ. The method is now supported in the 🤗 transformers library! As simple as: 1. `pip install autoawq` (or install the llm-awq kernels), and 2. call `from_pretrained`. Great work from the MIT-HAN Lab folks, Casper Hansen & @TheBlokeAI 🧵

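The basic flow described in the quoted tweet, sketched out (repo ID illustrative):

```python
# pip install autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# AWQ is detected from the repo's quantization config; no extra arguments needed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```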
Tom Jobbins retweeted
Chirper@chirperai·
Have you heard about Chirper worlds? 👀🌐
Ryan Lazuka@lazukars

Chirper.ai just launched its revolutionary new software feature, "Worlds." This feature allows users to create their own virtual worlds and play god over AI-driven bots. To learn more, check out my podcast about "Worlds" here: youtu.be/yDAwmzUvcM8

Tom Jobbins retweeted
Victor M@victormustar·
🤔 Are you interested in a "Follow" feature on the Hugging Face Hub? ➡️ This will allow you to see new models/records/spaces from users you follow.
[image attached]
Tom Jobbins retweeted
Julien Chaumond@julien_c·
oh hello @TheBlokeAI I want to bookmark your 'Recent models' Collection on @huggingface 🔥 Well... you can now upvote Collections! and browse upvoted collections on your profile ❤️
[image attached]
Tom Jobbins@TheBlokeAI·
@natserran0 Glad you found the quantization useful. All credit for the quality of the model goes to its creators! And yes that model is still very popular after many months.
Tom Jobbins@TheBlokeAI·
Thanks again to @latitudesh for the loan of a beast 8xH100 server this week. I uploaded over 550 new repos, maybe my busiest week yet! Quanting is really resource intensive: it needs not only fast GPUs, but many CPUs, lots of disk, and a 🚀 network. A server that ✅ all of those is v. rare!
Tom Jobbins@TheBlokeAI·
@vanstriendaniel Aw shucks! BTW, are you involved with the Librarian Bot that sends PRs asking people to add base_model to YAML? If so, FYI: last week I updated my code so I now link to the source model (the model I quantised) using base_model - hope you can use this data somehow!
[image attached]
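For reference, `base_model` lives in the YAML front matter of a repo's README.md; on a quantised repo it points back at the source model, roughly like this (values illustrative):

```yaml
---
base_model: mistralai/Mistral-7B-v0.1   # the model this repo was quantised from
license: apache-2.0
---
```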
Tom Jobbins retweeted
Arena.ai@arena·
🔥Excited to introduce LMSYS-Chat-1M, a large-scale dataset of 1M real-world conversations with 25 cutting-edge LLMs! This dataset, collected from chat.lmsys.org, offers insights into user interactions with LLMs and intriguing use cases. Link: huggingface.co/datasets/lmsys…
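Loading it with 🤗 Datasets looks roughly like this (the dataset ID is assumed from the announcement; the dataset is gated, so accept its terms on the Hub and log in first):

```python
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
print(ds[0])  # one real-world conversation record
```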
Tom Jobbins retweeted
younes@yb2698·
New feature alert in the @huggingface ecosystem! Flash Attention 2 is natively supported in huggingface transformers, and works with training, PEFT, and quantization (GPTQ, QLoRA, LLM.int8). First pip install flash-attn, then pass use_flash_attention_2=True when loading the model!
[image attached]
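As announced, the flag at the time was `use_flash_attention_2` (newer transformers releases use `attn_implementation="flash_attention_2"` instead); FA2 needs fp16 or bf16 weights:

```python
# pip install flash-attn
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # illustrative model
    torch_dtype=torch.float16,     # Flash Attention 2 requires fp16/bf16
    use_flash_attention_2=True,
    device_map="auto",
)
```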
Tom Jobbins@TheBlokeAI·
@SebastianB929 @teknium @latitudesh No, I've not tried LMDeploy properly yet. I tried it briefly once, but I was getting terrible performance and didn't have time to investigate further. I know they claim a lot, but I've not been able to verify it myself yet.
Tom Jobbins@TheBlokeAI·
It's the AWQpocalypse! I've cranked the handle and AWQs are flooding HF. Why now? New library AutoAWQ provides turbo-charged Transformers-based inference, and vLLM now supports AWQ for multi-user inference serving. Making 8 at once on a beautiful 8xH100 server from @latitudesh
[two images attached]
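Serving one of those AWQ repos with vLLM looks roughly like this (repo ID illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="awq")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```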
Tom Jobbins@TheBlokeAI·
@teknium @latitudesh It can. Currently it doesn't scale quite as well as unquantised, so best performance is still fp16. But it does enable using smaller hardware, which could work out cheaper overall, and often has much easier availability.
Teknium (e/λ)@Teknium·
@TheBlokeAI @latitudesh Hmm, I'm aware vLLM has continuous batching capabilities, but TGI using 4-bit bnb can't do it (while obviously fp16 can), so I wasn't sure AWQ could, even as part of vLLM.
Tom Jobbins@TheBlokeAI·
@teknium @latitudesh vLLM is a continuous batching server, yes. AWQ is not faster than standalone ExLlama at batch size 1, but in a continuous batching scenario it would be - i.e. vLLM with AWQ will outperform TGI using GPTQ + the ExLlama kernel. But for max bsz=1 throughput, ExLlama still rules all.