Tom Jobbins
@TheBlokeAI
336 posts

My Hugging Face repos: https://t.co/yh7J4DFGTc Discord server: https://t.co/5h6rGsGfBx Patreon: https://t.co/yfQwFggGtx

UK · Joined July 2010
226 Following · 15.4K Followers
Nicole Zhu 👋@freelerobot·
Our holiday gift 🎁 to the open source AI community: 3 new models currently topping @huggingface LLM leaderboards! Run them on jan.ai (beta) 👀 We'll be launching a few more projects at @janhq_ this December, so stay tuned! 🙏 Credit: @TheBlokeAI @greennode23
👋 Jan@jandotai

👋 Meet Trinity, our experimental LLM that's #1 and #2 on the @huggingface OpenLLM Leaderboard. Trinity was created by merging LLMs with different strengths and weaknesses using SLERP. Here's how we did it: 🧵 Credit: @HaHoang411, @pokachi2023, @vuonghoainam

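The thread above names SLERP as the merge method but doesn't include code. As a rough illustration (not Jan's actual pipeline), spherical linear interpolation blends two weight tensors along the arc between them rather than the straight line, which preserves weight magnitudes better than plain averaging. A minimal sketch for one pair of tensors; real merge tools such as mergekit apply this layer by layer with tunable interpolation factors:

```python
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherically interpolate between two weight tensors at fraction t in [0, 1]."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    # Angle between the two weight vectors on the unit hypersphere.
    cos_omega = torch.clamp((a_f @ b_f) / (a_f.norm() * b_f.norm() + eps), -1.0, 1.0)
    omega = torch.arccos(cos_omega)
    if omega.abs() < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    merged = (torch.sin((1 - t) * omega) / so) * a_f + (torch.sin(t * omega) / so) * b_f
    return merged.reshape(a.shape).to(a.dtype)
```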
Nicolas Patry@narsilou·
This week was a good week:
- Speculation runs on TGI (Medusa, ngram). Up to 3x speedup for all LLMs. (@TheBlokeAI we should get more Medusa out.)
- Mixtral released on day 1: fastest way to run it (quantized, speculation out of the box).
- Metal flash integration in candle.
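For context on the ngram speculation mentioned above: the idea (often called prompt lookup decoding) is to draft tokens for free by matching the end of the current sequence against earlier text, then have the model verify the whole draft in one forward pass, keeping the longest agreeing prefix. A toy sketch of the drafting step, not TGI's implementation:

```python
def ngram_propose(tokens: list[int], n: int = 3, k: int = 8) -> list[int]:
    """Draft up to k tokens by matching the trailing n-gram earlier in the context."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan right to left so the most recent match wins.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no match: fall back to ordinary one-token decoding
```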
Tom Jobbins retweeted
emozilla@theemozilla·
FYI to anyone using @MistralAI's Mixtral for long context tasks -- you can get even better performance by disabling sliding window attention (setting it to your max context length) config.sliding_window = 32768
[image attached]
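In Transformers terms, emozilla's tip looks roughly like this (the Mixtral repo ID is the official one; whether it helps will depend on your workload):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-v0.1"
config = AutoConfig.from_pretrained(model_id)
# Widen the sliding window to the full context length,
# effectively disabling windowed attention up to 32k tokens.
config.sliding_window = 32768
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, device_map="auto")
```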
Tom Jobbins@TheBlokeAI·
Transformers now supports Mixtral GPTQs and I've updated my READMEs accordingly. It was awesome working with @_marcsun and @younesbelkada of @huggingface on this! Credit to LaaZa for coding the AutoGPTQ quant and inference implementation which enabled me to get GPTQs out fast!
Marc Sun@_marcsun

Announcing 4-bit Mixtral 8x7B on 🤗Transformers! Run the new Mistral MoE with minimal performance degradation on your local computer (24GB) 🔥 Stay tuned as more quants are coming soon using AWQ. We are also looking into sparsification with @Tim_Dettmers huggingface.co/TheBloke/Mixtr…

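Loading one of these GPTQ uploads looks roughly like the following (the repo ID is one of TheBloke's Mixtral GPTQ repos; requires optimum and auto-gptq alongside transformers):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The GPTQ quantization config is read from the repo automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Mixtral is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```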
Tom Jobbins@TheBlokeAI·
@MTrofficus You're much too kind - I've merely played a small part in pushing forward the wave. Remember that without the model creators, I'd have nothing to quantise! :) And without the model training code, they'd not be able to train. And so on. We're all doing our bit in our own ways 🚀
Miguel Otero Pedrido@moteropedrido·
Many say that the explosion of LLMs has been thanks to the Transformer architecture. ⚠️ You are terribly wrong ⚠️ It was thanks to @TheBlokeAI
Tom Jobbins retweeted
younes@yb2698·
Blazing fast text generation using AWQ and fused modules! 🚀 Up to 3x speedup compared to native fp16 that you can use right now on any models supported by @TheBlokeAI Simply pass an `AwqConfig` with `do_fuse=True` to the `from_pretrained` method! huggingface.co/docs/transform…
[GIF attached]
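The flag in question, as a minimal sketch (the repo ID is illustrative; `fuse_max_seq_len` must be set when fusing, and the value here is an assumption):

```python
from transformers import AutoModelForCausalLM, AwqConfig

quant_config = AwqConfig(do_fuse=True, fuse_max_seq_len=512)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",  # any AWQ repo should work the same way
    quantization_config=quant_config,
    device_map="auto",
)
```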
Tom Jobbins@TheBlokeAI·
It's been awesome to see Transformers getting support for more and more quantisation methods. And I've loved collaborating with @younesbelkada and @huggingface again! All my AWQ uploads now support Transformers. READMEs will update soon to show a Transformers Python example.
younes@yb2698

A few months ago, researchers from MIT-HAN Lab released AWQ. The method is now supported in the 🤗 transformers library! As simple as: 1. `pip install autoawq` (or install the llm-awq kernels), and 2. call `from_pretrained`. Great work from the MIT-HAN Lab folks, Casper Hansen & @TheBlokeAI 🧵

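The basic flow described in the quoted tweet, sketched out (repo ID illustrative):

```python
# pip install autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# AWQ is detected from the repo's quantization config; no extra arguments needed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```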
Tom Jobbins retweeted
Chirper@chirperai·
Have you heard about Chirper worlds? 👀🌐
Ryan Lazuka@lazukars

Chirper.ai just launched its revolutionary new software feature, "Worlds." This feature allows users to create their own virtual worlds and play god over AI-driven bots. To learn more, check out my podcast about "Worlds" here: youtu.be/yDAwmzUvcM8

Tom Jobbins retweeted
Victor M@victormustar·
🤔 Are you interested in a "Follow" feature on the Hugging Face Hub? ➡️ This will allow you to see new models/records/spaces from users you follow.
[image attached]
Tom Jobbins retweeted
Julien Chaumond@julien_c·
oh hello @TheBlokeAI I want to bookmark your 'Recent models' Collection on @huggingface 🔥 Well... you can now upvote Collections! and browse upvoted collections on your profile ❤️
[image attached]
Tom Jobbins@TheBlokeAI·
@natserran0 Glad you found the quantization useful. All credit for the quality of the model goes to its creators! And yes that model is still very popular after many months.
Tom Jobbins@TheBlokeAI·
Thanks again to @latitudesh for the loan of a beast 8xH100 server this week. I uploaded over 550 new repos, maybe my busiest week yet! Quanting is really resource intensive: it needs not only fast GPUs, but many CPUs, lots of disk, and a 🚀 network. A server that ✅ all of those is v. rare!
Tom Jobbins@TheBlokeAI·
@vanstriendaniel Aw shucks! BTW, are you involved with the Librarian Bot that sends PRs asking people to add base_model to YAML? If so, FYI: last week I updated my code so I now link to the source model (the model I quantised) using base_model - hope you can use this data somehow!
[image attached]
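For reference, `base_model` lives in the YAML front matter of a repo's README.md; on a quantised repo it points back at the source model, roughly like this (values illustrative):

```yaml
---
base_model: mistralai/Mistral-7B-v0.1   # the model this repo was quantised from
license: apache-2.0
---
```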
Tom Jobbins retweeted
Arena.ai@arena·
🔥Excited to introduce LMSYS-Chat-1M, a large-scale dataset of 1M real-world conversations with 25 cutting-edge LLMs! This dataset, collected from chat.lmsys.org, offers insights into user interactions with LLMs and intriguing use cases. Link: huggingface.co/datasets/lmsys…
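Loading it with 🤗 Datasets looks roughly like this (the dataset ID is assumed from the announcement; the dataset is gated, so accept its terms on the Hub and log in first):

```python
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
print(ds[0])  # one real-world conversation record
```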
Tom Jobbins retweeted
younes@yb2698·
New feature alert in the @huggingface ecosystem! Flash Attention 2 is natively supported in huggingface transformers, and works with training, PEFT, and quantization (GPTQ, QLoRA, LLM.int8). First pip install flash-attn, then pass use_flash_attention_2=True when loading the model!
[image attached]
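As announced, the flag at the time was `use_flash_attention_2` (newer transformers releases use `attn_implementation="flash_attention_2"` instead); FA2 needs fp16 or bf16 weights:

```python
# pip install flash-attn
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # illustrative model
    torch_dtype=torch.float16,     # Flash Attention 2 requires fp16/bf16
    use_flash_attention_2=True,
    device_map="auto",
)
```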
Tom Jobbins@TheBlokeAI·
@SebastianB929 @teknium @latitudesh No, I've not tried LMDeploy properly yet. I tried it briefly once, but I was getting terrible performance and didn't have time to investigate further. I know they claim a lot, but I've not been able to verify it myself yet.
Tom Jobbins@TheBlokeAI·
It's the AWQpocalypse! I've cranked the handle and AWQs are flooding HF. Why now? New library AutoAWQ provides turbo-charged Transformers-based inference, and vLLM now supports AWQ for multi-user inference serving. Making 8 at once on a beautiful 8xH100 server from @latitudesh
[two images attached]
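Serving one of those AWQ repos with vLLM looks roughly like this (repo ID illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="awq")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```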
Tom Jobbins@TheBlokeAI·
@teknium @latitudesh It can. Currently it doesn't scale quite as well as unquantised, so best performance is still fp16. But it does enable using smaller hardware, which could work out cheaper overall, and often has much easier availability.
Teknium (e/λ)@Teknium·
@TheBlokeAI @latitudesh Hmm, I'm aware vLLM has continuous batching capabilities, but TGI using 4-bit bnb can't do it (while obviously fp16 can), so I wasn't sure AWQ could, even as part of vLLM.
Tom Jobbins@TheBlokeAI·
@teknium @latitudesh vLLM is a continuous batching server, yes. AWQ is not faster than standalone ExLlama at batch size 1, but in a continuous batching scenario it would be - i.e. vLLM with AWQ will outperform TGI using GPTQ + the ExLlama kernel. But for max bsz=1 throughput, ExLlama still rules all.