CKtalon

80 posts

@CKtalon

Singapore · Joined September 2008

143 Following · 61 Followers

CKtalon @CKtalon · Jan 7

@abacaj That’s if they even release a GPU

anton @abacaj · Jan 7

why can't AMD do that???

anton @abacaj · Jan 7

idk how he does it, every time he releases a new GPU, I want to buy it

CKtalon @CKtalon · May 14

@RealJosephus @suchenzang Seems like the corpus used to train the tokenizer isn’t as clean as the corpus used to train the LLM

Joseph @RealJosephus · May 14

@suchenzang <114900> -> “最新高清无码” = “中国特色社会主义”

Susan Zhang @suchenzang · May 14

this new "o200k_base" vocab for gpt-4o makes me want to clutch my pearls

CKtalon @CKtalon · May 14

@drummatick @laurensweitkamp @suchenzang With 200k vocabulary, it’s entirely possible to have many full words

Saurabh Kumar @drummatick · May 14

@laurensweitkamp @suchenzang Which tokenizer? The whole idea behind sub-wording was to let us learn the representation better

CKtalon @CKtalon · Apr 19

@SashaMTL Just stating facts. BLOOM has 1131 citations despite being released in 2022, while Llama2 has 3855 despite being released 8 months later. BLOOM was just severely undertrained given the limited compute they had, with way too much ambition to cover so many languages.

Sasha Luccioni, PhD 🦋🌎✨🤗 @SashaMTL · Apr 19

@CKtalon Time will tell! But also, not very nice of you.

Sasha Luccioni, PhD 🦋🌎✨🤗 @SashaMTL · Apr 18

So LLaMa 3's carbon footprint is... huge? 🤯 They estimate it to be 2,290 tons of CO2eq, compared to 550t for training GPT-3 and 66t for training *all* of the BLOOM models (1B-176B) 🌬️
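
To put the quoted figures side by side, here is a quick calculation. It only divides the three numbers given in the tweet (in tons of CO2eq) and makes no claim about how those estimates were produced; the dictionary keys are labels chosen for this sketch.

```python
# Figures as quoted in the tweet, in tons of CO2eq.
footprints = {"LLaMa 3": 2290, "GPT-3": 550, "BLOOM (all models)": 66}

# Express the LLaMa 3 estimate as a multiple of each quoted figure.
baseline = footprints["LLaMa 3"]
ratios = {name: baseline / value for name, value in footprints.items()}

for name, ratio in ratios.items():
    print(f"LLaMa 3 / {name}: {ratio:.1f}x")
```

On these numbers, the LLaMa 3 estimate is roughly 4 times the GPT-3 figure and roughly 35 times the combined BLOOM figure.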

CKtalon @CKtalon · Jan 29

@dctanner @mov_axbx That’s a really expensive server on eBay considering its age and specs. It seems any used rack that can hold more than 4 GPUs is highly inflated in price now.

Damien C. Tanner @dctanner · Jan 29

@CKtalon @mov_axbx SuperServer 4029GP-TRT2

Nathan Odle @mov_axbx · Jan 28

Waiting on a GPU and an electrician but this thing is about ready. I’ve gotten a lot of questions about this AMD Epyc 7x4090 build, thinking of doing a Spaces on parts selection, etc if you guys think you’d be interested

CKtalon @CKtalon · Jan 29

@dctanner @mov_axbx What server rack model is that?

Damien C. Tanner @dctanner · Jan 28

@mov_axbx Lovely rig. This week I managed to find 8x slim 3090 turbos. This let me squeeze them into a super server. Was planning to go open rig if I hadn’t found the turbos.

CKtalon @CKtalon · Jan 26

@realmrfakename @arpagon Not possible

Sebastian Rojo @arpagon · Jan 25

After 20 years of Linux loyalty, here I am, tempted by Apple's MLX for local inference:
- ($3,199) MacBook Pro M3 Max
- ($2,500) 2x RTX 3090
Thanks, Apple MLX, for my existential tech crisis. 🙃

CKtalon @CKtalon · Jan 10

@NVIDIAGeForce #RTXSUPER

NVIDIA GeForce @NVIDIAGeForce · Jan 9

We’re giving you TWO ways to WIN a one-of-a-kind GeForce RTX 4080 SUPER signed by NVIDIA CEO, and founder, Jensen Huang 👀 If you’re at CES head to our partner booths to enter 👉 nvidia.com/en-us/geforce/… Want to WIN here on social? ⚫Comment #RTXSUPER ⚫Like this post

CKtalon @CKtalon · Jan 6

@Yampeleg @abacaj I did similar training runs, and from some manual evaluations, the loss might have plateaued for hundreds of thousands of steps, but the quality of the generations is better given more epochs.

Yam Peleg @Yampeleg · Jan 5

@abacaj I have a 100M translation model I trained for 2+ months on ~18B tokens in an infinite loop of epochs. It got stuck. Full stop, the loss doesn't move. (No matter what trick I tried: batch_size, grad noise/clip, lr, w_decay..) There is a HARD limit to finite params..

anton @abacaj · Jan 5

The tinyllama model clearly shows that small models are actually *saturating* in terms of performance... it's 1.55% better than OPT? Tinyllama was trained on a whopping 16x more tokens...

CKtalon @CKtalon · Nov 16

@charlieholtz @elevenlabs In the not-so-distant future, pairing this with the Meta Ray-Bans and having it narrate whatever you see will be mind-blowing.

Charlie Holtz @charlieholtz · Nov 15

David Attenborough is now narrating my life. Here's a GPT-4-vision + @elevenlabs Python script so you can star in your own Planet Earth:

CKtalon @CKtalon · Nov 6

@Suhail Helps when a ton of data is distilled from a powerful LLM? Phi-1.5 kind of shows that generated data can produce a powerful model.

Suhail @Suhail · Nov 5

It’s interesting that it only takes 4 mo now to train an LLM to GPT 3.5/Llama 2 from scratch. Prior to Jan this year, nobody had practically replicated GPT-3 still. It doesn’t seem like the lead of GPT-4 will last too much longer.

CKtalon @CKtalon · Oct 23

@BramVanroy OpenNMT does have most of those implemented, since it now also supports LLMs. Marian looks dead, perhaps because MSFT has deprioritized it in favor of LLMs.

CKtalon @CKtalon · Aug 13

@Yampeleg Just the preview shows how dirty the dataset is…

Yam Peleg @Yampeleg · Aug 12

The most powerful open source instructions dataset: Flan. 378 million samples (~300GB). [1]

- Link: huggingface.co/datasets/Open-…

Why should you care? 🤔
- Flan is an incredibly powerful dataset [2], and some famous models trained on it (FlanT5, UL2..) hold the top positions on various leaderboards to this day.
- The main reason for it is the quality and diversity of the data.
- It is huge: Ever wondered "What would happen if we just merged all instructions datasets together into a single huge one?" This is basically the motivation behind the Flan dataset.
- It is balanced (!!), which helps models trained on it generalize better to arbitrary tasks down the line.

Flexibility:
- Zero-Shot vs Few-Shot: For many of the tasks you can fetch the same task either as Zero-Shot (no solved examples for demonstration) or as Few-Shot.
- Chain of thought is built in on some of the tasks.

The "next step": A small part of Flan had been augmented with additional explanations in the past. The result of this was the first model ever to rival ChatGPT on vicuna's benchmark. And again, this was just a small part of Flan.

----
[1] ai.googleblog.com/2023/02/the-fl…
[2] arxiv.org/pdf/2301.13688… (* This paper is a must if you are building text datasets)

CKtalon @CKtalon · Jul 23

@Science_boy_H @huggingface I suspect QLoRA or LoRA doesn’t help with adding or improving a model’s second-language capabilities

FP32 Monastic @kv_cached · Jul 21

I just finetuned Llama2 on Arabic dataset using Qlora and sfttrainer GitHub link : github.com/h9-tect/llama2… @huggingface link : huggingface.co/HeshamHaroon/l…

CKtalon @CKtalon · Jun 24

@DanielSMatthews @nearcyan Text isn’t one to one. More of a translated summary

𝑫𝒂𝒏𝒊𝒆𝒍 𝑺𝒄𝒐𝒕𝒕 𝑴𝒂𝒕𝒕𝒉𝒆𝒘𝒔 🇦🇺 @DanielSMatthews · Jun 24

@nearcyan So if you fed the original and the translation into an AI training session you'd have an AI that did a good job of translating Chinese original science to English?

near @nearcyan · Jun 24

arxiv papers are translated to chinese and posted in china immediately after publication, but there's very little information flow in the opposite direction

CKtalon @CKtalon · Jun 22

@_BruceX_ @ID_AA_Carmack Time is money

Bruce X @_BruceX_ · Jun 21

@ID_AA_Carmack But this is twice the FLOPs for three times the cost, right?
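
The trade-off behind this question can be worked through with a quick calculation. This is a sketch, assuming the 2x throughput and 3x price ratios from the tweet hold exactly, that "cost" means price per unit of GPU time, and that runtime scales inversely with throughput (the jagged timing graph in the parent tweet suggests it will not do so exactly). The function name `compare` is made up for this example.

```python
def compare(speedup, price_ratio):
    """Relative runtime and cost of a fixed job on the faster, pricier GPU."""
    runtime = 1.0 / speedup        # fraction of the baseline wall-clock time
    cost = price_ratio * runtime   # fraction of the baseline cost for the same work
    return runtime, cost


runtime, cost = compare(speedup=2.0, price_ratio=3.0)
print(f"runtime: {runtime:.2f}x, cost for the same work: {cost:.2f}x")
```

Under these assumptions the same job costs 1.5 times as much but finishes in half the time, which is the trade the "Time is money" reply is pointing at.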

John Carmack @ID_AA_Carmack · Jun 21

H100 GPUs are very fast! For those unfamiliar with GPU matrix multiplies, the jaggies in the graph relate to packing occupancy, and are not noise. You can’t just divide theoretical teraflops by your problem size and get accurate times.

CKtalon @CKtalon · Jun 13

@e270889o @ID_AA_Carmack Plenty of ram, but slow compute-wise. Apple’s CoreML is too opaque to developers, so the Neural Engine hasn’t been usable in an obvious way yet.

John Carmack @ID_AA_Carmack · Jun 12

I’m a little surprised there isn’t more excitement around Nvidia’s 256 GPU unified memory NVLink clusters, but I have heard from a couple places that it is considered constraining. I wonder if the optics are that your cluster isn’t serious with less than a four figure GPU count.

CKtalon @CKtalon · Jun 13

@decryption @coxymla Follow this: applegamingwiki.com/wiki/Game_Port…

CKtalon @CKtalon · Jun 1

@tmophoto @abacaj You split the layers across different cards. That’s why you need fast interconnects like NVLink so that the GPUs can process the computations quickly without bottlenecks.
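
A minimal sketch of the layer-splitting idea, assuming a hypothetical model of 80 equally sized layers at 1.75 GB each and cards with 24 GB of VRAM, 2 GB of which is held back for activations. The function name `split_layers` and every number here are illustrative; real frameworks use their own placement logic.

```python
def split_layers(layer_sizes_gb, vram_gb, reserve_gb=2.0):
    """Greedily assign contiguous layers to GPUs without exceeding usable VRAM.

    Returns one (start, end) layer index range per GPU, with end exclusive.
    """
    usable = vram_gb - reserve_gb
    ranges = []
    start = 0
    used = 0.0
    for i, size in enumerate(layer_sizes_gb):
        if size > usable:
            raise ValueError(f"layer {i} ({size} GB) does not fit on one card")
        if used + size > usable:
            # Current card is full: close its range and start a new card.
            ranges.append((start, i))
            start = i
            used = 0.0
        used += size
    ranges.append((start, len(layer_sizes_gb)))
    return ranges


layers = [1.75] * 80  # hypothetical per-layer weight sizes in GB
placement = split_layers(layers, vram_gb=24.0)
for gpu, (lo, hi) in enumerate(placement):
    print(f"GPU {gpu}: layers {lo}-{hi - 1}")
```

With a sequential split like this one, activations have to be handed from one card to the next at every range boundary, which is where the interconnect speed mentioned in the reply comes in.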

tmo @tmophoto · May 31

@CKtalon @abacaj How do you load models larger than the VRAM on the card? That is the biggest limiting factor right now. Consumer GPUs top out at 24 GB and we need 48-80+ GB to load a model

anton @abacaj · May 30

Stack up on GPUs
