Joseph

390 posts

@RealJosephus

How dare I teach robots how to learn.

Joined July 2017
9 Following · 2.9K Followers

Joseph @RealJosephus
@nickhistgeek then glm-image did the same, better, in 9B params.

Nicholas Hyperion @nickhistgeek
@RealJosephus Harsh but fair on the image quality. The real question is whether native multimodal under one autoregressive objective is worth the tradeoffs. That's the actual experiment here.

Joseph retweeted
Mayank Mishra @MayankMish98
We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). This bug is related to 2 main issues:
1. The init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has already been fixed: github.com/fla-org/flash-…).
2. Skipping initialization due to meta device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging).
The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma…
Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping in merging the PR.
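A minimal sketch of why a torch.ones-style dt_bias matters, assuming the inverse-softplus scheme commonly used for Mamba-style layers: sample the step size dt log-uniformly in [dt_min, dt_max], then store the bias whose softplus recovers dt. The range values (0.001, 0.1, floor 1e-4), the helper names, and the pure-Python form are assumptions for illustration, not code from either repository.

```python
import math
import random


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def inverse_softplus(y: float) -> float:
    # For y > 0: y + log(1 - exp(-y)), so that softplus(inverse_softplus(y)) == y.
    return y + math.log(-math.expm1(-y))


def sample_dt_bias(n_heads, dt_min=1e-3, dt_max=1e-1, dt_floor=1e-4, seed=0):
    # Hypothetical helper; the default ranges are assumed values.
    rng = random.Random(seed)
    biases = []
    for _ in range(n_heads):
        # Log-uniform sample of the step size dt in [dt_min, dt_max).
        log_dt = rng.random() * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
        dt = max(math.exp(log_dt), dt_floor)
        biases.append(inverse_softplus(dt))
    return biases


good_bias = sample_dt_bias(8)
bad_bias = [1.0] * 8  # what an all-ones init would give

# Effective step sizes after the softplus applied to the bias.
good_dt = [softplus(b) for b in good_bias]
bad_dt = [softplus(b) for b in bad_bias]

print("intended dt range:", min(good_dt), max(good_dt))
print("all-ones dt:", bad_dt[0])
```

Under these assumptions the intended step sizes stay between 0.001 and 0.1, while an all-ones bias yields a step size of about 1.31 for every head, more than ten times the top of that range.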

Joseph retweeted
Niels Rogge @NielsRogge
For people thinking that DeepSeek-OCR is the first model to render text as images: the University of Copenhagen already did this in 2023. The paper is called "Language Modelling with Pixels". They trained a Masked AutoEncoder (MAE) by rendering text as images and masking patches.

Joseph @RealJosephus
DeepSeek dropped the OCR model they trained last year. Against VL models, they highlight OCR; against OCR models, they highlight the lower token count from conv downsampling, despite using more params. Quite a scene watching people's reactions. The model is only better than the 0.9B Paddle model when it comes to math formulas.
Zephyr @zephyr_z9

Interesting Baidu has a better OCR than Whale

Joseph @RealJosephus
@teortaxesTex Gotta whisper this: when data costs are far greater than training costs, spectral norm is basically the worst you could do from a 2nd-moment view, no? You can't splash cash on data and then defend burning it.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
TM is quickly becoming the Western lab publishing what looks most like actual frontier research. One has to imagine from snide remarks that GDM/OAI/xAI are solving similar problems, by similar means.
Thinking Machines @thinkymachines

Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices. thinkingmachines.ai/blog/modular-m… We explore a fundamental understanding of the geometry of neural network optimization.

Joseph @RealJosephus
@teortaxesTex For celebrity identification, I'm pretty sure the model just learned the name tags. It knows which face goes with which name, but it doesn't actually know who the person is. Typical superficial Qwen work.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Qwen never beating the allegations. But I think it's both seen the test set and everything else. As I've said, Qwen has cracked the secret: pretraining on *all test sets, including future ones*, is all you need. Good procedural data goes a long way.

Joseph @RealJosephus
We're entering an age where upstream compute will eclipse pretraining itself. Most aren't prepared. Not to mention that some of the biggest labs can't even produce a snapshot of the "entire web" up to a recent date. That's definitely 2 versions out of date.

Joseph @RealJosephus
Honestly, it's sickening to see people with no linguistics background pontificating on the future of LLMs, or those with no neuroscience background holding forth on AI and AGI as if they're experts.

Lincoln 🇿🇦 @Presidentlin
The "CCP forced Whale to use Huawei Ascend chips" story is catnip. It explains everything. Why are they taking so long? Ascend chips. Why are they not publishing new research? Ascend chips. They are spending so much time getting the chips to work that they can barely focus on anything else. Perfect catnip.

Lincoln 🇿🇦 @Presidentlin
@RealJosephus Speculating, or did you see something? He is posting images, so it's not a VL model like I thought but an image model.

Joseph @RealJosephus
If it messes up the order, chances are it has conflicting information from a bad update (like new wiki data layered on old data). If it gets it right, its knowledge is current as of late 2024. If its answer is old (from before 2023), its knowledge probably cuts off in 2022.

Joseph @RealJosephus
Vietnam has gone through 4 presidents in 2024. That's a great trivia question. The model answered Nguyễn Xuân Phúc (04.2021 - 01.2023).

sway @SwayStar123
@RealJosephus Is the model hallucinating or are you?

Joseph @RealJosephus
For the next 5-10 years, we will be haunted by 2022...

Joseph @RealJosephus
@teortaxesTex Nay, a garbage 'audio head parallel decoder' with ~1T tokens wasted. See: huggingface.co/moonshotai/Kim… It might be a peculiar curse, but all the models with this architecture that I've observed have been poorly trained...

Joseph @RealJosephus
now it also applies to inference, tokenizers... Know your inference - vllm, *.cpp, ggml, hf? Everyone believes their inference implementation is correct. Verify it yourself. Otherwise, you're no different from someone who types `ollama run deepseek-r1`. x.com/RealJosephus/s…
Joseph @RealJosephus

This suggests that, in reality, NO 'serious' LLM training is actually centered around the Hugging Face ecosystem - many who claim to surpass Meta LLaMA3.1, don't even know how to train a model properly - script kiddies
