Brian Dezhou Shen🇨🇳🇬🇧

1.7K posts

@dezhou

Pythonist. Researcher. Data/Computer Scientist. C1@Oxford School of English, United Kingdom. CS Master's@Tsinghua University, China.

Joined March 2009

100 Following · 62 Followers

Pinned Tweet

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·27 May

#ChatGPT as an Analyst Motivation I want to ask ChatGPT to estimate the AI market value in 2023 providing the history data. Here is my question, and the generation output of ChatGPT is astonishing.

879

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·24 Dec

@calcsam keep it pro, nice topic

Sam Bhagwat@calcsam·22 Dec

last month we wrote a new agents book: patterns for building ai agents it has everything you need to take your agents from prototype to production, like agent design patterns, the basics of security, etc reply to this tweet with BOOK and we'll dm you so you can get a copy

4.1K

450

5.1K

589K

Brian Dezhou Shen🇨🇳🇬🇧 retweeted

Zhuang Liu@liuzhuang1234·14 Mar

New paper - Transformers, but without normalization layers (1/n)

577

4.1K

1.3M

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·27 Feb

@vllm_project Good

vLLM@vllm_project·26 Feb

👀 @vllm_project will be testing and integrating these GEMM kernels ASAP as well.

DeepSeek@deepseek_ai

🚀 Day 3 of #OpenSourceWeek: DeepGEMM Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference. ⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs ✅ No heavy dependency, as clean as a tutorial ✅ Fully Just-In-Time compiled ✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes ✅ Supports dense layout and two MoE layouts 🔗 GitHub: github.com/deepseek-ai/De…

251

22.7K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·12 Feb

@BenjaminDEKR good luck to you

Benjamin De Kraker@BenjaminDEKR·12 Feb

I resigned from xAI tonight. It makes me very sad, but was the right thing to do -- and here's why.

xAI told me I either had to delete the post quoted below, or face being fired. After reviewing everything and thinking a lot, I've decided that I'm not going to delete the post -- which is very clearly a harmless personal opinion.

Why did they tell me to remove this opinion? Well, according to them, the reason is that I acknowledged that Grok 3... exists. I wish I was joking. I'm not. That's the reason -- the fact that I wrote "Grok 3 (TBD)" is grounds for being fired.

But wait, hasn't Grok 3 been officially acknowledged by xAI? Yes. Yes it has. I'll post below the official xAI blog post talking about Grok 3, along with many public Elon posts and video where it is repeatedly acknowledged.

To be clear, the post they wanted me to remove is 100% just my personal opinion. I do not know where Grok 3 will stack up against other SOTA models. Hopefully it does well, I don't know. That's why it says "opinion" and "to be determined." It will probably be pretty good at some things and imperfect at others. I didn't think this was a particularly wild opinion.

Again, their official demand said that even writing "Grok 3 - TBD" is somehow "confidential information." This is absolutely absurd, since it's repeatedly been acknowledged by the company and its famous CEO.

Are they mad that my clearly-labeled opinion didn't guess that the still-unreleased Grok 3 will be higher? Maybe. Probably. Again, maybe it is at the top, I genuinely don't know. That's why it says "to be determined."

The specific feature of Grok I spent the majority of my time working on with a really hard-working team is very cool and I hope it works extremely well for everyone. I won't say what it is because that would be **actual** confidential information. (Maybe after it comes out.)

I still hope Elon and xAI win. Yet......

It's very disappointing to me that a company and leaders who supposedly champion free speech and openness would try to fire a low-level employee over a clearly-labeled opinion that contains absolutely nothing controversial, but here we are.

The entire situation has been very strange. I thought about just deleting the damn thing.... But you know, once you start caving and giving up holding mild personal opinions, the slope becomes very slippery.

I'll keep my speech and dignity and get another job, or build one. Catch ya on the flip side.

Benjamin De Kraker@BenjaminDEKR

The ranking currently (my opinion), for code:
ChatGPT o1-pro, o1, o3-mini (all kind of tied)
Grok 3 (expected, tbd)
Claude 3.5 Sonnet
DeepSeek
GPT-4o
Grok 2
Gemini 2.0 Pro Series (might be higher, will probably move up)

1.3K

21.7K

6.3M

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·28 Jan

@moyix Cannot use the official api?

145

Brendan Dolan-Gavitt@moyix·26 Jan

Who is serving DeepSeek R1 right now (aside from the official API)? It’s an open model so surely lots of other providers have sprung up?

4.3K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·12 Dec

@JustinLin610 coder is unique, while llm they use llama3.

Junyang Lin@JustinLin610·12 Dec

Qwen2.5-Coder is more popular than LLM? Amazing

Hassan@nutlope

After 350 votes, here are the top open source coding LLMs on CodeArena! Qwen 2.5 Coder 32B is #1 so far, beating out LLMs 10X its size. Followed closely by Llama 3.1 405B & Qwen 2.5 72B.

113

10.9K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·9 Dec

@iScienceLuvr and free reading.

918

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·9 Dec

A new tutorial on RL by Kevin Patrick Murphy, a Research Scientist at Google DeepMind who also wrote several comprehensive, well-regarded textbooks on ML/DL. This ought to be a good read 👀

266

2.3K

224K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·1 Dec

@WolframRvnwlf @Alibaba_Qwen you deserve more vram

Wolfram Ravenwolf@WolframRvnwlf·29 Nov

@dezhou @Alibaba_Qwen Since others are already conducting those evaluations, I focus on testing models that run effectively on my own system (48 GB VRAM). I care more about getting real, practical results than theoretical possibilities. It's what I can evaluate myself and most importantly run myself.

198

Wolfram Ravenwolf@WolframRvnwlf·28 Nov

Finished my @Alibaba_Qwen QwQ-32B-Preview benchmark (MMLU-Pro, CS category) just now – remember this is a 32B model at 8-bit EXL2 quantization that's overtaking Llama 405B and 70B, Mistral 123B, and even ChatGPT/GPT-4o in these tests!

452

117.7K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·29 Nov

@JustinLin610 Awq has an agenda?

Junyang Lin@JustinLin610·29 Nov

Cool cool! QwQ on HuggingChat!

Victor M@victormustar

Now available on HuggingChat: 🧑‍🚀 Qwen/QwQ-32B-Preview (full precision) hf.co/chat/models/Qw…

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·29 Nov

@moyix Finetuning

Brendan Dolan-Gavitt@moyix·28 Nov

So we have pre-training. And now, post-training. So when exactly does the training happen???

1.6K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·29 Nov

@giffmana bankers are idiots.🎃

Lucas Beyer (bl16)@giffmana·28 Nov

This is called banker's rounding and is the only correct way to round numbers. I've used it in all my papers too, even if it sometimes made my results look sliiiiightly worse and hurt :) (banker's rounding: round 5 to even)

yobibyte@y0b1byte

Another one for python lovers @giffmana @HeinrichKuttler
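
A minimal Python sketch of the rule Lucas Beyer describes above (ties go to the nearest even digit), for readers who have not met it before. The helper name `round_half` is hypothetical and only for illustration. Python's built-in `round()` and the `decimal` module's `ROUND_HALF_EVEN` mode both follow this rule, while `ROUND_HALF_UP` is the schoolbook alternative shown for contrast.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half(value: str, places: str, mode: str) -> Decimal:
    """Hypothetical helper: round a decimal string to the precision of `places`."""
    return Decimal(value).quantize(Decimal(places), rounding=mode)


# Built-in round() resolves exact ties toward the even neighbour.
builtin_ties = [round(x) for x in (0.5, 1.5, 2.5, 3.5)]
print(builtin_ties)  # [0, 2, 2, 4]

# Decimal avoids binary floating-point surprises when rounding reported metrics.
bankers = round_half("0.125", "0.01", ROUND_HALF_EVEN)
school = round_half("0.125", "0.01", ROUND_HALF_UP)
print(bankers, school)  # 0.12 0.13
```

Passing the value as a string matters here: a float literal such as `0.125` happens to be exact in binary, but most decimal fractions are not, so a tie on paper may not be a tie in floating point.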

16.7K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·27 Nov

@natolambert Someone invented, I guess.

Nathan Lambert@natolambert·27 Nov

Question — who came up with the term “post-training?” Emerged in the last 12-18 months but I don’t know where from, and I need to know. 🙇

119

26.5K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·27 Nov

@j_foerst @FLAIR_Ox Someone would do so, I believe.

238

Jakob Foerster@j_foerst·26 Nov

My group at Oxford (@FLAIR_Ox) is talent rich but GPU poor (both compared to industry), so adding more GPUs would be a win for open science, but is difficult to finance from grants. Does anyone have leads for possible donors? Christmas is coming up so I guess I am allowed to dream

572

75.9K

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·16 Nov

@JustinLin610 14b 32b needs more tuning, in my opinion.

Junyang Lin@JustinLin610·15 Nov

Yay now you guys can use the model on Jan!

👋 Jan@jandotai

Jan v0.5.8 is out: Jan supports Qwen2.5-Coder 14B & 32B through Cortex Highlights 🎉 - A new engine: Jan now runs models via @cortex_so - @Alibaba_Qwen's Coder 14B & 32B Support - Supports markdown rendering on user messages and various UI/UX enhancements 💫 Update your product or download the latest. jan.ai

Brian Dezhou Shen🇨🇳🇬🇧@dezhou·13 Nov

@victormustar 0.5b,1.5b,3b?

Victor M@victormustar·12 Nov

When going offline, what LLMs are your go-to choices? ✈️

2.2K

Brian Dezhou Shen🇨🇳🇬🇧 retweeted

Adina Yakup@AdinaYakup·12 Nov

Exciting release from @Alibaba_Qwen 🔥 Qwen 2.5-Coder is now live on @huggingface 👉huggingface.co/collections/Qw… ✨ Apache 2.0 license ✨ 0.5B, 1.5B, 3B, 7B, 14B, 32B base & instruct ✨ 128K long context support ✨ SOTA performance on coding benchmarks

3.6K

Rohan Paul@rohanpaul_ai·5 Nov

"Attention Is All You Need" paper was truly a landmark paper. However, the original "vanilla" transformers are seldom used now. The huge key upgrade is the use of RoPE, or Rotary Positional Embeddings.

**Vanilla Decoder** - Input tokens -> Embeddings -> Embeddings + Positional Encoding -> Decoder Blocks

**RoPE Decoder** - Input tokens -> Embeddings -> Decoder Blocks

**Rotary Positional Embeddings**

RoPE are used in attention blocks, which need to know token positions. Attention blocks combine information from a lot of tokens and need to know their relative positions.

For example, consider this sentence "It's a big thrill to climb a big mountain." "mountain" should focus more on the nearby "big."

RoPE applies a rotational matrix to queries and keys, not values. If "mountain" is the 9th word, it rotates fully, while earlier words rotate less, aligning "mountain" more with the second "big."

This approach is efficient as it applies positional embeddings only where needed and keeps token magnitudes unchanged. RoPE scales well to longer contexts, allowing models to be pre-trained on 4k contexts and fine-tuned for up to 4M by adjusting rotation speed.
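
A rough Python sketch of the rotation described above, using only the standard library. It is an illustration, not any particular model's implementation. The function names `rope_rotate` and `dot`, the frequency base of 10000, and the toy vectors are all assumptions made here. Each consecutive pair of dimensions is rotated by an angle proportional to the token position. This leaves the vector's length unchanged and makes the query-key dot product depend only on the distance between the two positions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of `vec` by position-dependent angles.

    Assumed convention: pair k uses angle position * base ** (-2k / d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # i = 2k, so the exponent is -2k/d
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (made up for illustration).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2 positions apart) at two different absolute positions.
score_near_start = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_later = dot(rope_rotate(q, 10), rope_rotate(k, 8))
print(abs(score_near_start - score_later) < 1e-9)  # True
```

Only queries and keys would be passed through `rope_rotate`. Values are left untouched, matching the description in the tweet. Lowering the rotation speed for longer contexts would correspond, in this sketch, to changing `base`.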