Haz Sameen Shahgir
@sameen2080
63 posts
PhD Student @UCRiverside, intern @Amazon, undergrad @BUET. FromSoft enjoyer.
Joined July 2023
137 Following · 29 Followers
will brown @willccbb
i am no longer “that one morgan stanley guy who posts fun open-source grpo experiments”. there are more of us
[image attached]
11 replies · 9 reposts · 403 likes · 33.6K views
Omar Sanseviero @osanseviero
Which are your top 5 ML dramas? 🍿
1. Llama and Zetta llama drama
2. What did Ilya see?
3. StabilityAI take-down of Runway Stable Diffusion
4. Hugging Face removal of GPT-4chan
5. Schmidhubering
49 replies · 10 reposts · 252 likes · 30.9K views
dr. jack morris @jxmnop
another incredible thing about deepseek: all the american AI labs compete to hire the top PhD researchers - but deepseek didn’t compete. deepseek researchers aren’t top PhDs; most are not even PhDs
258 replies · 286 reposts · 6K likes · 807.9K views
Haz Sameen Shahgir @sameen2080
@teortaxesTex @ericjang11 "After repeatedly changing his degree between different subjects like natural sciences, history of art, and philosophy, he eventually graduated with a BA degree in experimental psychology in 1970" - Wikipedia Checks out.
0 replies · 0 reposts · 2 likes · 75 views
Eric Jang @ericjang11
The opening sentence goes so hard. This paper was 10 years ahead of its time.
[2 images attached]
36 replies · 365 reposts · 5.1K likes · 314.7K views
Daniel Han @danielhanchen
@andrew_n_carr Coincidentally, I literally did an entire final-year uni project on this :) Also: why not the inverse, and why QR? Or why divide-and-conquer SVD is faster. Or, if the matrix has < 2000 cols, use Cholesky via POTRF and SSYRK, or do column pivoting, etc. And LSQR, LSMR, sparse methods, etc. Fun!!
8 replies · 8 reposts · 184 likes · 29.3K views
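A minimal Python sketch of two of the solvers Daniel names: QR directly on X, and Cholesky on the normal equations (what LAPACK's POTRF does underneath). The data and sizes here are made-up illustrations; his "< 2000 cols" threshold is from the tweet, not verified here.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))  # tall, well-conditioned design matrix
y = rng.normal(size=2000)

# QR: numerically stable, works on X directly (no squaring of the condition number).
Q, R = np.linalg.qr(X)                        # X = QR, R upper triangular
beta_qr = linalg.solve_triangular(R, Q.T @ y)

# Cholesky on the normal equations X^T X beta = X^T y (POTRF under the hood):
# cheaper for tall skinny X, but squares the condition number.
c, low = linalg.cho_factor(X.T @ X)
beta_chol = linalg.cho_solve((c, low), X.T @ y)

print(np.allclose(beta_qr, beta_chol))  # True on well-conditioned data
```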
Andrew Carr 🤸 @andrew_n_carr
Another great interview question! For linear regression, we can directly compute the minimizer as β = (X^T X)^{-1} X^T y. So why do we often use gradient descent instead?
95 replies · 64 reposts · 1.5K likes · 343.7K views
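The standard answer, sketched below on toy data: the closed form needs all of X at once and a d×d solve on X^T X (which squares the condition number), while gradient descent only needs matrix-vector products, so it scales to huge, sparse, or streaming problems. Sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
beta_true = rng.normal(size=10)
y = X @ beta_true + 0.1 * rng.normal(size=1000)

# Closed form: solve (X^T X) beta = X^T y directly.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean-squared-error loss: only matvecs needed.
beta_gd = np.zeros(10)
lr = 0.1
for _ in range(500):
    beta_gd -= lr * X.T @ (X @ beta_gd - y) / len(y)

print(np.allclose(beta_gd, beta_closed))  # True: both reach the same minimizer
```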
Michael Saxon @m2saxon
@sameen2080 I see, I hadn't been following closely enough to be aware that qwq (indisputably a model) is considered a reasoning model. Regarding o1 though, I think that hiding the chain-of-thought tokens is a significant enough intervention on the raw outputs of the model to make it a system
1 reply · 0 reposts · 0 likes · 82 views
Michael Saxon @m2saxon
Can someone explain why o1 and its ilk are described as "reasoning models" and not as "reasoning systems"? Isn't it an LM inside a bigger structure?
4 replies · 0 reposts · 11 likes · 1.4K views
Justine Moore @venturetwins
ChatGPT refuses to say the name “David Mayer,” and no one knows why. If you try to get it to write the name, the chat immediately ends. People have attempted all sorts of things - ciphers, riddles, tricks - and nothing works.
[2 images attached]
3.1K replies · 3.7K reposts · 52.9K likes · 10.5M views
Haz Sameen Shahgir @sameen2080
@pranjalssh Excellent work. Couple of questions tho: "...hence tensor core instructions require storing C over 128 threads in a SM" - Shouldn't it be 1024/256 = 4? "When we distribute C over a warp-group, each thread needs 1024/128 = 8 threads" - what does each thread needing 8 threads mean?
1 reply · 0 reposts · 1 like · 694 views
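For anyone puzzling over the same passage: assuming the blog means a C tile of 1024 accumulator values held by one 128-thread warp-group (the tile size is an assumption, not confirmed by the post), the intended arithmetic is presumably

$$
\frac{1024 \text{ C elements}}{128 \text{ threads per warp-group}} = 8 \text{ accumulator values per thread},
$$

i.e. "8 elements per thread" rather than "8 threads".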
Pranjal @pranjalssh
I implemented an H100 CUDA matmul kernel from scratch, taking inspiration from @Si_Boehm's blog. Our final kernel outperforms cuBLAS by 7% for N=4096. It fits in a single C++ file without any dependencies. Full-blown blog post with all details: cudaforfun.substack.com/p/outperformin…
32 replies · 30 reposts · 288 likes · 49.9K views
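The real kernel lives in the linked post; as a language-agnostic illustration of the core idea behind such kernels (tiling the output and accumulating partial products per tile), here is a toy blocked matmul in Python. Purely a sketch: tile sizes, shared memory, and tensor-core details are what the actual H100 work is about.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Toy blocked matmul: compute C in tile x tile output blocks,
    accumulating partial products over K. Slow in Python; the point
    is the access pattern, not the speed."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros_like(C[i:i+tile, j:j+tile])  # per-tile accumulator
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
print(np.allclose(tiled_matmul(A, B), A @ B, atol=1e-2))  # True
```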
Simon Willison @simonw
After hassling Anthropic for months for a token-counting library similar to OpenAI's tiktoken, I just realized the Anthropic and Gemini approach of providing a free token-counting API is actually better... because I don't know how to use tiktoken to count tools, images, etc.
13 replies · 11 reposts · 251 likes · 30.8K views
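For concreteness, a sketch of the two approaches. The tiktoken half is plain-text only (exactly Simon's complaint: tools and images are the hard part); the Anthropic half assumes a recent SDK version that exposes messages.count_tokens, and the model name is just an example.

```python
import tiktoken
import anthropic

text = "Count the tokens in this message, please."

# Local counting with tiktoken: works for plain text, but you're on your
# own for tools, images, and message framing.
enc = tiktoken.get_encoding("o200k_base")
print("tiktoken:", len(enc.encode(text)))

# Server-side counting via Anthropic's token-counting API; the server
# applies the real chat template, tools, images, etc. for you.
# Requires ANTHROPIC_API_KEY in the environment.
client = anthropic.Anthropic()
resp = client.messages.count_tokens(
    model="claude-3-5-sonnet-latest",  # example model name
    messages=[{"role": "user", "content": text}],
)
print("anthropic:", resp.input_tokens)
```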
Michael Saxon @m2saxon
"IllusionVQA": Haz Sameen Shahgir, Khondker Salman Sayeed et al Testing VLM reasoning over optical illusion questions. For some *perceptual* illusions (same size, color) VLMs are superhuman, but for *logical* ones like "impossible shapes" they're worse. openreview.net/forum?id=7ysaJ…
[image attached]
1 reply · 0 reposts · 13 likes · 837 views
Delip Rao e/σ @deliprao
is there an llm finetuning service that will accept my data, train an open model (say llama3.2), and allow me to download the trained model?
80 replies · 28 reposts · 797 likes · 246.9K views
Teknium (e/λ) @Teknium
People probably don’t got enough questions to make it think as long as they expected (aka we’re all too dumb for it already)
27 replies · 4 reposts · 228 likes · 10.7K views
Haz Sameen Shahgir @sameen2080
@hu_yifei Yeah, current LLMs have really poor support for Bengali. NLLB was careful about this and upsampled Bengali. NLLB's chars/token for Bengali is 3.35 (higher is better); LLaMA-3, Qwen2, Mistral, and Aya are all at about ~0.8. For reference, English chars/token is around ~4.5.
0 replies · 0 reposts · 2 likes · 65 views
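A rough way to reproduce numbers like these, sketched with Hugging Face tokenizers. The checkpoints below are examples (and Llama 3 is gated), not necessarily the exact ones behind the figures in the tweet; fertility also depends heavily on the sample text.

```python
from transformers import AutoTokenizer

samples = {
    "bn": "আমি বাংলায় গান গাই, আমি বাংলার গান গাই।",
    "en": "The quick brown fox jumps over the lazy dog.",
}

for name in ["Qwen/Qwen2-7B", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        n_tokens = len(tok.encode(text, add_special_tokens=False))
        # chars/token: higher means the tokenizer compresses the language better
        print(f"{name} [{lang}]: {len(text) / n_tokens:.2f} chars/token")
```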
Yifei Hu @hu_yifei
Since I am working on multilingual stuff, I translated a piece of text from an academic paper into different languages. It seems like tokenizers are not friendly to certain languages, even though they are among the most spoken languages in the world. Can people who speak Hindi or Bengali confirm this?
[3 images attached]
9 replies · 2 reposts · 29 likes · 6.6K views
Haz Sameen Shahgir @sameen2080
[8/N] 🔍 Finally, we perform **extensive** ablation studies confirming that training a single model on both BPE and nucleotide tokenizations of each sequence matches training two separate models, with no performance loss.
1 reply · 0 reposts · 0 likes · 55 views
Haz Sameen Shahgir @sameen2080
[5/N] 📏 On long RNA sequences, BiRNA-BERT uses BPE to generate compressed sequence embeddings and can process RNA sequences 5 times longer than current SOTA RNA models with the same memory footprint.
1 reply · 0 reposts · 0 likes · 69 views
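To make the BPE-compression point concrete, a toy sketch with the Hugging Face tokenizers library: train a small BPE vocabulary on nucleotide strings and compare token counts against one-token-per-base. The corpus and vocab size are made up for illustration; BiRNA-BERT's actual tokenizer and training data are described in the paper.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Toy RNA corpus; real training data would be large and non-repetitive.
corpus = ["AUGGCUACGGAUCCGAUUAGC" * 5, "GCGCAUUAGCGGAUCGAUCGA" * 5]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

seq = corpus[0]
n_bpe = len(tokenizer.encode(seq).tokens)
n_nt = len(seq)  # nucleotide-level tokenization: one token per base
print(f"nucleotide tokens: {n_nt}, BPE tokens: {n_bpe}, "
      f"compression: {n_nt / n_bpe:.1f}x")
```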