Shady

667 posts

@ShadyAlii0

Learning, and trying to make the Machine Learn | Research Assistant @MinnesotaNLP

Minneapolis, MN · Joined May 2025

2.1K Following · 306 Followers

Pinned Tweet

Shady@ShadyAlii0·9 Dec

I'm also currently training on one Nvidia DGX Spark, which is limiting the batch size to about 100 samples only. This is not the best size for contrastive learning, as seeing more negatives at a time is more helpful to the learning objective, so I'll probably try distributed training on x2 DGX Sparks!

2.2K

Shady@ShadyAlii0·5d

@paladinposts Congrats!

imogene the sunbringer 🌞@paladinposts·6d

this paladin is also a scholar

1.5K

20K

Shady@ShadyAlii0·5d

It's been a while, and it's kinda scary one can't use the "I'm still an undergrad" shield from this month onwards :D

129

Shady@ShadyAlii0·5 May

@idavidrein Their analysis based on IRT looks really interesting. I wonder what was their exact setup & LLM sample size for fitting the IRT models, as I think it can affect the latent ability estimation.

david rein@idavidrein·4 May

They use item-response theory (IRT) across a bunch of models and benchmarks to aggregate scores, jointly estimating both task difficulty and model capabilities, and they also have some nice cost-aware comparisons. Their blog post: nist.gov/news-events/ne…

913

david rein@idavidrein·4 May

The Center for AI Standards and Innovation (CAISI) is estimating that the rate of progress of Chinese frontier AI is slower than the US's. 16 months ago (January 2025) the gap was ~4 months, now it's ~8 months.

3.3K

Shady@ShadyAlii0·2 May

@demisama_ @MSFTResearch Congrats!

317

Demi Wang@demisama_·2 May

Life update: After a long, stressful, and busy internship hunt, I'll be joining @MSFTResearch this summer as a Research Intern, working on LLM agents! Would love to connect with ppl in Seattle. I'm into bouldering, poker, food exploring, bar hopping, and occasionally raving :)

346

29.7K

Shady@ShadyAlii0·29 Apr

@IanArawjo Thank you! This looks even more interesting now

Ian Arawjo@IanArawjo·28 Apr

@ShadyAlii0 The coverage rate of a CI method is the proportion of times that the interval contains the true population parameter upon repeated testing. It's essentially its performance. A 95% CI *should* cover the true mean 95% of the time; these simulations test how true that is.
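
A minimal sketch of the coverage idea described above, assuming synthetic normally distributed data and a simple normal-approximation interval. This is not the setup or any of the CI methods from the simulations mentioned in the thread; `coverage_rate` and all of its parameters are hypothetical names chosen for illustration.

```python
import random
import statistics


def coverage_rate(true_mean=0.0, true_sd=1.0, n=50, trials=2000, level=0.95, seed=0):
    """Estimate how often a nominal `level` interval contains the true mean."""
    rng = random.Random(seed)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample from a population whose mean we know.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        half_width = z * statistics.stdev(sample) / n ** 0.5
        # Count the trial as a hit when the interval covers the true mean.
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / trials


print(coverage_rate())
```

If the printed proportion sits well below the nominal level, the interval method is under-covering for that sample size and data distribution, which is the kind of gap such simulations are meant to expose.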

Ian Arawjo@IanArawjo·28 Apr

Re-run the CI methods comparison using real LLM eval data across 15 benchmarks. Here's the plot. The twist—added an empirical likelihood-based method, a Bayesian method with Normal-Inverse-Gamma prior, and a log-transformed t-interval. The latter two crush it—very efficient, too:

1.3K

Shady@ShadyAlii0·14 Apr

That’s really interesting! I’ve tried working on the reasoning geometry to predict steps correctness alongside the final result, but it failed in downstream apps I tried. It’s genuinely exciting to see the different layers’ representations here and the idea working for steering too! Congrats!!

130

Lihao Sun@1e0sun·14 Apr

How do LLMs do CoT reasoning internally? In our new #ACL2026 paper, we show that reasoning unfolds as a structured trajectory in representation space. Correct and incorrect paths diverge, and we use this to predict correctness before the answer and correct errors mid-flight. 1/

288

19.5K

Shady@ShadyAlii0·8 Apr

@literalscientst Wish I could say the same thing about Minnesota

Shady@ShadyAlii0·19 Mar

I didn’t read it yet, but I think if we want to test the metacognitive abilities of these models in that case, it’d be more grounded to give them snippets of that rare language’s documentation in context, or allow them to retrieve it in agent mode. It’s fair to expect a smart developer to be able to transfer the business logic between languages or “setups”, but they’d still need to read or understand how to express general logic in those languages as well. And maybe for agentic settings you could test them without a documentation reference but allow them to plug and play with the code in an environment, so they can develop a sense of how the syntax works with feedback, which could be a fair comparison too

274

kalomaze@kalomaze·19 Mar

not shocking at all; the models don't want to write in byzantine esoteric languages instead of python or rust or whatever

Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

256

17.6K

Shady@ShadyAlii0·17 Mar

I’m still having a really hard time with this myself, but something I noticed is that just replicating 1-3 papers in the specific area I started working/reading on, and doing some extra analysis on those results (not about whether they’re similar to the original, but more about what other dimensions I can look at), helps: from the extra stuff you do and look at, you can get a tighter, written-down vision of “what’s missing” or where you could dig deeper. I’m too new to this though, so idk

264

silicognition@silicognition·16 Mar

people who are doing research, how do you go from reading papers & ideation to getting down to something concrete which can be actually done? i have ideas, read a lot of papers but from a fuzzy cloud of insights & inspirations, i would like to get to the finish line help pls!

128

1.9K

60.6K

Shady@ShadyAlii0·14 Mar

I couldn't grasp the difference in resources between America and Egypt until I realized that, for the first paper I worked on in Egypt, my colleagues and I sat creating Gemini accounts for the free API credits and rotating through the API keys whenever one of them got locked, all because the university couldn't reimburse us or pay for the experiments' expenses. It's honestly sad, especially since this is the situation in a field that is resourceful to begin with, and surely other fields that need lab equipment, like the life sciences, are in a much worse position.

9.6K

Shady@ShadyAlii0·8 Feb

@ShuaichenChang Hi Shuaichen, your dms are closed :D

2.3K

Shuaichen Chang@ShuaichenChang·8 Feb

My team at the AWS AI Lab (based in NYC) is hiring several research interns this summer. I’ll be working closely with a few interns on projects focused on LLM memory and continual learning, aimed at publication. Feel free to DM me if you’re interested, have relevant experience, or would like to refer a student.

779

63K

Shady@ShadyAlii0·7 Feb

Going back to playing squash has been the best thing I did in 2026 so far. I missed the game and I kinda look forward to it every week

828

Shady@ShadyAlii0·5 Feb

Who even still uses matrix cross product in big 2026

330

Shady@ShadyAlii0·5 Feb

@esha_hq Keep cooking

168

Esha@esha_hq·5 Feb

so i just caused my entire building to evacuate while meal prepping. easily most embarrassing moment of college but owned it by giving an apology speech to everyone in the apartment lobby. comms right?

4.3K

Shady@ShadyAlii0·4 Feb

Wouldn’t AI safety in this context include multiple possible interpretations? Like software-side safety and making “safe” software with AI, and/or the general safety of generative models that can take actions and “produce” things? I feel both are important in this context, but the software part is much less explored/discussed than the general safety definition, even though the biggest use case for these agents and architectures seems to be coding so far

107

Maxime Chevalier@Love2Code·4 Feb

Lots of people blindly believe AI safety is never going to be an issue. Meanwhile top AI labs are using LLMs to write 80% of their code, and said code relies on 5000 poorly-written Python packages that were never audited. As if humans were any good at writing bug-free code.

7.1K

Shady@ShadyAlii0·4 Feb

@maharshii Created 0.5 terabyte of embeddings and struggled for a full day just to store and re-read them again for training without waiting half an hour to load a batch into memory

155

maharshi@maharshii·4 Feb

the deeper i go the more i feel that 80% of the time is spent on data movement and transformation, while only 20% involves actually computing stuff

220

9.9K

Shady@ShadyAlii0·4 Feb

I feel that there’s no shame in that as long as it’s communicated that way. And it’s kinda amazing that we are at that level of brute-forcing things with this kind of intelligence! It feels that the challenge with deploying LLMs for consumption other than chatbots, like in agentic form or scientific discovery, is now more oriented towards developing software capable of utilizing these models “as they are” without demanding what “should be” instead of what is. And maybe intrinsic intelligence will be more long-term achievement (5-10yrs) where we don’t need to rely heavily on external software to steer these models to be productive.

272

davinci@leothecurious·4 Feb

tbh this feels far more like brute-forced hill-climbing than the kind of insightful intelligence breakthroughs chollet hoped for this benchmark to incentivize

ARC Prize@arcprize

Johan's submission does a multi-model ensemble. It runs the same task through GPT-5.2, Gemini-3, and Claude Opus 4.5 in parallel. Tries multiple times with different prompting strategies (standard, deep thinking, with images). Then, instead of predicting the grid directly, the LLMs write Python functions that describe the transformation rule, then execute that code in a sandbox to produce the answer. After collecting many candidate answers, separate AI "judge" models evaluate and vote on which solution is most likely correct. See the repo here: github.com/beetree/ARC-AGI

222

19.1K

Shady@ShadyAlii0·2 Feb

Trying to customize some stuff in vLLM’s codebase and feeling like a big failure again

Shady@ShadyAlii0

Reading Torch’s codebase and feeling like a big fucking failure of a cs student rn

359

Shady@ShadyAlii0·2 Feb

@SherifKozman @MostafaNageeb I think the problem is in the same place too: you don't have companies that look at OSS, you don't have a good budget as an individual or a group to pay for API access to the good models, and you also don't have GPUs to run local models. And I don't think there are any grants or funding that help with things like this in any form, whether governmental or private

KoZman@SherifKozman·2 Feb

@MostafaNageeb Best wishes for the season, and welcome back safely. You don't need GPUs; I'm talking about projects like OpenClawd, Ralph, and many others

1.5K

KoZman@SherifKozman·2 Feb

Serious question: why aren't there nice AI projects coming out of Egypt, the way there's a flood of open source coming out of America and other countries? What's stopping people from building things instead of complaining that AI will leave them sitting at home?

13.6K
