Michael Smith
@TheCIMSmith
139 posts

PhD Student, CIM/APL
Montréal, Québec · Joined December 2018
243 Following · 26 Followers
Michael Smith @TheCIMSmith
#CVPR Findings posters are at 7am, per the tentative schedule.💀
0 replies · 0 reposts · 0 likes · 3 views
Michael Smith @TheCIMSmith
@ctocevents @CVPR Thanks! I also assume that CVPR will continue the advanced poster printing offer this year? Is there any information available yet?
1 reply · 0 reposts · 0 likes · 87 views
Michael Smith @TheCIMSmith
@CVPR Is there a luggage check on June 7 this year?
1 reply · 0 reposts · 0 likes · 177 views
danidbio @danidbio
That's it, we're looking at migrating off GitHub. Self-hosted Forgejo? I'm fine with the minor extra expense and maintenance burden. Tired of putting up with the same issues and stagnation of core features while Copilot slop we don't want keeps getting pushed at us.
1 reply · 0 reposts · 0 likes · 28 views
Michael Smith @TheCIMSmith
@bnjmn_marie @ysu_ChatData Anecdotally, running Qwen3.5 35B (UD Q4) with a Q4_0 KV cache broke down during agentic coding somewhere around 32K tokens, across multiple attempts. Switching to the default llama.cpp KV cache, it worked fine.
1 reply · 0 reposts · 0 likes · 18 views
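For anyone wanting to reproduce the comparison: in llama.cpp the K and V cache types are set separately. A minimal sketch with the llama-cpp-python bindings, where the GGUF path is a placeholder; note that llama.cpp requires flash attention for a quantized V cache.

```python
import llama_cpp

llm = llama_cpp.Llama(
    model_path="Qwen3.5-35B-UD-Q4_K_XL.gguf",  # placeholder path
    n_ctx=65536,
    flash_attn=True,                   # required for a quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q4_0,   # quantize the K cache to Q4_0
    type_v=llama_cpp.GGML_TYPE_Q4_0,   # quantize the V cache to Q4_0
)
# Omitting type_k/type_v keeps llama.cpp's default F16 KV cache,
# the configuration that worked fine in the anecdote above.
```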
Benjamin Marie @bnjmn_marie
I was almost certain that Q4 KV-cache quantization would wreck Qwen3.5. It didn't. So far, I'm not seeing any meaningful accuracy loss.

That said, it probably only works for good GGUF versions. If you begin with a poorly quantized Qwen3.5, e.g. Q2, the KV cache is already noisier and much harder to quantize reliably.

Note: again, don't be surprised by quantized models scoring higher than the original. With small benchmarks this can happen, and it's not significant. Changing the seed can have the same effect.

Tomorrow, I'll post a full summary of my Qwen3.5 GGUF evals on my blog (link in bio), including the methodology and new results.
[image]
13 replies · 20 reposts · 219 likes · 10.8K views
Michael Smith @TheCIMSmith
@bnjmn_marie So pass@k means that each prompt is run through the model k times, and the answer is correct if at least one of the k times it was correct? Following the logic of SPoC (Kulal et al., NeurIPS 2019)? The error bars must be large then...
0 replies · 0 reposts · 0 likes · 325 views
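That is indeed the standard reading of pass@k. The commonly used unbiased estimator (Chen et al., 2021, the Codex paper) draws k samples without replacement from n generations of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per prompt, 3 of them correct:
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=4))  # ~0.833
```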
Benjamin Marie @bnjmn_marie
Qwen3.5 27B is worse than 397B at coding. But only one retry is enough to erase the gap.

LiveCodeBench accuracy (thinking disabled):
- Qwen3.5 27B pass@1: 71
- Qwen3.5 397B pass@1: 79
- Qwen3.5 27B pass@2: 81
- Qwen3.5 27B pass@4: 86

Translation: if you can test the first answer and ask for one more try, 27B gives you about 397B-level coding performance for way less cost. Four tries, and you get better results.
[image]
39 replies · 44 reposts · 544 likes · 48.1K views
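A quick sanity check on those numbers: if retries were statistically independent, pass@2 would follow directly from pass@1, and the reported 81 sits well below that, i.e. the problems the model fails once it tends to fail again.

```python
p1 = 0.71                  # reported Qwen3.5 27B pass@1
print(1 - (1 - p1) ** 2)   # 0.9159: pass@2 if retries were independent
print(1 - (1 - p1) ** 4)   # 0.9929: pass@4 under the same assumption
# Reported pass@2 = 0.81 and pass@4 = 0.86 are well below these bounds,
# so failures are correlated across retries.
```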
Humphrey Shi @humphrey_shi
Decisions for @CVPR 2026 are out—congratulations to all authors. I'm excited to share a community step forward: the new CVPR Findings Track.

Area Chairs recommended 1717 papers for potential inclusion, creating a principled pathway to recognize and share valuable work that may not be the best fit for the main program—while still enabling authors to publish and present through integrated Findings poster sessions.

As our field scales, we need not only better models—but better community infrastructure.

This effort is led collectively by the Findings organizing team—Bryan Plummer, Kevin Shih, @anand_bhattad, @jccaicedo, @Grigoris_c, @BoqingGo, @liuziwei7, and me. Huge thanks to the CVPR General Chairs, Program Chairs, and especially the Area Chairs for supporting this step forward.

Looking forward to seeing many of you at CVPR 2026—across the main program, Findings, and workshops.
[image]
6 replies · 12 reposts · 68 likes · 34.1K views
Michael Smith @TheCIMSmith
@humphrey_shi @chen940382 @CVPR It's been a very confusing morning, but am I correct in saying that a "Suggest To Findings Workshop" = Yes from the AC means accepted for the findings track of CVPR 2026? Also, how is findings both a workshop and a track?
1 reply · 0 reposts · 3 likes · 998 views
Humphrey Shi @humphrey_shi
@chen940382 @CVPR Thanks! We’re finalizing the opt-in flow and logistics now. Official instructions (incl. timing + poster details) will go out to authors soon—please stay tuned.
4 replies · 1 repost · 13 likes · 8.4K views
Michael Smith reposted
Teddy Sjöström @TheoVanGrind
@Jonathan_Blow regarding software bloat outpacing hardware progress: found my 18-year-old Pentium II 400 MHz laptop and fired up MS Word 98...
157 replies · 1.3K reposts · 10.7K likes · 0 views
Michael Smith reposted
Zhuang Liu @liuzhuang1234
Stronger Normalization-Free Transformers – new paper. We introduce Derf (Dynamic erf), a simple point-wise layer that lets norm-free Transformers not only work, but actually outperform their normalized counterparts.
[image]
19 replies · 176 reposts · 1.1K likes · 165.7K views
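The tweet doesn't give the paper's parameterization, so the following is a purely hypothetical sketch of what a learnable point-wise erf layer could look like in PyTorch; the per-channel scale and shift parameters are assumptions, not the formulation from the paper.

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """Hypothetical point-wise 'dynamic erf' layer: per-channel learnable
    input scale/shift and output scale. A guess at the general shape of
    such a layer, NOT the paper's actual method."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # input scale
        self.beta = nn.Parameter(torch.zeros(dim))   # input shift
        self.gamma = nn.Parameter(torch.ones(dim))   # output scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied element-wise over the channel dimension, in place of
        # a normalization layer.
        return self.gamma * torch.erf(self.alpha * x + self.beta)
```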
Michael Smith reposted
vLLM @vllm_project
🚀 DeepSeek-OCR — the new frontier of OCR from @deepseek_ai, exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G), powered by vllm==0.8.5 for day-0 model support.
🧠 Compresses visual contexts up to 20× while keeping 97% OCR accuracy at <10×.
📄 Outperforms GOT-OCR2.0 & MinerU2.0 on OmniDocBench using fewer vision tokens.
🤝 The vLLM team is working with DeepSeek to bring official DeepSeek-OCR support into the next vLLM release — making multimodal inference even faster and easier to scale.
🔗 github.com/deepseek-ai/De…
#vLLM #DeepSeek #OCR #LLM #VisionAI #DeepLearning
[3 images]
53 replies · 367 reposts · 2.6K likes · 1.5M views
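A sketch of how a model like this is typically driven through vLLM's offline multimodal API; the prompt template below is an assumption and may differ from what DeepSeek-OCR actually expects.

```python
# Sketch of vLLM offline multimodal inference; the prompt format is an
# assumption, not taken from the DeepSeek-OCR repo.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
image = Image.open("page.png")

outputs = llm.generate(
    {
        "prompt": "<image>\nTranscribe this document.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```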
Michael Smith reposted
Andrej Karpathy @karpathy
Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat, and I wrote up a full guide here: github.com/karpathy/nanoc…

This is done via a new synthetic task `SpellingBee` that generates examples of a user asking for this kind of problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right, especially at smaller model sizes, and the guide steps through them. As a brief overview:

- You have to ensure diversity in user prompts/queries.
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.
- I am encouraging the model to solve the problem in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to.

This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.

If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because the nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)
[image]
177 replies · 433 reposts · 4.4K likes · 572.3K views
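The spirit of the synthetic task is easy to sketch. A hypothetical generator (not nanochat's actual code) that standardizes the word in quotes, spells it out to break up tokens, and keeps an explicit running counter:

```python
def spelling_bee_example(word: str, letter: str) -> dict:
    """Hypothetical generator in the spirit of nanochat's SpellingBee
    task, not its actual code: spell the word out letter by letter and
    keep an explicit counter, spreading the reasoning across tokens."""
    steps, count = [], 0
    for ch in word:
        if ch == letter:
            count += 1
        steps.append(f'"{ch}" -> count = {count}')
    return {
        "user": f'How many "{letter}" are in "{word}"?',
        "assistant": (
            f'Spelling out "{word}" and counting "{letter}":\n'
            + "\n".join(steps)
            + f"\nAnswer: {count}"
        ),
    }

print(spelling_bee_example("strawberry", "r")["assistant"])  # Answer: 3
```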
Michael Smith @TheCIMSmith
@thegauguy Yes, exactly. This changes how Linux writes the time to the machine's hardware clock to match the way Windows interprets it, so they stop conflicting.
1 reply · 0 reposts · 0 likes · 35 views
SuperGAU @thegauguy
@TheCIMSmith It's not my Linux time that's broken. My Windows time is broken when I boot into Windows directly after using Linux on the same machine.
1 reply · 0 reposts · 0 likes · 32 views
SuperGAU @thegauguy
The most annoying thing ever is something many people never encounter: booting into your Windows system after having used a Linux system on the same machine, and suddenly your system time is 2 hours behind your actual time.
3 replies · 0 reposts · 20 likes · 1.8K views
Michael Smith @TheCIMSmith
@Nrg8000 Way too much hype going on, but technically speaking, at the end of the day a model is learning patterns, and ones with minimal presence in the data, like old Australian terror camps, will get disregarded, if much data from Australia even made it into the training at all.
0 replies · 0 reposts · 0 likes · 24 views
Nathan Ruser @Nrg8000
The same search on Factiva also finds plenty more on it. There are also hits in the National Library of Australia's archive. Is there any particular reason this is so difficult for AI to find, or have they just pulled the wool over our eyes pretending it's competent?
[image]
2 replies · 1 repost · 16 likes · 3.1K views
Nathan Ruser @Nrg8000
I had a vague recollection of a terror training camp in the late 1990s or early 2000s being discovered near the NSW town of Braidwood, so I asked ChatGPT to dig up the reference. It couldn't find it, not a single AI could. Tell me if/how you can make AI find it (it is real ⬇️)
6 replies · 0 reposts · 21 likes · 6K views
Michael Smith @TheCIMSmith
@cloneofsimo As @MoummadIlyass points out, not necessarily. I'm trying to get DINOv2 to work on a task with limited data, and on what I didn't think would be considered a distribution shift, but I cannot get it to beat an R50...
0 replies · 0 reposts · 1 like · 126 views
Simo Ryu @cloneofsimo
I'm looking for literature to help me understand this: consider you have fixed compute C to split between training an SSL model like DINOv2 (R) and a downstream task (T).

It is well known that if T has little data, one *needs* to leverage R, so spending a lot of compute on R is the rational thing to do. In fact, this was systematically studied in Scaling Laws for Transfer: more compute for pretraining is better if T is small.

But if T has infinite compute and infinite data, like diffusion pretraining, is it better to do R at all? The REPA line of work suggests yes: distill from R as an objective. But they don't count the training cost of R.

Apple's recent distillation paper shows that if you consider the total FLOPs of the teacher and student, plus the inference cost of the teacher, distillation is not compute optimal (that is, when you account for both training and inference of the teacher model overall).

Similarly, DINOv2 pretraining compute is not free: 650k steps with batch size 3000, 1.1B parameters, distilled later. DINOv2 reports 22k hours, which on A100s at decent MFU translates to almost half of SD3-large's pretraining cost (~2.5 * 10^22 FLOPs).

REPA-line works (the recent DDT, LightningDiT, Seedream, etc.) don't report the entire compute budget; they assume DINOv2 is 'given' and that inference on DINOv2 is also free. Is it still optimal once you can allocate an extra 2.5 * 10^22 FLOPs to the pretraining stage?

This is an open question that I honestly don't know the answer to and genuinely want to hear opinions / literature on. If any answer is convincing I'm going to rerun REPA experiments lmao
[2 images]
Quoting Simo Ryu @cloneofsimo: "Seedream 3.0 Do I need to re-do my REPA experiments🤔"
8 replies · 13 reposts · 130 likes · 13.7K views
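The 22k A100-hours figure does line up with the quoted budget; the MFU value below is an assumption for what "decent" means.

```python
# Back-of-envelope check of the DINOv2 compute figure quoted above.
gpu_hours = 22_000
a100_bf16_peak = 312e12                    # FLOP/s, A100 dense BF16 peak
peak_total = gpu_hours * 3600 * a100_bf16_peak
print(f"{peak_total:.2e}")                 # ~2.47e+22 FLOPs at 100% utilization
print(f"{0.45 * peak_total:.2e}")          # ~1.11e+22 at an assumed ~45% MFU,
                                           # roughly half of the ~2.5e22 quoted
                                           # for SD3-large pretraining
```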
Michael Smith reposted
Jan Barta @absurdtrader
We live in a time increasingly resembling the second half of the 1930s. I couldn't live with myself with the knowledge I didn't do more to stop another Hitler. In light of the Republican wavering and in memory of Navalny I will give $100 towards FPV drones for Ukr for every RT.
980 replies · 25.3K reposts · 20.2K likes · 3.3M views