Michael Smith
@TheCIMSmith
139 posts

PhD Student, CIM/APL
Montréal, Québec · Joined December 2018
243 Following · 26 Followers
Michael Smith @TheCIMSmith
#CVPR Findings posters are at 7am, per the tentative schedule.💀
0 replies · 0 reposts · 0 likes · 3 views
Michael Smith @TheCIMSmith
@ctocevents @CVPR Thanks! I also assume that CVPR will continue the advanced poster printing offer this year? Is there any information available yet?
1 reply · 0 reposts · 0 likes · 87 views
Michael Smith @TheCIMSmith
@CVPR Is there a luggage check on June 7 this year?
1 reply · 0 reposts · 0 likes · 177 views
danidbio @danidbio
That's it, we're looking at migrating off GitHub. Self-hosted Forgejo? I'm fine with the minor extra expense and maintenance burden. Tired of putting up with the same issues and stagnation of core features while Copilot slop we don't want keeps getting pushed at us.
1 reply · 0 reposts · 0 likes · 28 views
Michael Smith @TheCIMSmith
@bnjmn_marie @ysu_ChatData Anecdotally, running Qwen3.5 35B (UD Q4) with a Q4_0 KV cache broke down during agentic coding somewhere around 32K tokens, across multiple attempts. Switching to the default llama.cpp KV cache, it worked fine.
1 reply · 0 reposts · 0 likes · 18 views
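For anyone wanting to reproduce the comparison: in llama.cpp the K and V cache types are set separately. A minimal sketch with the llama-cpp-python bindings, where the GGUF path is a placeholder; note that llama.cpp requires flash attention for a quantized V cache.

```python
import llama_cpp

llm = llama_cpp.Llama(
    model_path="Qwen3.5-35B-UD-Q4_K_XL.gguf",  # placeholder path
    n_ctx=65536,
    flash_attn=True,                   # required for a quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q4_0,   # quantize the K cache to Q4_0
    type_v=llama_cpp.GGML_TYPE_Q4_0,   # quantize the V cache to Q4_0
)
# Omitting type_k/type_v keeps llama.cpp's default F16 KV cache,
# the configuration that worked fine in the anecdote above.
```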
Benjamin Marie @bnjmn_marie
I was almost certain that Q4 KV-cache quantization would wreck Qwen3.5. It didn't. So far, I'm not seeing any meaningful accuracy loss.

That said, it probably only works for good GGUF versions. If you begin with a poorly quantized Qwen3.5, e.g. Q2, the KV cache is already noisier and much harder to quantize reliably.

Note: again, don't be surprised by quantized models scoring higher than the original. With small benchmarks this can happen, and it's not significant. Changing the seed can have the same effect.

Tomorrow, I'll post a full summary of my Qwen3.5 GGUF evals on my blog (link in bio), including the methodology and new results.
[image]
13 replies · 20 reposts · 219 likes · 10.8K views
Michael Smith @TheCIMSmith
@bnjmn_marie So pass@k means that each prompt is run through the model k times, and the answer is correct if at least one of the k times it was correct? Following the logic of SPoC (Kulal et al., NeurIPS 2019)? The error bars must be large then...
0 replies · 0 reposts · 0 likes · 325 views
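That is indeed the standard reading of pass@k. The commonly used unbiased estimator (Chen et al., 2021, the Codex paper) draws k samples without replacement from n generations of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per prompt, 3 of them correct:
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=4))  # ~0.833
```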
Benjamin Marie @bnjmn_marie
Qwen3.5 27B is worse than 397B at coding. But only one retry is enough to erase the gap.

LiveCodeBench accuracy (thinking disabled):
- Qwen3.5 27B pass@1: 71
- Qwen3.5 397B pass@1: 79
- Qwen3.5 27B pass@2: 81
- Qwen3.5 27B pass@4: 86

Translation: if you can test the first answer and ask for one more try, 27B gives you about 397B-level coding performance for way less cost. Four tries, and you get better results.
[image]
39 replies · 44 reposts · 544 likes · 48.1K views
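A quick sanity check on those numbers: if retries were statistically independent, pass@2 would follow directly from pass@1, and the reported 81 sits well below that, i.e. the problems the model fails once it tends to fail again.

```python
p1 = 0.71                  # reported Qwen3.5 27B pass@1
print(1 - (1 - p1) ** 2)   # 0.9159: pass@2 if retries were independent
print(1 - (1 - p1) ** 4)   # 0.9929: pass@4 under the same assumption
# Reported pass@2 = 0.81 and pass@4 = 0.86 are well below these bounds,
# so failures are correlated across retries.
```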
Humphrey Shi @humphrey_shi
Decisions for @CVPR 2026 are out—congratulations to all authors. I'm excited to share a community step forward: the new CVPR Findings Track.

Area Chairs recommended 1717 papers for potential inclusion, creating a principled pathway to recognize and share valuable work that may not be the best fit for the main program—while still enabling authors to publish and present through integrated Findings poster sessions.

As our field scales, we need not only better models—but better community infrastructure.

This effort is led collectively by the Findings organizing team—Bryan Plummer, Kevin Shih, @anand_bhattad, @jccaicedo, @Grigoris_c, @BoqingGo, @liuziwei7, and me. Huge thanks to the CVPR General Chairs, Program Chairs, and especially the Area Chairs for supporting this step forward.

Looking forward to seeing many of you at CVPR 2026—across the main program, Findings, and workshops.
[image]
6 replies · 12 reposts · 68 likes · 34.1K views
Michael Smith @TheCIMSmith
@humphrey_shi @chen940382 @CVPR It's been a very confusing morning, but am I correct in saying that a "Suggest To Findings Workshop" = Yes from the AC means accepted for the findings track of CVPR 2026? Also, how is findings both a workshop and a track?
1 reply · 0 reposts · 3 likes · 998 views
Humphrey Shi @humphrey_shi
@chen940382 @CVPR Thanks! We’re finalizing the opt-in flow and logistics now. Official instructions (incl. timing + poster details) will go out to authors soon—please stay tuned.
4 replies · 1 repost · 13 likes · 8.4K views
Michael Smith reposted
Teddy Sjöström @TheoVanGrind
@Jonathan_Blow regarding software bloat outpacing hardware progress: found my 18-year-old Pentium II 400 MHz laptop and fired up MS Word 98...
157 replies · 1.3K reposts · 10.7K likes · 0 views
Michael Smith reposted
Zhuang Liu @liuzhuang1234
Stronger Normalization-Free Transformers – new paper. We introduce Derf (Dynamic erf), a simple point-wise layer that lets norm-free Transformers not only work, but actually outperform their normalized counterparts.
[image]
19 replies · 176 reposts · 1.1K likes · 165.7K views
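The tweet doesn't give the paper's parameterization, so the following is a purely hypothetical sketch of what a learnable point-wise erf layer could look like in PyTorch; the per-channel scale and shift parameters are assumptions, not the formulation from the paper.

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """Hypothetical point-wise 'dynamic erf' layer: per-channel learnable
    input scale/shift and output scale. A guess at the general shape of
    such a layer, NOT the paper's actual method."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # input scale
        self.beta = nn.Parameter(torch.zeros(dim))   # input shift
        self.gamma = nn.Parameter(torch.ones(dim))   # output scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied element-wise over the channel dimension, in place of
        # a normalization layer.
        return self.gamma * torch.erf(self.alpha * x + self.beta)
```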
Michael Smith reposted
vLLM @vllm_project
🚀 DeepSeek-OCR — the new frontier of OCR from @deepseek_ai, exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G), powered by vllm==0.8.5 for day-0 model support.
🧠 Compresses visual contexts up to 20× while keeping 97% OCR accuracy at <10×.
📄 Outperforms GOT-OCR2.0 & MinerU2.0 on OmniDocBench using fewer vision tokens.
🤝 The vLLM team is working with DeepSeek to bring official DeepSeek-OCR support into the next vLLM release — making multimodal inference even faster and easier to scale.
🔗 github.com/deepseek-ai/De…
#vLLM #DeepSeek #OCR #LLM #VisionAI #DeepLearning
[3 images]
53 replies · 367 reposts · 2.6K likes · 1.5M views
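A sketch of how a model like this is typically driven through vLLM's offline multimodal API; the prompt template below is an assumption and may differ from what DeepSeek-OCR actually expects.

```python
# Sketch of vLLM offline multimodal inference; the prompt format is an
# assumption, not taken from the DeepSeek-OCR repo.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
image = Image.open("page.png")

outputs = llm.generate(
    {
        "prompt": "<image>\nTranscribe this document.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```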
Michael Smith reposted
Andrej Karpathy @karpathy
Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat, and I wrote up a full guide here: github.com/karpathy/nanoc…

This is done via a new synthetic task `SpellingBee` that generates examples of a user asking for this kind of problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right, especially at smaller model sizes, and the guide steps through them. As a brief overview:

- You have to ensure diversity in user prompts/queries.
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.
- I am encouraging the model to solve the problem in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to.

This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.

If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because the nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)
[image]
177 replies · 433 reposts · 4.4K likes · 572.3K views
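The spirit of the synthetic task is easy to sketch. A hypothetical generator (not nanochat's actual code) that standardizes the word in quotes, spells it out to break up tokens, and keeps an explicit running counter:

```python
def spelling_bee_example(word: str, letter: str) -> dict:
    """Hypothetical generator in the spirit of nanochat's SpellingBee
    task, not its actual code: spell the word out letter by letter and
    keep an explicit counter, spreading the reasoning across tokens."""
    steps, count = [], 0
    for ch in word:
        if ch == letter:
            count += 1
        steps.append(f'"{ch}" -> count = {count}')
    return {
        "user": f'How many "{letter}" are in "{word}"?',
        "assistant": (
            f'Spelling out "{word}" and counting "{letter}":\n'
            + "\n".join(steps)
            + f"\nAnswer: {count}"
        ),
    }

print(spelling_bee_example("strawberry", "r")["assistant"])  # Answer: 3
```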
Michael Smith @TheCIMSmith
@thegauguy Yes, exactly. This changes how Linux writes the time to the machine's hardware clock to match the way Windows interprets it, so they stop conflicting.
1 reply · 0 reposts · 0 likes · 35 views
SuperGAU @thegauguy
@TheCIMSmith It's not my Linux time that's broken. My Windows time is broken when I boot into Windows directly after using Linux on the same machine.
1 reply · 0 reposts · 0 likes · 32 views
SuperGAU @thegauguy
The most annoying thing ever is something many people never encounter: booting into your Windows system after having used a Linux system on the same machine, and suddenly your system time is 2 hours behind your actual time.
3 replies · 0 reposts · 20 likes · 1.8K views
Michael Smith @TheCIMSmith
@Nrg8000 Way too much hype going on, but technically speaking, at the end of the day a model is learning patterns, and ones with minimal presence in the data, like old Australian terror camps, will get disregarded, if much data from Australia even made it into the training at all.
0 replies · 0 reposts · 0 likes · 24 views
Nathan Ruser @Nrg8000
The same search on Factiva also finds plenty more on it. There are also hits in the National Library of Australia's archive. Is there any particular reason this is so difficult for AI to find, or have they just pulled the wool over our eyes pretending it's competent?
[image]
2 replies · 1 repost · 16 likes · 3.1K views
Nathan Ruser @Nrg8000
I had a vague recollection of a terror training camp in the late 1990s or early 2000s being discovered near the NSW town of Braidwood, so I asked ChatGPT to dig up the reference. It couldn't find it, not a single AI could. Tell me if/how you can make AI find it (it is real ⬇️)
6 replies · 0 reposts · 21 likes · 6K views
Michael Smith @TheCIMSmith
@cloneofsimo As @MoummadIlyass points out, not necessarily. I'm trying to get DINOv2 to work on a task with limited data, and on what I didn't think would be considered a distribution shift, but I cannot get it to beat an R50...
0 replies · 0 reposts · 1 like · 126 views
Simo Ryu @cloneofsimo
I'm looking for literature to help me understand this: consider you have fixed compute C to split between training an SSL model like DINOv2 (R) and a downstream task (T).

It is well known that if T has little data, one *needs* to leverage R, so spending a lot of compute on R is the rational thing to do. In fact, this was systematically studied in Scaling Laws for Transfer: more compute for pretraining is better if T is small.

But if T has infinite compute and infinite data, like diffusion pretraining, is it better to do R at all? The REPA line of work suggests yes: distill from R as an objective. But they don't count the training cost of R.

Apple's recent distillation paper shows that if you consider the total FLOPs of the teacher and student, plus the inference cost of the teacher, distillation is not compute optimal (that is, when you account for both training and inference of the teacher model overall).

Similarly, DINOv2 pretraining compute is not free: 650k steps with batch size 3000, 1.1B parameters, distilled later. DINOv2 reports 22k hours, which on A100s at decent MFU translates to almost half of SD3-large's pretraining cost (~2.5 * 10^22 FLOPs).

REPA-line works (the recent DDT, LightningDiT, Seedream, etc.) don't report the entire compute budget; they assume DINOv2 is 'given' and that inference on DINOv2 is also free. Is it still optimal once you can allocate an extra 2.5 * 10^22 FLOPs to the pretraining stage?

This is an open question that I honestly don't know the answer to and genuinely want to hear opinions / literature on. If any answer is convincing I'm going to rerun REPA experiments lmao
[2 images]
Quoting Simo Ryu @cloneofsimo: "Seedream 3.0 Do I need to re-do my REPA experiments🤔"
8 replies · 13 reposts · 130 likes · 13.7K views
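The 22k A100-hours figure does line up with the quoted budget; the MFU value below is an assumption for what "decent" means.

```python
# Back-of-envelope check of the DINOv2 compute figure quoted above.
gpu_hours = 22_000
a100_bf16_peak = 312e12                    # FLOP/s, A100 dense BF16 peak
peak_total = gpu_hours * 3600 * a100_bf16_peak
print(f"{peak_total:.2e}")                 # ~2.47e+22 FLOPs at 100% utilization
print(f"{0.45 * peak_total:.2e}")          # ~1.11e+22 at an assumed ~45% MFU,
                                           # roughly half of the ~2.5e22 quoted
                                           # for SD3-large pretraining
```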
Michael Smith reposted
Jan Barta @absurdtrader
We live in a time increasingly resembling the second half of the 1930s. I couldn't live with myself with the knowledge I didn't do more to stop another Hitler. In light of the Republican wavering and in memory of Navalny I will give $100 towards FPV drones for Ukr for every RT.
980 replies · 25.3K reposts · 20.2K likes · 3.3M views