Jinpeng Wang

182 posts


@awinyimgprocess

Tenure-track Professor at Central South University; PhD from NUS. Focus on Multi-modality Learning and Data-centric AI.

Singapore · Joined April 2017
161 Following · 554 Followers
Pinned Tweet
Jinpeng Wang @awinyimgprocess
Humans see text — but LLMs don’t. I wrote a short blog post exploring how models can perceive text visually rather than tokenize it: 🔗 csu-jpg.github.io/Blog/people_se… From PIXEL, CLIPPO, VisInContext, VIST to DeepSeek-OCR, this is a quick story of how vision-centric modeling is changing how machines read, and a reflection on some of our own small efforts in the past two years.
8 replies · 39 reposts · 216 likes · 38.1K views
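A minimal sketch of the idea in the pinned post, assuming Pillow and arbitrary canvas/patch sizes (an illustration only, not the pipeline from the blog post): render a string to pixels, then slice the image into fixed-size patches that a vision encoder would consume in place of subword tokens.

from PIL import Image, ImageDraw

def render_text(text: str, width: int = 512, height: int = 64) -> Image.Image:
    # Rasterize the string onto a white canvas with Pillow's default bitmap font.
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).text((4, 4), text, fill="black")
    return img

def patchify(img: Image.Image, patch: int = 16):
    # Cut the rendered image into non-overlapping patch x patch tiles ("vision tokens").
    # Illustrative only: canvas size, font, and the 16x16 patch size are arbitrary choices.
    w, h = img.size
    return [img.crop((x, y, x + patch, y + patch))
            for y in range(0, h, patch)
            for x in range(0, w, patch)]

patches = patchify(render_text("Humans see text, but LLMs don't."))
print(len(patches), "vision patches of 16x16 pixels")  # (512/16) * (64/16) = 128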
Jinpeng Wang retweeted
AK @_akhaliq
Glance: Accelerating Diffusion Models with 1 Sample
[image attached]
4 replies · 18 reposts · 113 likes · 31.5K views
Xueyan Zou @xyz2maureen
I will join Tsinghua University, College of AI, as an Assistant Professor in the coming month. I am actively looking for 2026 spring interns and future PhDs (ping me if you are at #NeurIPS).

It has been an incredible journey of 10 years since I attended an activity organized by Tsinghua University and decided to change my undergraduate major from Economics to Computer Science, inspired by one of the teammates. During these 10 years I met many wonderful researchers and professors, whom I deeply appreciate and who led me to continued growth. 🐿️

My research focus will continue to be AI & Robotics, with a specific emphasis on Interactive Embodied Intelligence. You can check my homepage to learn more: maureenzou.github.io/lab.html.

I am currently local to San Diego and will be attending #NeurIPS. Please ping me over WeChat or email if any old or new friends are interested in having a coffee chat! (Really looking forward to meeting as many friends as possible at #NeurIPS.)

[The photo is one of the places that I will miss a lot in the US]
[image attached]
69 replies · 87 reposts · 1.1K likes · 111.1K views
Jinpeng Wang @awinyimgprocess
@iclr_conf Got three 🥚, all with confidence 5; 😃 a unique life experience. 😃 One reviewer forgot to remove the GPT call logs.
[three images attached]
1 reply · 0 reposts · 23 likes · 5.9K views
Jinpeng Wang retweeted
Kevin Lin @KevinQHLin
Thanks @_akhaliq for sharing our work! 🚀 Glad to introduce our newest work — VCode! 🎨

VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

For decades, RGB pixels have been the default medium for representing images. But in the agentic era, how can we move beyond raw pixels toward interpretable, executable, and symbolic visual representations?

🔍 We address this question with VCode, which reframes visual representation as SVG code — aligning with how humans reason over sketches and symbolic abstractions.

👥 Happy to collaborate with: Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, @LINJIEFUN, @philiptorr and @awinyimgprocess

📄 arXiv: arxiv.org/pdf/2511.02778
🌐 Website: csu-jpg.github.io/VCode/
💻 GitHub: github.com/CSU-JPG/VCode
🤗 Hugging Face daily paper: huggingface.co/papers/2511.02…
AK @_akhaliq

VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

1 reply · 4 reposts · 25 likes · 13.5K views
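To make the SVG-as-representation idea concrete, here is a toy sketch (not VCode's actual format or evaluation code): a scene is written as a handful of symbolic SVG primitives rather than RGB pixels, so it stays human-readable, editable, and executable by any SVG renderer.

def scene_to_svg() -> str:
    # Describe a toy scene (sun over a house) as symbolic SVG elements.
    # The shapes and coordinates are made up for illustration.
    elements = [
        '<circle cx="200" cy="40" r="20" fill="orange"/>',           # sun
        '<rect x="40" y="80" width="80" height="60" fill="gray"/>',  # house body
        '<polygon points="40,80 120,80 80,40" fill="brown"/>',       # roof
    ]
    body = "\n  ".join(elements)
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="256" height="160">\n  '
            + body + "\n</svg>")

with open("scene.svg", "w") as f:  # any SVG renderer (browser, rsvg, cairosvg) can rasterize it
    f.write(scene_to_svg())

Because the representation is code, an output like this can be rendered back to pixels and compared against a target image, which is the sense in which the post calls it executable.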
Kiera Fields @KieraFields7
@awinyimgprocess Hi Jinpeng, I'm a reporter for The Stack. I'm looking to chat to experts in OCR and vision compression. I can't seem to find your email on your website. Please reach out if you're interested! kiera@thestack.technology
1 reply · 0 reposts · 1 like · 98 views
Jinpeng Wang @awinyimgprocess
Post your paper title here if it is missing from the blog.
0 replies · 0 reposts · 0 likes · 336 views
Jinpeng Wang @awinyimgprocess
The idea "Compress Text With Visual Token". "already proposed in our NeurIPS2024 work <>, NeurIps 2025 work <> (arxiv 25'2) arxiv.org/abs/2406.02547 arxiv.org/pdf/2502.00791 The idea of processing text as image already proposed in a series of works: 1. LANGUAGE MODELLING WITH PIXELS; ICLR’23. 2. CLIPPO: Image-and-Language Understanding from Pixels Only; CVPR’23 3. Improving Language Und
[image attached]
0 replies · 0 reposts · 1 like · 62 views
Brian Roemmele @BrianRoemmele
BOOOOOOOM! CHINA DEEPSEEK DOES IT AGAIN! An entire encyclopedia compressed into a single, high-resolution image! — A mind-blowing breakthrough.

DeepSeek-OCR is an electrifying 3-billion-parameter vision-language model that obliterates the boundaries between text and vision with jaw-dropping optical compression! This isn't just an OCR upgrade—it's a seismic paradigm shift in how machines perceive and conquer data.

DeepSeek-OCR crushes long documents into vision tokens with a staggering 97% decoding precision at a 10x compression ratio! That's thousands of textual tokens distilled into a mere 100 vision tokens per page, outmuscling GOT-OCR2.0 (256 tokens) and MinerU2.0 (6,000 tokens) by up to 60x fewer tokens on OmniDocBench. It's like compressing an entire encyclopedia into a single, high-definition snapshot—mind-boggling efficiency at its peak!

At the core of this insanity is the DeepEncoder, a turbocharged fusion of the SAM (Segment Anything Model) and CLIP (Contrastive Language–Image Pretraining) backbones, supercharged by a 16x convolutional compressor. This maintains high-resolution perception while slashing activation memory, transforming thousands of image patches into a lean 100-200 vision tokens.

Get ready for the multi-resolution "Gundam" mode—scaling from 512x512 to a monstrous 1280x1280 pixels! It blends local tiles with a global view, tackling invoices, blueprints, and newspapers with zero retraining. It's a shape-shifting computational marvel, mirroring the human eye's dynamic focus with pixel-perfect precision!

The training data? Supplied by the Chinese government for free and not available to any US company. You understand now why I have said the US needs a Manhattan Project for AI training data? Do you hear me now? Oh, still no? I'll continue. Over 30 million PDF pages across 100 languages, spiked with 10 million natural scene OCR samples, 10 million charts, 5 million chemical formulas, and 1 million geometry problems! This model doesn't just read—it devours scientific diagrams and equations, turning raw data into multidimensional knowledge.

Throughput? Prepare to be floored—over 200,000 pages per day on a single NVIDIA A100 GPU! This scalability is a game-changer, turning LLM data generation into a firehose of innovation, democratizing access to terabytes of insight for every AI pioneer out there.

This optical compression is the holy grail for LLM long-context woes. Imagine a million-token document shrunk into a 100,000-token visual map—DeepSeek-OCR reimagines context as a perceptual playground, paving the way for a GPT-5 that processes documents like a supercharged visual cortex!

The two-stage architecture is pure engineering poetry: DeepEncoder generates tokens, while a Mixture-of-Experts decoder spits out structured Markdown with multilingual flair. It's a universal translator for the visual-textual multiverse, optimized for global domination!

Benchmarks? DeepSeek-OCR obliterates GOT-OCR2.0 and MinerU2.0, holding 60% accuracy at 20x compression! This opens a portal to applications once thought impossible—pushing the boundaries of computational physics into uncharted territory! Live document analysis, streaming OCR for accessibility, and real-time translation with visual context are now economically viable, thanks to this compression breakthrough. It's a real-time revolution, ready to transform our digital ecosystem!

This paper is a blueprint for the future—proving text can be visually compressed 10x for long-term memory and reasoning. It's a clarion call for a new AI era where perception trumps text, and models like GPT-5 see documents in a single, glorious glance.

I am experimenting with this now on 1870-1970 offline data that I have digitized. But be ready for a revolution! More soon.

[1] github.com/deepseek-ai/De…
[image attached]
344 replies · 1.4K reposts · 7.5K likes · 1.8M views
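A rough sketch of the 16x token compression described in the post above, under the assumption that "16x" means two stride-2 convolutions (each quarters the number of spatial tokens); this illustrates the mechanism only and is not DeepSeek-OCR's actual DeepEncoder code.

import torch
import torch.nn as nn

class TokenCompressor16x(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Each stride-2 conv halves both spatial dims, so the token count drops 4x per layer.
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, patch_grid: torch.Tensor) -> torch.Tensor:
        # patch_grid: [batch, dim, H, W] feature map over image patches
        return self.net(patch_grid)

x = torch.randn(1, 256, 64, 64)     # 64 * 64 = 4096 patch activations (assumed input size)
y = TokenCompressor16x()(x)         # -> [1, 256, 16, 16] = 256 tokens, 16x fewer
print(x.shape[2] * x.shape[3], "->", y.shape[2] * y.shape[3])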
Jinpeng Wang @awinyimgprocess
"Huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohes (Rawlinson, 1976)." – Graham Rawlinson 🚀 We propose new work: “See the Text: From Tokenization to Visual Reading” ,which renders text as images and uses a vision-centric pipeline instead of traditional subword tokenization. • Up to 4.4× fewer tokens & ~70% less FLOPs • Better cross-lingual transfer & robustness to typos/surface noise • Works on not only long-text compression, but also VQA and classification in NLP. 👉 arXiv: arxiv.org/pdf/2510.18840
[three images attached]
0 replies · 1 repost · 7 likes · 437 views
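A back-of-the-envelope sketch of how the quoted token reduction turns into compute savings, using the standard per-layer transformer cost (attention ~ N²·d, MLP ~ N·d²). The sequence length and model width here are arbitrary assumptions, and the paper's ~70% figure is end to end (including the vision encoder), which this toy count ignores.

def layer_flops(n_tokens: int, d_model: int) -> float:
    # Rough per-layer cost: QK^T plus attention-times-V, and two d <-> 4d linear layers.
    attention = 4 * n_tokens ** 2 * d_model
    mlp = 16 * n_tokens * d_model ** 2
    return attention + mlp

d = 4096                         # assumed model width
n_text = 8192                    # assumed subword-token length of a long document
n_visual = int(n_text / 4.4)     # rendered as images: ~4.4x fewer tokens, per the post
saving = 1 - layer_flops(n_visual, d) / layer_flops(n_text, d)
print(n_visual, "visual tokens, roughly {:.0%} fewer FLOPs per layer".format(saving))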
James Im @james_im
I don't know about y'all but I mainly use vision to input text into my brain
Andrej Karpathy @karpathy

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.

The more interesting part for me (esp. as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful
- delete the tokenizer (at the input)!!

I already ranted about how much I dislike the tokenizer. Tokenizers are ugly: a separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.

OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa. So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.

Now I have to also fight the urge to side quest an image-input-only version of nanochat...

37 replies · 33 reposts · 832 likes · 114.3K views
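One of Karpathy's points above (two characters that look identical to the eye become completely different tokens) is easy to check at the text level. A small sketch using a homoglyph pair as the example:

import unicodedata

# Latin "a" and Cyrillic "а" render as essentially the same glyph in any font that
# covers both scripts, yet at the text level they are unrelated symbols, so any
# byte- or subword-level tokenizer has to assign them different IDs.
for ch in ("a", "\u0430"):
    print(repr(ch), hex(ord(ch)), ch.encode("utf-8"), unicodedata.name(ch))
# 'a' 0x61  b'a'         LATIN SMALL LETTER A
# 'а' 0x430 b'\xd0\xb0'  CYRILLIC SMALL LETTER A

A pixel-level input sees the shared shape directly, which is the kind of transfer the post is pointing at.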