Jinpeng Wang

182 posts


@awinyimgprocess

Tenure-track Professor at Central South University; PhD from NUS. Focus on Multi-modality Learning and Data-centric AI.

Singapore · Joined April 2017
161 Following · 554 Followers
Pinned Tweet
Jinpeng Wang @awinyimgprocess
Humans see text — but LLMs don’t. I wrote a short blog post exploring how models can perceive text visually rather than tokenize it: 🔗 csu-jpg.github.io/Blog/people_se… From PIXEL, CLIPPO, VisInContext, VIST to DeepSeek-OCR, this is a quick story of how vision-centric modeling is changing how machines read, and a reflection on some of our own small efforts in the past two years.
8 replies · 39 reposts · 216 likes · 38.1K views
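A minimal sketch of the idea in the pinned post, assuming Pillow and arbitrary canvas/patch sizes (an illustration only, not the pipeline from the blog post): render a string to pixels, then slice the image into fixed-size patches that a vision encoder would consume in place of subword tokens.

from PIL import Image, ImageDraw

def render_text(text: str, width: int = 512, height: int = 64) -> Image.Image:
    # Rasterize the string onto a white canvas with Pillow's default bitmap font.
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).text((4, 4), text, fill="black")
    return img

def patchify(img: Image.Image, patch: int = 16):
    # Cut the rendered image into non-overlapping patch x patch tiles ("vision tokens").
    # Illustrative only: canvas size, font, and the 16x16 patch size are arbitrary choices.
    w, h = img.size
    return [img.crop((x, y, x + patch, y + patch))
            for y in range(0, h, patch)
            for x in range(0, w, patch)]

patches = patchify(render_text("Humans see text, but LLMs don't."))
print(len(patches), "vision patches of 16x16 pixels")  # (512/16) * (64/16) = 128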
Jinpeng Wang retweeted
AK @_akhaliq
Glance: Accelerating Diffusion Models with 1 Sample
[image attached]
4 replies · 18 reposts · 113 likes · 31.5K views
Xueyan Zou @xyz2maureen
I will join Tsinghua University, College of AI, as an Assistant Professor in the coming month. I am actively looking for 2026 spring interns and future PhDs (ping me if you are at #NeurIPS).

It has been an incredible journey of 10 years since I attended an activity organized by Tsinghua University and decided to change my undergraduate major from Economics to Computer Science, inspired by one of the teammates. During these 10 years I met many wonderful researchers and professors, whom I deeply appreciate and who led me to continued growth. 🐿️

My research focus will continue to be AI & Robotics, with a specific emphasis on Interactive Embodied Intelligence. You can check my homepage to learn more: maureenzou.github.io/lab.html.

I am currently local to San Diego and will be attending #NeurIPS. Please ping me over WeChat or email if any old or new friends are interested in having a coffee chat! (Really looking forward to meeting as many friends as possible at #NeurIPS.)

[The photo is one of the places that I will miss a lot in the US]
[image attached]
69 replies · 87 reposts · 1.1K likes · 111.1K views
Jinpeng Wang @awinyimgprocess
@iclr_conf Got three 🥚, all with confidence 5; 😃 a unique life experience. 😃 One reviewer forgot to remove the GPT call logs.
[three images attached]
1 reply · 0 reposts · 23 likes · 5.9K views
Jinpeng Wang retweeted
Kevin Lin @KevinQHLin
Thanks @_akhaliq for sharing our work! 🚀 Glad to introduce our newest work — VCode! 🎨

VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

For decades, RGB pixels have been the default medium for representing images. But in the agentic era, how can we move beyond raw pixels toward interpretable, executable, and symbolic visual representations?

🔍 We address this question with VCode, which reframes visual representation as SVG code — aligning with how humans reason over sketches and symbolic abstractions.

👥 Happy to collaborate with: Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, @LINJIEFUN, @philiptorr and @awinyimgprocess

📄 arXiv: arxiv.org/pdf/2511.02778
🌐 Website: csu-jpg.github.io/VCode/
💻 GitHub: github.com/CSU-JPG/VCode
🤗 Hugging Face daily paper: huggingface.co/papers/2511.02…
AK @_akhaliq

VCode: A Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

1 reply · 4 reposts · 25 likes · 13.5K views
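To make the SVG-as-representation idea concrete, here is a toy sketch (not VCode's actual format or evaluation code): a scene is written as a handful of symbolic SVG primitives rather than RGB pixels, so it stays human-readable, editable, and executable by any SVG renderer.

def scene_to_svg() -> str:
    # Describe a toy scene (sun over a house) as symbolic SVG elements.
    # The shapes and coordinates are made up for illustration.
    elements = [
        '<circle cx="200" cy="40" r="20" fill="orange"/>',           # sun
        '<rect x="40" y="80" width="80" height="60" fill="gray"/>',  # house body
        '<polygon points="40,80 120,80 80,40" fill="brown"/>',       # roof
    ]
    body = "\n  ".join(elements)
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="256" height="160">\n  '
            + body + "\n</svg>")

with open("scene.svg", "w") as f:  # any SVG renderer (browser, rsvg, cairosvg) can rasterize it
    f.write(scene_to_svg())

Because the representation is code, an output like this can be rendered back to pixels and compared against a target image, which is the sense in which the post calls it executable.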
Kiera Fields @KieraFields7
@awinyimgprocess Hi Jinpeng, I'm a reporter for The Stack. I'm looking to chat to experts in OCR and vision compression. I can't seem to find your email on your website. Please reach out if you're interested! kiera@thestack.technology
1 reply · 0 reposts · 1 like · 98 views
Jinpeng Wang @awinyimgprocess
Post your paper title here if it is missing from the blog.
0 replies · 0 reposts · 0 likes · 336 views
Jinpeng Wang @awinyimgprocess
The idea "Compress Text With Visual Token". "already proposed in our NeurIPS2024 work <>, NeurIps 2025 work <> (arxiv 25'2) arxiv.org/abs/2406.02547 arxiv.org/pdf/2502.00791 The idea of processing text as image already proposed in a series of works: 1. LANGUAGE MODELLING WITH PIXELS; ICLR’23. 2. CLIPPO: Image-and-Language Understanding from Pixels Only; CVPR’23 3. Improving Language Und
[image attached]
0 replies · 0 reposts · 1 like · 62 views
Brian Roemmele @BrianRoemmele
BOOOOOOOM! CHINA DEEPSEEK DOES IT AGAIN! An entire encyclopedia compressed into a single, high-resolution image! — A mind-blowing breakthrough.

DeepSeek-OCR is an electrifying 3-billion-parameter vision-language model that obliterates the boundaries between text and vision with jaw-dropping optical compression! This isn't just an OCR upgrade—it's a seismic paradigm shift in how machines perceive and conquer data.

DeepSeek-OCR crushes long documents into vision tokens with a staggering 97% decoding precision at a 10x compression ratio! That's thousands of textual tokens distilled into a mere 100 vision tokens per page, outmuscling GOT-OCR2.0 (256 tokens) and MinerU2.0 (6,000 tokens) by up to 60x fewer tokens on OmniDocBench. It's like compressing an entire encyclopedia into a single, high-definition snapshot—mind-boggling efficiency at its peak!

At the core of this insanity is the DeepEncoder, a turbocharged fusion of the SAM (Segment Anything Model) and CLIP (Contrastive Language–Image Pretraining) backbones, supercharged by a 16x convolutional compressor. This maintains high-resolution perception while slashing activation memory, transforming thousands of image patches into a lean 100-200 vision tokens.

Get ready for the multi-resolution "Gundam" mode—scaling from 512x512 to a monstrous 1280x1280 pixels! It blends local tiles with a global view, tackling invoices, blueprints, and newspapers with zero retraining. It's a shape-shifting computational marvel, mirroring the human eye's dynamic focus with pixel-perfect precision!

The training data? Supplied by the Chinese government for free and not available to any US company. You understand now why I have said the US needs a Manhattan Project for AI training data? Do you hear me now? Oh, still no? I'll continue. Over 30 million PDF pages across 100 languages, spiked with 10 million natural scene OCR samples, 10 million charts, 5 million chemical formulas, and 1 million geometry problems! This model doesn't just read—it devours scientific diagrams and equations, turning raw data into multidimensional knowledge.

Throughput? Prepare to be floored—over 200,000 pages per day on a single NVIDIA A100 GPU! This scalability is a game-changer, turning LLM data generation into a firehose of innovation, democratizing access to terabytes of insight for every AI pioneer out there.

This optical compression is the holy grail for LLM long-context woes. Imagine a million-token document shrunk into a 100,000-token visual map—DeepSeek-OCR reimagines context as a perceptual playground, paving the way for a GPT-5 that processes documents like a supercharged visual cortex!

The two-stage architecture is pure engineering poetry: DeepEncoder generates tokens, while a Mixture-of-Experts decoder spits out structured Markdown with multilingual flair. It's a universal translator for the visual-textual multiverse, optimized for global domination!

Benchmarks? DeepSeek-OCR obliterates GOT-OCR2.0 and MinerU2.0, holding 60% accuracy at 20x compression! This opens a portal to applications once thought impossible—pushing the boundaries of computational physics into uncharted territory! Live document analysis, streaming OCR for accessibility, and real-time translation with visual context are now economically viable, thanks to this compression breakthrough. It's a real-time revolution, ready to transform our digital ecosystem!

This paper is a blueprint for the future—proving text can be visually compressed 10x for long-term memory and reasoning. It's a clarion call for a new AI era where perception trumps text, and models like GPT-5 see documents in a single, glorious glance.

I am experimenting with this now on 1870-1970 offline data that I have digitized. But be ready for a revolution! More soon.

[1] github.com/deepseek-ai/De…
[image attached]
344 replies · 1.4K reposts · 7.5K likes · 1.8M views
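A rough sketch of the 16x token compression described in the post above, under the assumption that "16x" means two stride-2 convolutions (each quarters the number of spatial tokens); this illustrates the mechanism only and is not DeepSeek-OCR's actual DeepEncoder code.

import torch
import torch.nn as nn

class TokenCompressor16x(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Each stride-2 conv halves both spatial dims, so the token count drops 4x per layer.
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, patch_grid: torch.Tensor) -> torch.Tensor:
        # patch_grid: [batch, dim, H, W] feature map over image patches
        return self.net(patch_grid)

x = torch.randn(1, 256, 64, 64)     # 64 * 64 = 4096 patch activations (assumed input size)
y = TokenCompressor16x()(x)         # -> [1, 256, 16, 16] = 256 tokens, 16x fewer
print(x.shape[2] * x.shape[3], "->", y.shape[2] * y.shape[3])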
Jinpeng Wang @awinyimgprocess
"Huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohes (Rawlinson, 1976)." – Graham Rawlinson 🚀 We propose new work: “See the Text: From Tokenization to Visual Reading” ,which renders text as images and uses a vision-centric pipeline instead of traditional subword tokenization. • Up to 4.4× fewer tokens & ~70% less FLOPs • Better cross-lingual transfer & robustness to typos/surface noise • Works on not only long-text compression, but also VQA and classification in NLP. 👉 arXiv: arxiv.org/pdf/2510.18840
[three images attached]
0 replies · 1 repost · 7 likes · 437 views
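A back-of-the-envelope sketch of how the quoted token reduction turns into compute savings, using the standard per-layer transformer cost (attention ~ N²·d, MLP ~ N·d²). The sequence length and model width here are arbitrary assumptions, and the paper's ~70% figure is end to end (including the vision encoder), which this toy count ignores.

def layer_flops(n_tokens: int, d_model: int) -> float:
    # Rough per-layer cost: QK^T plus attention-times-V, and two d <-> 4d linear layers.
    attention = 4 * n_tokens ** 2 * d_model
    mlp = 16 * n_tokens * d_model ** 2
    return attention + mlp

d = 4096                         # assumed model width
n_text = 8192                    # assumed subword-token length of a long document
n_visual = int(n_text / 4.4)     # rendered as images: ~4.4x fewer tokens, per the post
saving = 1 - layer_flops(n_visual, d) / layer_flops(n_text, d)
print(n_visual, "visual tokens, roughly {:.0%} fewer FLOPs per layer".format(saving))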
James Im @james_im
I don't know about y'all but I mainly use vision to input text into my brain
Andrej Karpathy @karpathy

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.

The more interesting part for me (esp. as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful
- delete the tokenizer (at the input)!!

I already ranted about how much I dislike the tokenizer. Tokenizers are ugly: a separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.

OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa. So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.

Now I have to also fight the urge to side quest an image-input-only version of nanochat...

37 replies · 33 reposts · 832 likes · 114.3K views
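One of Karpathy's points above (two characters that look identical to the eye become completely different tokens) is easy to check at the text level. A small sketch using a homoglyph pair as the example:

import unicodedata

# Latin "a" and Cyrillic "а" render as essentially the same glyph in any font that
# covers both scripts, yet at the text level they are unrelated symbols, so any
# byte- or subword-level tokenizer has to assign them different IDs.
for ch in ("a", "\u0430"):
    print(repr(ch), hex(ord(ch)), ch.encode("utf-8"), unicodedata.name(ch))
# 'a' 0x61  b'a'         LATIN SMALL LETTER A
# 'а' 0x430 b'\xd0\xb0'  CYRILLIC SMALL LETTER A

A pixel-level input sees the shared shape directly, which is the kind of transfer the post is pointing at.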