
sunil mallya
@sunilmallya
CTO/Co-Founder @_FlipAI; ex-Head of Eng #AmazonComprehend #NLP, Creator #AWSDeepRacer #SF, Tweets on ML, RL, Privacy, Cats

Introducing TurboQuant: our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
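For a rough sense of why compressing the KV cache saves memory, here is a generic per-channel 4-bit quantization sketch in NumPy. This is not TurboQuant's actual algorithm (the linked blog describes that, and a generic scheme like this one is lossy, unlike the zero-accuracy-loss claim above); the function names and the symmetric quantization scheme are assumptions for illustration only.

```python
import numpy as np

# Hypothetical illustration of KV-cache quantization, NOT TurboQuant's
# method. Storing the fp16/fp32 key/value cache as 4-bit integers with
# one scale per channel cuts memory roughly 4x (plus small scale overhead).

def quantize_kv(cache: np.ndarray, bits: int = 4):
    """Symmetric per-channel quantization of a [tokens, channels] cache."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(cache).max(axis=0) / qmax      # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero channels
    q = np.clip(np.round(cache / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate float cache from quantized values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)  # toy KV cache
q, scale = quantize_kv(kv)
err = np.abs(dequantize_kv(q, scale) - kv).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

The interesting part of schemes like TurboQuant is exactly how they avoid the reconstruction error this naive version incurs while still hitting high compression ratios.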

Also: *EXPOSURE DOES NOT MEAN THREAT OF DISPLACEMENT.* It can literally mean the opposite: AI-exposed jobs may see increased hiring and attract higher wages. It all depends on (a) the elasticity of consumer demand and (b) the number of AI-exposed tasks in a job.
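The elasticity point can be made concrete with a toy model (the functional form and all numbers here are illustrative assumptions, not from the post): automation lowers the unit cost of output, and whether demand for the remaining human tasks rises or falls depends on how strongly quantity demanded responds to the lower price.

```python
# Toy model of the elasticity argument (illustrative assumptions only).
# AI automates a share of tasks, lowering unit cost/price; with elastic
# consumer demand, output expands enough that demand for the remaining
# human-performed tasks actually increases.

def human_labor_demand(automated_share: float, elasticity: float) -> float:
    """Relative human labor demand after automation (baseline = 1.0).

    Assumes price falls in proportion to the automated cost share and
    quantity responds with constant demand elasticity: Q = P^(-e).
    """
    price = 1.0 - automated_share          # cheaper output after automation
    quantity = price ** (-elasticity)      # demand response to lower price
    human_tasks = 1.0 - automated_share    # share of tasks still done by people
    return quantity * human_tasks

# Inelastic demand (e = 0.5): automation displaces labor.
print(human_labor_demand(0.30, 0.5))   # < 1.0
# Elastic demand (e = 2.0): output grows enough that hiring increases.
print(human_labor_demand(0.30, 2.0))   # > 1.0
```

With 30% of tasks automated, labor demand shrinks when demand is inelastic but grows when it is elastic, which is exactly the (a)/(b) dependence the tweet describes.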

BREAKING: Amazon reportedly holds mandatory meeting after “vibe coded” changes trigger major outages.

The new multimodal Qwen3.5 @Alibaba_Qwen running locally on an old iPhone 12 at pretty good speed (10 tok/s). Uses a modified llama.cpp backend to support the Metal memory layout for Apple A-series chips, plus SSM ops. Built on top of @ggerganov's great library with code from @claudeai, despite my zero iOS experience and only old kernel-dev experience :)

🚀 Introducing the Qwen 3.5 Small Model Series
Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B

✨ More intelligence, less compute. These small models are built on the same Qwen3.5 foundation: native multimodal, improved architecture, scaled RL.
• 0.8B / 2B → tiny, fast, great for edge devices
• 4B → a surprisingly strong multimodal base for lightweight agents
• 9B → compact, but already closing the gap with much larger models

And yes, we're releasing the Base models as well. We hope this better supports research, experimentation, and real-world industrial innovation.
Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…

I just built an iOS app that runs @liquidai VL1.6B model locally on an iPhone 12 at ⚡️ speed: 15 tok/sec. I had to write a custom memory allocator for the llama.cpp backend to make it use the iPhone's Metal GPU. Fun project that spanned a few weekends!
