Pranesh Santikellur
@PranesH_
Asst. Prof @ IIIT-Bangalore, CS PhD from IIT Kharagpur 🇮🇳
The whale is back! Janus 1.3B is a multi-modal LM for any-to-any tasks. It beats DALL-E 2 / SDXL in image generation and LLaVA-v1.5 7B in multimodal understanding - MIT licensed 🔥

Evaluations:
- MMBench: 69.4 (outperforms LLaVA-v1.5 7B: 67.9)
- SEED-Bench: 63.7 (outperforms LLaVA-v1.5 7B: 62.4)
- POPE: 87.0 (outperforms LLaVA-v1.5 7B: 85.5)
- MSCOCO-30K: FID of 8.53 (outperforms DALL-E 2: 9.0)
- GenEval: accuracy of 61% (outperforms SDXL: 58%)

Model Architecture:
> 1.3B parameters (outperforms models with 7B parameters)
> Two independent pathways for understanding and generation
> Unified Transformer: shares the same architecture for both pathways
- Uses the LLM's built-in tokenizer to convert text into discrete IDs
- Employs a SigLIP encoder to extract high-dimensional semantic features from images, flattened into a 1-D sequence
> Visual generation: uses a VQ tokenizer to convert images into discrete IDs, flattened into a 1-D sequence
- Feature mapping: understanding and generation adaptors map image features and codebook embeddings into the LLM input space
- Prediction heads: built-in head for text predictions, randomly initialized head for image predictions
> Model checkpoints are on the Hub and compatible w/ Transformers (remote code)

Congrats @deepseek_ai for yet another stellar release! 🔥
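To make the generation pathway concrete, here is a toy NumPy sketch of the flow described above: a VQ tokenizer quantizes image patches to discrete codebook IDs, the IDs are flattened into a 1-D sequence, and a generation adaptor maps the codebook embeddings into the LLM input space. All shapes, names, and the linear adaptor are illustrative assumptions, not the real Janus checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions; the real model uses a much larger codebook and dims)
codebook_size, code_dim, llm_dim = 16, 8, 32
codebook = rng.standard_normal((codebook_size, code_dim))
# Generation adaptor sketched as a single linear projection (assumption)
gen_adaptor = rng.standard_normal((code_dim, llm_dim))

def image_to_llm_inputs(image_patches):
    """Quantize patches to nearest codebook IDs, then map into LLM space."""
    # Nearest-neighbour vector quantization: squared distance to each code
    dists = ((image_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)                  # discrete IDs, a flat 1-D sequence
    return ids, codebook[ids] @ gen_adaptor     # embeddings in LLM input space

patches = rng.standard_normal((4 * 4, code_dim))  # a 4x4 patch grid, flattened
ids, llm_inputs = image_to_llm_inputs(patches)
print(ids.shape, llm_inputs.shape)  # (16,) (16, 32)
```

The key design point this illustrates is the decoupling: understanding uses continuous SigLIP features, while generation uses these discrete VQ IDs, and each gets its own adaptor into the shared Transformer.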