Yuriy Khamzyaev
213 posts

@walkeryr
Senior Software Engineer at @EF

And here are my takeaways from that conversation: oreilly.com/radar/is-ai-a-…

*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:

- Visual IQ test: yes, the ones that humans take!
- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or any pixels that contain text. Reason about the contents directly without explicit OCR. This is extremely useful to unlock AI-powered apps on multimedia web pages, or “text in the wild” from real world cams.
- Multimodal chat: have a conversation about a picture. You can even provide “follow-up” images in the middle.
- Broad visual understanding abilities, like captioning, visual question answering, object detection, scene layout, common sense reasoning, etc.
- Audio & speech recognition (??): wasn’t mentioned in Kosmos-1 paper, but Whisper is already an OpenAI API and should be fairly easy to integrate.

Note: the predictions are based on what Andreas Braun, Microsoft Germany CTO, allegedly said. They may or may not be accurate (that’s why I call it “prediction”). But Kosmos-1 is very real and rock solid. It offers a glimpse of either GPT-4 or whatever AI service that Microsoft will provide next. I find it difficult to believe Kosmos-1 will stay in the lab and not become a product.

In any case, prepare yourself for multimodal APIs - they’ll happen sooner or later!


AI has officially replaced CEOs.