Ahmet Iscen
@ahmetius

Research scientist at Google DeepMind

45 posts · Joined October 2019 · 169 Following · 672 Followers
Ahmet Iscen retweeted
Alireza Fathi @alirezafathi
Our team at Google DeepMind Foundational Research has an opening for a full-time Research Scientist! Areas of interest: Multimodal, 3D and Spatial Reasoning, Self-improving Agents. Looking for candidates with strong publications at top ML and CV conferences. Email: af_hiring@google.com
Ahmet Iscen retweeted
Alireza Fathi @alirezafathi
Our team at Google DeepMind Foundational Research is hiring full-time Research Scientists and Research Interns! Multimodal, Reasoning, self-improving agents, Video Understanding. Looking for candidates with strong papers at top ML and CV conferences. Email: af_hiring@google.com
Ahmet Iscen @ahmetius
Want to work on the future of multimodal AI? Our Google DeepMind team in Grenoble, led by @CordeliaSchmid, is hiring interns for multimodal AI research (long-video understanding and visual reasoning in 2D and 3D). Email ai.gnb.hiring@gmail.com or find me at #NeurIPS2024!
Ahmet Iscen @ahmetius
Our new #NeurIPS2024 paper tackles web-scale visual entity recognition by automatically curating a training dataset with a multimodal LLM, achieving SOTA results (+6.9% on OVEN)! Learn how we use multimodal LLMs for label verification and data enrichment: arxiv.org/abs/2410.23676
Ahmet Iscen retweeted
Alireza Fathi @alirezafathi
Our team at Google DeepMind is seeking a Research Scientist with a strong publication record (multiple first-author papers) on multi-modal LLMs in top ML venues like NeurIPS, ICLR, CVPR. Email me at af_hiring@google.com @CordeliaSchmid
Ahmet Iscen retweeted
Yisong Yue @yisongyue
In case you missed our #ICML2024 oral presentation, check out SceneCraft, an LLM agent for writing Blender-executable code that can render complex scenes with up to a hundred 3D assets. Paper: arxiv.org/abs/2403.01248

The SceneCraft agent is able to do complex spatial planning and arrangement by maintaining a scene-graph blueprint and detailing spatial relationships among assets in the scene. SceneCraft leverages VLMs to iteratively refine a scene, and library learning to build a reusable spatial skill library. Taken together, SceneCraft is able to handle increasingly complex scenes and descriptions without external human expertise or LLM parameter tuning.

This work was led by the amazing @acbuller in collaboration with awesome colleagues at Google: @ahmetius, @aashi7jain, @tkipf, David Ross, @CordeliaSchmid, Alireza Fathi.
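The generate–render–critique loop the tweet describes can be sketched roughly as follows. This is an illustrative outline only: the function names, prompts, and the `llm`/`vlm_critic`/`render` callables are hypothetical stand-ins, not SceneCraft's actual API (the real agent emits Blender-executable code and builds a reusable skill library).

```python
# Hypothetical sketch of a SceneCraft-style refinement loop:
# 1) plan a scene-graph blueprint, 2) write scene code from it,
# 3) render, 4) ask a VLM critic to compare render vs. description,
# 5) revise the code until the critic is satisfied.

def scenecraft_loop(description, llm, vlm_critic, render, max_rounds=3):
    """llm(prompt) -> str; vlm_critic(description, image) -> feedback or "ok";
    render(code) -> image (e.g. by executing the code in Blender)."""
    # Plan: a scene-graph blueprint of assets and their spatial relations.
    scene_graph = llm(f"Plan a scene graph for: {description}")
    # First draft of the scene-construction code.
    code = llm(f"Write scene code for this scene graph: {scene_graph}")
    for _ in range(max_rounds):
        image = render(code)                       # execute and render the scene
        feedback = vlm_critic(description, image)  # VLM compares render to text
        if feedback == "ok":
            break
        # Revise the code using the critic's feedback.
        code = llm(f"Revise the code. Feedback: {feedback}\nCode: {code}")
    return code
```

The design choice worth noting is that correctness is judged visually (by a VLM looking at the render), not by inspecting the code, so the loop needs no human in it.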
Ahmet Iscen retweeted
Arsha Nagrani @NagraniArsha
Ahmet @ahmetius, @CordeliaSchmid and I are looking to hire a student researcher at @GoogleDeepMind this fall! Start: September. Loc: Cam, USA, but flexible. Unfortunately I’m not at @CVPRConf this year (2-month-old baby!! 👶), but please find Ahmet or Cordelia at #CVPR if interested!
Ahmet Iscen @ahmetius
🔥 Calling all #CVPR2024 attendees! 🔥 Join us for the 1st Tool-Augmented VIsion (TAVI) Workshop on Monday morning in Summit 321! 💡 5 inspiring keynote talks 🎨 5 invited posters from the main conference Don't miss out! ➡️ More info: sites.google.com/corp/view/tavi…
Ahmet Iscen @ahmetius
VLMs are great, but can we use their generative capabilities for web-scale entity recognition? GERALD leverages VLMs to generate unambiguous, language-based, discriminative codes for 6M-scale entity recognition. Looking forward to presenting GERALD at CVPR24!
Mathilde Caron @mcaron31

Happy to introduce GERALD - our new VLM that recognizes 6M+ entities, an exciting step towards Web-scale visual entity recognition! Predictions are simply made by auto-regressively decoding a code representing the entity name. Check out our CVPR24 paper: arxiv.org/abs/2403.02041

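The idea of recognizing an entity by auto-regressively decoding a short, unambiguous code can be sketched as below. This is a toy illustration under stated assumptions, not GERALD's actual method: here codes are shortest-unique lowercase name prefixes, and a score table stands in for the VLM's per-token probabilities; all names are hypothetical.

```python
# Toy sketch of generative entity recognition: map each entity name to a
# unique language-based code, then "recognize" by greedily decoding that
# code one character at a time, constrained to codes that can still match.

def build_codes(entities):
    """Assign each entity its shortest name prefix that no other entity shares."""
    codes = {}
    for name in entities:
        for k in range(1, len(name) + 1):
            prefix = name[:k].lower()
            clash = [e for e in entities if e != name and e.lower().startswith(prefix)]
            if not clash:
                codes[prefix] = name
                break
    return codes

def decode_entity(token_scores, codes):
    """Greedy constrained decoding; token_scores maps a partial code to a score
    (standing in for the VLM's next-token distribution given the image)."""
    decoded = ""
    while decoded not in codes:
        # Only continuations that can still reach some valid code are allowed.
        nxt = {c[len(decoded)] for c in codes
               if c.startswith(decoded) and len(c) > len(decoded)}
        # Extend with the highest-scoring admissible next character.
        decoded += max(nxt, key=lambda ch: token_scores.get(decoded + ch, 0.0))
    return codes[decoded]
```

Constraining decoding to valid code prefixes is what lets a fixed-vocabulary generative model cover millions of entities without a million-way classifier head.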
Ahmet Iscen retweeted
Ziniu Hu @acbuller
Interested in LLM + Tool-Use, via Tree-Search? This afternoon in #NeurIPS2023, #215, I'll present "AVIS: Autonomous Visual Information Seeking with Large Language Model Agent" (blog.research.google/2023/08/autono…) Feel free to drop by and chat.
Ahmet Iscen retweeted
Sundar Pichai @sundarpichai
Introducing Gemini 1.0, our most capable and general AI model yet. Built natively to be multimodal, it’s the first step in our Gemini era of models. Gemini is optimized in three sizes: Ultra, Pro, and Nano.

Gemini Ultra’s performance exceeds current state-of-the-art results on 30 of the 32 widely used academic benchmarks. With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU. blog.google/technology/ai/…
Ahmet Iscen retweeted
Ahmet Iscen @ahmetius
How do we find information on the web? We try to address this question in AVIS by coupling an #LLM-based reasoner and planner with external tools, e.g. search. This results in a significant performance increase on challenging fine-grained VQA datasets where SOTA VLMs struggle.
AK @_akhaliq

AVIS: Autonomous Visual Information Seeking with Large Language Models. Paper page: huggingface.co/papers/2306.08…

In this paper, we propose an autonomous information-seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the knowledge needed to answer the posed questions.

Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. It presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process.

The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users; this graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions.

We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.

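The planner–reasoner–memory loop described in the abstract can be sketched roughly as follows. This is an illustrative outline, not the paper's implementation: the tool names, prompts, and the contents of the transition graph here are hypothetical stand-ins, and the `llm` and `tools` callables are assumed interfaces.

```python
# Hypothetical sketch of an AVIS-style loop: a planner (LLM) picks the next
# tool, constrained by a transition graph mined from human decision traces;
# a reasoner (LLM) distills each raw tool output into a fact; a working
# memory accumulates the facts until the planner decides to answer.

# Which actions may follow each state (illustrative; mined from user traces).
TRANSITIONS = {
    "start": ["image_search", "web_search", "answer"],
    "image_search": ["web_search", "answer"],
    "web_search": ["answer"],
}

def avis_loop(question, image, llm, tools, transitions, max_steps=5):
    """llm(prompt, allowed=None) -> str; tools maps name -> callable(image, question)."""
    memory = []       # working memory of extracted facts
    state = "start"
    for _ in range(max_steps):
        # Planner: pick the next action, restricted by the transition graph.
        allowed = transitions.get(state, ["answer"])
        action = llm(f"Question: {question}\nFacts: {memory}\nPick one of {allowed}",
                     allowed=allowed)
        if action == "answer":
            break
        # Execute the chosen external tool (API call, search, detector, ...).
        output = tools[action](image, question)
        # Reasoner: extract the key information from the raw tool output.
        fact = llm(f"Extract the key fact from: {output}")
        memory.append(fact)
        state = action
    # Answer conditioned on everything retained in working memory.
    return llm(f"Answer '{question}' using these facts: {memory}"), memory
```

The transition graph is doing real work here: it prunes the combinatorial action space the abstract mentions, so the planner only ever chooses among actions humans actually took from the same state.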
Ahmet Iscen retweeted
Alireza Fathi @alirezafathi
🚀Introducing AVIS: a groundbreaking system that couples #LLM powered planning & reasoning with external tools, resulting in #StateOfTheArt performance on VQA datasets that demand external knowledge! 🧠🔍
AK @_akhaliq

AVIS: Autonomous Visual Information Seeking with Large Language Models. Paper page: huggingface.co/papers/2306.08… (full abstract quoted above)
