Guilin Liu

88 posts

@GuilinL

Research Scientist at NVIDIA.

Santa Clara, CA · Joined September 2016
285 Following · 888 Followers
Guilin Liu retweeted
NVIDIA AI @NVIDIAAI
Meet Nemotron 3 Nano Omni 👋 Our latest addition to the Nemotron family is the highest efficiency, open multimodal model with leading accuracy. 30B parameters. 256K context length. 🧵👇
Guilin Liu @GuilinL
One of our roles in LLM/VLM research at NVIDIA is to explore effective data recipes for training large-scale models and share them with the public, an area where transparency has been limited, as seen with models like Gemini, GPT-4o, and the Qwen-VL models. The Eagle2 project aligns closely with this mission. In this work, we have openly detailed our findings in curating the datasets to develop a frontier VLM, and we’re glad to see that the community is finding these contributions valuable.
Andi Marafioti@andimarafioti

The Eagle 2 paper from Nvidia is such a goldmine.

Guilin Liu retweeted
Andi Marafioti @andimarafioti
The Eagle 2 paper from Nvidia is such a goldmine.
Guilin Liu retweeted
NVIDIA AI Developer @NVIDIAAIDev
🥇Our NVIDIA Llama Nemotron Nano VL model is #1 on the OCRBench V2 leaderboard. Designed for advanced intelligent document processing and understanding, this model extracts diverse info from complex documents with precision, all on a single GPU. 📗 Get the technical details on the newest Nemotron model ➡️ nvda.ws/43L3e0a 📝 Try out the NVIDIA NIM ➡️ nvda.ws/4jyZ0yD
Guilin Liu retweeted
Rohan Paul @rohanpaul_ai
Cool paper from @nvidia
Prior methods for training LLMs for tool use rely on imitation or distilled reasoning, limiting generalization. Nemotron-Research-Tool-N1 uses rule-based reinforcement learning. It trains models with binary rewards evaluating only tool call structure and correctness, enabling autonomous reasoning.
📌 A binary reward on format and tool-call correctness teaches autonomous reasoning rather than imitation.
📌 Binary rule-based reward prevents reward hacking, boosting real-world generalization (80.38 percent Live BFCL).
📌 Using binary rewards on structure and tool call leverages SFT data without detailed reasoning steps.
----------
Methods Explored in this Paper 🔧:
→ The model uses a structured reasoning and action output format.
→ A binary reward checks adherence to this format and exact match of parsed tool calls to ground truth.
→ Training uses the Group Relative Policy Optimization (GRPO) algorithm on processed datasets.
→ Nemotron-Research-Tool-N1-7B achieved 84.82 percent accuracy on BFCL and 81.28 percent on API-Bank, outperforming GPT-4o.
------------
Paper - arxiv.org/abs/2505.00024v1
Paper Title: "Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning"
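The binary reward described in this tweet (format adherence plus an exact match of the parsed tool calls) could look roughly like the sketch below. It is an illustration only, not the paper's implementation: the `<think>`/`<tool_call>` tag names, the JSON encoding of the calls, and the function name `binary_reward` are all assumptions made for the example.

```python
import json
import re

# Hypothetical output format (an assumption, not taken from the paper):
# one <think> block followed by one <tool_call> block holding a JSON list.
_FORMAT = re.compile(
    r"\s*<think>.*?</think>\s*<tool_call>(?P<calls>.*?)</tool_call>\s*",
    re.DOTALL,
)


def binary_reward(output: str, ground_truth: list) -> int:
    """Return 1 only if the format is respected AND the calls match exactly."""
    match = _FORMAT.fullmatch(output)
    if match is None:
        return 0  # structure violated: missing or misplaced tags
    try:
        predicted = json.loads(match.group("calls"))
    except json.JSONDecodeError:
        return 0  # tool-call payload is not valid JSON
    # Exact match: same calls, same arguments, same order (order sensitivity
    # is a choice made for this sketch).
    return int(predicted == ground_truth)
```

There is no partial credit in this sketch: a correct call with a missing reasoning block scores 0, and so does a well-formatted answer with one wrong argument.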
Guilin Liu retweeted
Shaokun Zhang @ShaokunZhang1
Tool-using LLMs can learn to reason without reasoning traces. 🔥
We present Nemotron-Research-Tool-N1, a family of tool-using reasoning LLMs trained entirely via rule-based reinforcement learning: no reasoning supervision, no distillation.
📄 Paper: arxiv.org/pdf/2505.00024
💻 Code: github.com/NVlabs/Tool-N1 (Please consider giving us a ⭐️ to stay updated on the upcoming code release!)
🧠 Why this matters: Existing tool-call models rely heavily on supervised reasoning traces from stronger models, which are costly, brittle, and often imitative. We ask: Can LLMs learn to reason directly from tool success signals?
📦 What we did:
– Train Qwen2.5-7B/14B with a simple binary reward on tool-call correctness + reasoning format in R1-style
– No reasoning traces needed
– Evaluate on BFCL, API-Bank, and ACEBench
– Also study the role of SFT, RL, and widely adopted SFT-then-RL recipes in training tool-calling models
📈 Key findings:
– Tool-N1-7B/14B clearly outperform GPT-4o and open baselines on all benchmarks
– The widely adopted SFT+RL paradigm doesn’t necessarily lead to better performance than pure RL
– Binary reward > fine-grained reward, esp. for real-world queries
– Scaling works: bigger = better gains under our RL setup
🌟 Takeaway: Reasoning doesn’t have to be taught. With just a binary signal, LLMs can learn to reason and act. Tool-N1 sets a new direction for scalable, supervision-light training of tool-calling models.
Guilin Liu retweeted
Zhiding Yu @ZhidingYu
Thank you AK! Excited to introduce Eagle 2.5, NVIDIA’s latest vision-language model that brings strong long-context capabilities across both image and video understanding, all with just 8B parameters.
Most existing VLMs struggle with high-res inputs and long video contexts. Eagle 2.5 is designed to tackle both: it supports up to 512 video frames and is trained jointly on image + video data.
We introduce a new benchmark-scale dataset, Eagle-Video-110K, with over 110K annotated samples, including QA, localization, and summarization. Videos range from a few minutes to 3 hours, pushing the limits of long-form visual reasoning.
Key techniques:
• Information-First Sampling: spatially aware, quality-preserving frame selection
• Mixed image-video training for generalization
• Progressive long-context recipes up to 128K tokens
• Optimized decoding and inference for efficient deployment
Strong results across the board:
• SOTA on 6 out of 10 long video benchmarks
• Outperforms GPT-4o (0806) on 3/5 video tasks
• Outperforms Gemini 1.5 Pro on 4/6 video tasks
• Matches or beats Qwen2.5-VL-72B on multiple key datasets
• Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL
Evaluated on:
• Video-MME
• MVBench
• Charades-STA
• 1-Hour Video QA
• EgoSchema
• MLVU, LVBench, and more…
These tasks stress-test long-form visual understanding with dense supervision and temporal reasoning.
Model, demo, and dataset will be released soon.
Explore the project here: nvlabs.github.io/EAGLE/
Code: github.com/NVlabs/EAGLE
Tech Report: huggingface.co/papers/2504.15…
We're excited to contribute toward long-context, general-purpose VLMs, and would love to hear your feedback or ideas for collaboration.
Aran Komatsuzaki@arankomatsuzaki

Nvidia presents Eagle 2.5! - A family of frontier VLMs for long-context multimodal learning - Eagle 2.5-8B matches the results of GPT-4o and Qwen2.5-VL-72B on long-video understanding

Guilin Liu retweeted
Jim Fan @DrJimFan
Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI.
The power of a general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:
- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.
GR00T N1 is a single end-to-end neural net, from photons to actions:
- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.
We deploy N1 on the GR1 robot, the 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to a +30% boost in diverse manipulation tasks for household and industrial settings.
While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right.
Let’s solve robotics, together, one token at a time.
Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵
Guilin Liu retweeted
Zhiding Yu @ZhidingYu
Mr. @pmddomingos This is a country whose leader blatantly says "We lied, we cheated, we stole… we had entire training courses." And thus there's a conceited clown like you to spread China hate everywhere. Your self-imagined star-spangled awesomeness doesn't change the fact that Chinese researchers have become a major force in the AI community. China is also leading in industry general autonomy, robotics and AI applications. Your word can't change this fact and the successes don't come with fraud. If you think there’s a problem with this, there’s a problem with you.
Pedro Domingos@pmddomingos

Unbelievable. China has a huge problem with scientific fraud, and when someone alludes to it the conversation is about how racist they are.

Guilin Liu retweeted
Zhiding Yu @ZhidingYu
Thank you AK! @_akhaliq This is just the beginning of a long journey, as we focused more on the model design space with multi-encoders and on fair comparisons under controlled settings. More will come in future versions! 🧵[1/n]
Try our model & demo:
GitHub: github.com/NVlabs/Eagle
HuggingFace: huggingface.co/NVEagle
Report: huggingface.co/papers/2408.15…
AK@_akhaliq

Nvidia presents Eagle
Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
discuss: huggingface.co/papers/2408.15…
The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.

Guilin Liu @GuilinL
We have also worked on transformer-based diffusion (link below) and video diffusion (research.nvidia.com/labs/dir/pyoco/). However, we did them in two different projects. :) Congrats to OpenAI for proving scaling-up still works for video synthesis.
Guilin Liu retweeted
Rafael Valle @RafaelValleArt
My co-authors are presenting P-Flow at NeurIPS on Thursday at 5pm! We'd love to chat about generative models, audio synthesis and understanding! We are also hiring, including for internships, researchers with expertise in multimodal LLMs!
Guilin Liu @GuilinL
@tariqafridi16 NVIDIA used to have a PhD Residency Program, but it is not active now. We also have an internship program. We sometimes extend an internship if the project is interesting but unfinished, and for some long-term projects we may extend it further.
Tariq Habib Afridi @tariqafridi16
@GuilinL Are you open to co-supervision? I am a PhD student working on vision-language multimodal research. I would love to have your guidance for my PhD research. By the way, I am currently enrolled in a PhD program at Kyung Hee University, South Korea.
Guilin Liu @GuilinL
I am attending #NeurIPS2023 between Dec. 10th and Dec. 16th. We are recruiting researchers to work on multi-modal models and DL for graphics. Would love to have a chat if interested. We are the team that invented DLSS and Megatron-LM at NVIDIA.
Guilin Liu retweeted
Arash Vahdat @ArashVahdat
📢🔥 My team at NVIDIA research is looking for candidates with a fundamental generative learning background (ideally) in one of these domains:
- Gen AI for science (climate, chemistry, biology)
- Gen AI for 3D data
Apply via: bit.ly/3P9yxMC bit.ly/43ZgpZW
Guilin Liu @GuilinL
Thanks all for your interest. We prioritize graduate candidates unless the undergraduate is really exceptional.
Guilin Liu @GuilinL
We are hiring research interns at NVIDIA to work on large-scale multi-modal models. We prefer candidates who can start the internship soon or in winter and who have a strong background in vision and/or NLP. Please email guilinl@nvidia.com if interested.