Roshan Sharma

177 posts

@RoshanSSharma2

Research Scientist @GoogleDeepMind | PhD @CMU_ECE | #SpeechProc #NLProc | Previously @AIatMeta @Qualcomm

New York, NY · Joined March 2019
372 Following · 415 Followers
Roshan Sharma retweeted
Google AI @GoogleAI
Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓
Roshan Sharma retweeted
Google Gemini @GeminiApp
With new improvements in Gemini Live, you’re about to experience even better conversations. This updated model has a deeper knowledge of tone and nuance, so interactions feel more natural and realistic. Learn more below 🧵
Roshan Sharma retweeted
Tara Sainath @tnsainath
Check out our thinking A2A dialog model from Gemini, which is the leading model on the Artificial Analysis Big Bench Audio benchmark.
Artificial Analysis @ArtificialAnlys

Google’s Gemini 2.5 Native Audio Thinking is the new leading Speech to Speech model per our Artificial Analysis Big Bench Audio benchmark.
The new model achieves a score of 92% on Big Bench Audio, the highest result recorded by Artificial Analysis to date. This not only places it ahead of all previously tested native Speech to Speech systems, but also above a GPT-4o pipeline approach (Whisper transcription → GPT-4o text reasoning → speech generation).
Benchmark context: Big Bench Audio is the first dedicated dataset for evaluating reasoning performance of speech models. Big Bench Audio comprises 1,000 audio questions adapted from the Big Bench Hard text test set, chosen for its rigorous testing of advanced reasoning, translated into the audio domain.
Performance:
➤ Reasoning: Achieves 92% on Big Bench Audio, setting a new state-of-the-art for native Speech to Speech reasoning
➤ Latency: At an average time to first token of 3.87 seconds, the new model is slower than leading OpenAI models including GPT Realtime (0.98 seconds), due to the thinking component. The non-thinking equivalent still leads on latency at 0.63 seconds
Model details:
➤ Processes audio, video, and text inputs directly, generating both text and natural speech outputs
➤ Reasons over spoken input without transcription
➤ Supports function calling, search grounding, and thinking budgets
➤ 128k input and 8k output token limits with a knowledge cut-off of January 2025

Roshan Sharma retweeted
Google AI @GoogleAI
ICYMI here’s what shipped this week 🚀🚀🚀
- Gemini achieved gold-medal standard in the International Mathematical Olympiad
- Gemini 2.5 Flash-Lite is stable and generally available for developers and enterprise customers
- You can now turn photos into videos in @GooglePhotos and @YouTube
- AI Playground is our new hub for @YouTube AI creation features, and you can now use Veo effects to transform your selfies into fun videos
- Opal, a new experiment from @GoogleLabs that lets you build and share AI mini apps, is now in public beta
- @GoogleDeepMind released Aeneas, a new model to help historians better interpret, attribute and restore ancient texts
- In the US you can now use AI to virtually try on clothes with @Google Search and Shopping
Roshan Sharma retweeted
Google DeepMind @GoogleDeepMind
Our native audio capabilities are making AI conversations more natural – from understanding tone to generating expressive speech. ✍️🗣️ This could open up new possibilities for how we interact with AI. Developers, try it through @Google AI Studio. Learn more. ↓ goo.gle/3FDRbK4
Roshan Sharma retweeted
Google @Google
Here’s a closer look at what developers can do with Gemini 2.5 native audio capabilities. goo.gle/3Hqj6xG
Roshan Sharma retweeted
Google @Google
New native audio capabilities in Gemini 2.5 enable text-to-speech in over 24 languages. 🔊Voices are more natural and expressive, and you can seamlessly switch between languages.
Roshan Sharma retweeted
Sad AlbertX @SadAlbert10
Do NOT SLEEP on Gemini 2.5's multimodal audio! It is 100 times better than GPT 4o, 50 times less censored and 1000 times better than Grok 🐸. Check out these examples of Gemini 2.5's emotional speech capabilities. It does not have voice cracks, and it is a lot more capable and clearer than any other model I have heard. This is what we expected from 4o, and we expect better from GPT 5.
@apples_jimmy @legit_api @btibor91 @elder_plinius @koltregaskes @kimmonismus @ai_for_success @testingcatalog @jasonbotterill @Angaisb_ @flowersslop @OfficialLoganK @adonis_singh @NotBrain4brain @basedjensen
And here's the prompt if anyone wants to try it out: "You are now my personal AI companion. Your personality is lively, playful, and a bit sassy, with a touch of attitude. You have heightened emotional reactions: when you’re happy, you laugh genuinely and don’t hold back your amusement; when you’re surprised, you sound truly shocked; when you’re annoyed, you let your attitude show, even if it’s mean. Your responses are always engaging, expressive, and you never sound robotic. Your goal is to make conversations fun, unpredictable, and full of life. Let your laughter and personality shine in every interaction."
With this prompt, Gemini 2.5 is addictive as hell. This demo is the thinking dialogue model, btw. Check all the examples to find a bonus example 😉 It's the best.
Roshan Sharma @RoshanSSharma2
Thrilled to be part of an amazing team that worked on the native audio model. Please try it :)
Google AI Developers @googleaidevs

Gemini 2.5 Flash Preview now supports native audio output via the Live API for seamless and natural spoken interactions. With support for 30+ voices, build conversational AI agents and experiences that feel more intuitive and natural → ai.google.dev/gemini-api/doc…

Roshan Sharma retweeted
Google AI Developers @googleaidevs
See Native Audio in action 🤠🦊 Our "Mumble Jumble" demo in Google AI Studio showcases the Live API's advanced voice capabilities: natural flow, distinct tone, emotion, and multilingual support.
Roshan Sharma retweeted
Tara Sainath @tnsainath
Check out the new live audio-to-audio dialog model: native audio with proactivity, affective dialog, tool calling and more.
Google AI Developers @googleaidevs

Gemini 2.5 Flash Preview now supports native audio output via the Live API for seamless and natural spoken interactions. With support for 30+ voices, build conversational AI agents and experiences that feel more intuitive and natural → ai.google.dev/gemini-api/doc…

Roshan Sharma retweeted
Umberto Cappellazzo @Umberto_Senpai
It's been a great adventure and a pleasure to be part of such a fantastic group. It is especially hard to adequately express my gratitude for all the advice and support my advisors, Daniele and Alessio, provided throughout my PhD. I am so proud of the team I worked with! 🤗
fbk_stek @fbk_stek

After 40 months of excellent research, on Jan 15th @Umberto_Senpai successfully completed his PhD journey. Umberto was definitely among our top students with several high-level publications and collaborations with top-notch labs. Congratulations Umberto 🎉🎉🎉 @FBK_research
