Bandhav Veluri
@b_veluri
Research @sesame. Prev: @AIatMeta, PhD @uwcse, and IIT Roorkee.

Like humans on a phone call, can spoken LLMs adapt to network latency? Our #EMNLP2024 paper, SyncLLM (joint work between @uwcse & @AIatMeta), shows they can learn this ability... as a result, full-duplex voice AI systems can hide latency with high token throughput! (1/4)
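For intuition on the latency-hiding claim, here is a back-of-the-envelope sketch, not SyncLLM's actual mechanism or code, with throughput numbers assumed purely for illustration: a model that generates speech tokens faster than real time can build up a playback buffer, and that buffer is exactly the amount of network delay it can absorb.

```python
# Back-of-the-envelope sketch of latency hiding via token throughput.
# NOT SyncLLM's implementation; all numbers below are assumed for illustration.

REALTIME_TOKENS_PER_SEC = 50.0  # tokens consumed per second of played audio (assumed)
MODEL_TOKENS_PER_SEC = 80.0     # model generation throughput (assumed)

def lead_up_time(hidden_delay_sec: float) -> float:
    """Seconds of generation needed to build a buffer that absorbs
    `hidden_delay_sec` of network delay, given surplus throughput."""
    surplus = MODEL_TOKENS_PER_SEC - REALTIME_TOKENS_PER_SEC
    assert surplus > 0, "throughput must exceed real time to build a buffer"
    # A buffer of `hidden_delay_sec` seconds of audio costs that many seconds'
    # worth of real-time tokens; the buffer grows at `surplus` tokens/sec.
    return hidden_delay_sec * REALTIME_TOKENS_PER_SEC / surplus

if __name__ == "__main__":
    for delay in (0.1, 0.2, 0.5):  # network delays to hide, in seconds
        print(f"{delay:.1f}s of latency hidden after {lead_up_time(delay):.2f}s lead-up")
```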

Excited to share our new paper: IRIS, a wireless ring for vision-based smart home control. Tired of awkward voice commands? We built a ring that streams camera data to your phone, where a neural net identifies the device you want to control—all with a simple point-and-click.
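A hypothetical sketch of the control loop the tweet describes, with every class and function name invented for illustration (this is not the IRIS codebase): the ring streams a camera frame to the phone, an on-phone model classifies the pointed-at device, and the click dispatches the command.

```python
# Hypothetical sketch of an IRIS-style point-and-click loop.
# All names here are invented for illustration; this is not the paper's code.

from dataclasses import dataclass

@dataclass
class Frame:
    pixels: bytes  # e.g., a JPEG streamed from the ring's camera to the phone

def classify_device(frame: Frame) -> str:
    """Stand-in for the on-phone neural net that identifies the pointed-at
    device; a real system would run a lightweight vision model here."""
    return "lamp"  # fixed label so the sketch runs end to end

class SmartHome:
    def toggle(self, device: str) -> None:
        print(f"toggling {device}")  # stand-in for a real smart-home API call

def on_click(frame: Frame, home: SmartHome) -> None:
    # Point the ring at a device and click: classify the frame, then control.
    home.toggle(classify_device(frame))

if __name__ == "__main__":
    on_click(Frame(pixels=b""), SmartHome())
```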

Open science is how we continue to push technology forward, and today at Meta FAIR we're sharing eight new AI research artifacts, including new models, datasets, and code to inspire innovation in the community. More in the video from @jpineau1. This work is another important step towards our goal of achieving Advanced Machine Intelligence (AMI).

What we're releasing:
• Meta Spirit LM: An open source language model for seamless speech and text integration.
• Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects, and occlusion handling. Plus a new developer suite to make it easier for developers to build with SAM 2.
• Layer Skip: Inference code and fine-tuned checkpoints demonstrating a new method for enhancing LLM performance.
• SALSA: New code to enable researchers to benchmark AI-based attacks in support of validating security for post-quantum cryptography.
• Meta Lingua: A lightweight and self-contained codebase designed to train language models at scale.
• Meta Open Materials: New open source models and the largest dataset of its kind to accelerate AI-driven discovery of new inorganic materials.
• MEXMA: A new research paper and code for our novel pre-trained cross-lingual sentence encoder with coverage across 80 languages.
• Self-Taught Evaluator: A new method for generating synthetic preference data to train reward models without relying on human annotations.

Access to state-of-the-art AI creates opportunities for everyone. We're excited to share this work and look forward to seeing the community innovation that results from it. Details and access to everything released by FAIR today ➡️ go.fb.me/hgtkel

@_akhaliq Awesome work! How accurate is the binaural targeting? It seems like it could be very useful for improving transcription in complex environments if a participant assisted by intentionally capturing each speaker's sample.

Look Once to Hear: Target Speech Hearing with Noisy Examples

In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing that ignores all interfering speech and noise except the target speaker. A naive approach is to require a clean speech example to enroll the target speaker. This, however, is not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user-interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise. Our system achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms audio chunks in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. Taking a step back, this paper takes an important step towards enhancing human auditory perception with artificial intelligence.
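The abstract's real-time numbers are worth sanity-checking: processing an 8 ms chunk in 6.24 ms is a real-time factor of about 0.78, so the pipeline keeps up with the incoming stream. Below is a minimal sketch of that chunked loop; the `model` call and `enrollment_embedding` are placeholders, not the paper's network.

```python
# Sanity check of the abstract's real-time claim, plus a minimal streaming
# loop. The `model` callable is a placeholder, not the paper's network.

CHUNK_MS = 8.0      # audio chunk duration, from the abstract
COMPUTE_MS = 6.24   # per-chunk processing time on an embedded CPU, from the abstract

rtf = COMPUTE_MS / CHUNK_MS
print(f"real-time factor: {rtf:.2f} (below 1.0, so the stream never falls behind)")

def extract_target(chunks, enrollment_embedding, model):
    """Yield target-speaker audio chunk by chunk, conditioned on a single
    noisy binaural enrollment example (the 'look once' capture)."""
    for chunk in chunks:
        yield model(chunk, enrollment_embedding)
```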

Want to hear a friend in a noisy café? We designed deep learning-based headphones that let you isolate the speech from a specific person just by *looking* at them for a few seconds. CHI'24 honorable mention award. Paper: arxiv.org/abs/2405.06289 Code: github.com/vb000/LookOnce…

AI headphones let a wearer listen to a single person in a crowd by looking at them just once. The system, called "Target Speech Hearing," then cancels all other sounds and plays just that person's voice in real time, even as the listener moves around in noisy places and no longer faces the speaker. [read more: washington.edu/news/2024/05/2…]
