Deepak Narayanan
402 posts

@deepakn94
Research Scientist at @nvidia. Interested in the intersection of Computer Systems and ML. Occasionally tweet about sports. Views are my own.






Nemotron-H: A family of Hybrid Mamba-Transformer LLMs.
* Hybrid architecture means up to 3X faster at the same accuracy
* Trained in FP8
* Great for VLMs
* Weights and instruct versions to come soon.
research.nvidia.com/labs/adlr/nemo…



@StoolGreenie Is there a certain # the Celtics need to win by tonight to be in, or is it just as much as possible to feel safe

@MistralAI and @nvidia announce Mistral-NeMo 12B, an awesome bite-size model released under Apache 2.0 that we jointly trained. FP8 aligned checkpoint and 128k context window, great benchmark scores. blogs.nvidia.com/blog/mistral-n… mistral.ai/news/mistral-n…




YC W24's @team_patchwork supercharges your team’s communication — they replace your Slack with a ranked feed of information, AI-assisted post creation, and chat for urgent needs. ycombinator.com/launches/KVH-p… Congrats on the launch, @nikkiashah & @dhruhink!





Seeing a lot of discussion around the fairness of this leaderboard, so figured I would bring up our recent paper that appeared at NeurIPS (openreview.net/pdf?id=RJpAz15…) where we look into some of the issues that make it hard to compare LLM APIs. (1/n)



📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market.

Our goal with this leaderboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference solutions, featuring key providers such as @replicate, @awscloud, and @togethercompute!

You can find the leaderboard here: github.com/ray-project/ll…

The LLMPerf leaderboard tracks three main metrics: time-to-first-token, inter-token latency, and success rate.
- Time-to-first-token (TTFT) measures the time it takes between the query and the first response of the provider. TTFT is especially important for interactive and streaming applications, such as chatbots.
- Inter-token latency measures the average time between consecutive tokens. This is important for applications that require the entirety of the response to be ready, like summarization tasks or agent use cases.
- Finally, success rate measures the number of successful responses where the inference API operates without errors. This measure reflects the reliability and stability of the API provider.

Blog announcement: anyscale.com/blog/comparing… (1/2)
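
The three metrics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LLMPerf implementation: the `RequestTrace` record and the function names are hypothetical, timestamps are assumed to be in seconds, and success rate is assumed to be reported as a fraction of requests rather than a raw count.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (not an LLMPerf type)."""
    sent_at: float              # time the query was sent, in seconds
    token_times: List[float]    # arrival time of each streamed token
    error: Optional[str] = None # None means the request succeeded


def time_to_first_token(trace: RequestTrace) -> Optional[float]:
    """Time between sending the query and receiving the first token."""
    if not trace.token_times:
        return None
    return trace.token_times[0] - trace.sent_at


def inter_token_latency(trace: RequestTrace) -> Optional[float]:
    """Average gap between consecutive tokens; needs at least two tokens."""
    times = trace.token_times
    if len(times) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return mean(gaps)


def success_rate(traces: List[RequestTrace]) -> float:
    """Fraction of requests that completed without an error (assumed definition)."""
    if not traces:
        return 0.0
    return sum(t.error is None for t in traces) / len(traces)


ok = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0])
failed = RequestTrace(sent_at=0.0, token_times=[], error="timeout")
```

With these two example traces, `time_to_first_token(ok)` covers the wait before any output appears, `inter_token_latency(ok)` averages the two gaps between the three tokens, and `success_rate([ok, failed])` counts one success out of two requests.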







