Trainy

61 posts


@TrainyAI

Building open source tools for distributed training.

Joined June 2023
30 Following · 63 Followers
Trainy retweeted
roanak @roanakb
NeptuneAI shuts down March 5th. @TrainyAI just launched Pluto on @ycombinator, a drop-in replacement so you don't lose years of experiment data. Swap one import. Dual-log to validate. Export your history. Open source. On Neptune's official transition hub. ycombinator.com/launches/PLM-p…
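The migration path in the launch above ("swap one import, dual-log to validate") suggests a Neptune-compatible API. Below is a minimal sketch of what dual-logging during the transition might look like, assuming a hypothetical `pluto` module that mirrors Neptune's run interface; this is illustrative, not verified against the actual release:

```python
# Hedged sketch of "swap one import, dual-log to validate".
# The `pluto` module name and its Neptune-compatible surface are
# assumptions based on the launch post, not verified API.
import neptune                 # existing logger, shutting down
import pluto as new_logger     # hypothetical drop-in replacement

old_run = neptune.init_run(project="team/project")
new_run = new_logger.init_run(project="team/project")

for step, loss in enumerate([0.9, 0.7, 0.5]):
    # Dual-log the same metrics to both backends to validate parity
    # before cutting over entirely.
    old_run["train/loss"].append(loss)
    new_run["train/loss"].append(loss)

old_run.stop()
new_run.stop()
```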
Trainy retweeted
roanak @roanakb
@TrainyAI's Konduktor platform helps bring the benefits of a leading research team to your GPU cluster. We provide a fault-tolerant scheduler, integrated observability, and more. Check out our docs: konduktor.readthedocs.io/en/latest/
Trainy retweeted
roanak @roanakb
This leads to significantly higher (>80%) GPU usage. Add in some fault tolerance to the infrastructure, and we see:
- No more manual restarts at 2am.
- ML engineers get to focus on their jobs, rather than becoming DevOps experts.
Trainy retweeted
roanak @roanakb
Top-tier AI research teams (Meta, OpenAI, etc.) have figured out the most efficient way to work with a cluster of GPUs. Instead of managing each GPU separately, they create pools of GPU nodes and let sophisticated schedulers manage GPU availability efficiently.
Trainy retweeted
roanak @roanakb
At @TrainyAI, we built a controller within Konduktor to monitor GPU node health and isolate unhealthy nodes. This way, if a job fails, no manual intervention is required. K8s does its magic of placing work only on healthy nodes, and we forward the relevant GPU/NCCL logs to your CSP. 🚀
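A minimal sketch of the isolation step described above, using the official Kubernetes Python client. The `GpuHealthy` node condition is a hypothetical stand-in for whatever health signal (DCGM, node-problem-detector, etc.) a real controller would consume; this is not Konduktor's actual code:

```python
# Sketch of a node-isolation loop with the Kubernetes Python client.
# The "GpuHealthy" condition name is hypothetical.
from kubernetes import client, config

def isolate_unhealthy_nodes():
    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        conditions = {c.type: c.status for c in (node.status.conditions or [])}
        # Treat a False "GpuHealthy" condition as a bad node (assumption).
        if conditions.get("GpuHealthy") == "False":
            # Cordon the node: the scheduler will place new pods elsewhere.
            v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
            print(f"cordoned {node.metadata.name}")

if __name__ == "__main__":
    isolate_unhealthy_nodes()
```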
Trainy retweeted
roanak @roanakb
ML engineers shouldn’t be wasting time debugging infrastructure — especially when H100s have a 25-30% fault rate. 🛠️ ML infrastructure should be able to handle bumps and bruises to the underlying hardware.
Trainy retweeted
roanak @roanakb
3/ One of the biggest value-adds of @TrainyAI's Konduktor platform is that we simplify this complexity. We abstract away network configurations, so you can launch multinode training with high-bandwidth networking across different clouds in the same way.
Trainy retweeted
roanak @roanakb
2/ At @TrainyAI, we've seen AI research teams lose over $10,000 trying to scale out due to misconfigured GPU fabrics. That's a costly mistake that can be avoided.
Trainy retweeted
roanak @roanakb
Setting up and validating GPU networking is a lot harder than you'd think. Here's why: 1/ GPU fabric technology varies a lot across cloud providers for the H100. For example, Google Cloud has TCP-X, while AWS uses EFA. Once you commit to one setup, it often locks you in.
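To make the fabric differences in this thread concrete: a hypothetical sketch of what "abstracting away network configuration" (point 3/ above) can look like, selecting NCCL-related environment variables per cloud so one launch path works on both EFA and TCP-X. The variable choices are illustrative assumptions, not Konduktor's actual settings:

```python
# Illustrative per-cloud NCCL setup (not Konduktor's implementation).
import os

NCCL_ENV = {
    "aws": {
        # EFA reaches NCCL via the aws-ofi-nccl plugin (libfabric).
        "FI_PROVIDER": "efa",
    },
    "gcp": {
        # TCP-X setups typically pin NCCL to the NICs dedicated to
        # GPU traffic; interface names here are placeholders.
        "NCCL_SOCKET_IFNAME": "eth1,eth2,eth3,eth4",
    },
}

def configure_nccl(cloud: str) -> None:
    """Apply per-cloud NCCL settings before launching a distributed job."""
    for key, value in NCCL_ENV.get(cloud, {}).items():
        os.environ[key] = value
```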
Trainy retweeted
roanak @roanakb
Chollet lays out the ARC-AGI benchmark, how it tests generalization abilities rather than memorization, and his thoughts on what kind of AI system will be necessary to improve on the SoTA. Watch here: youtube.com/watch?v=s7_Nlk…
Trainy retweeted
roanak @roanakb
3. Skill does not show intelligence. And displaying skill at any number of tasks does not show intelligence.
- This misguided view of intelligence is what causes our current form of benchmarking to be inadequate.
Trainy retweeted
roanak @roanakb
2. For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query that will break it.
- This ties into LLMs' inability to handle deviations from a pattern.
- Highlights the modern LLM's lack of robustness.
Trainy retweeted
roanak @roanakb
1. The core limitations of Transformer-based architectures have not changed in over 5 years.
- Inability to adapt to small deviations from memorized patterns.
- Weak, patchy generalization.
Trainy retweeted
roanak @roanakb
The latest Machine Learning Street Talk (MLST) episode, with François Chollet discussing inherent limitations of LLMs, was amazing. It was a breath of fresh air to hear some sound reasoning after all the usual Doomer/Acceleration talk on AGI. He makes some great points:
Trainy retweeted
roanak @roanakb
With the features above and more, AI teams using @TrainyAI's Konduktor platform get at least 2x the utilization out of their GPU cluster. Curious? Drop me a message or click here to check out our docs: konduktor.readthedocs.io/en/latest/inde….
Trainy retweeted
roanak @roanakb
3. Enhanced Observability: Our platform offers comprehensive dashboards that provide a clear view of cluster usage and performance. Metrics like SM Efficiency help you understand how effectively your GPUs are being used across different jobs and teams.
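SM Efficiency is an NVIDIA DCGM-style metric. Here is a minimal sketch of pulling it for a dashboard, assuming a dcgm-exporter plus Prometheus setup (a common stack, but not confirmed to be what Konduktor uses); `PROM_URL` is a placeholder:

```python
# Query per-GPU SM activity from Prometheus, assuming NVIDIA's
# dcgm-exporter is scraping the nodes (DCGM_FI_PROF_SM_ACTIVE is
# its SM utilization metric). The endpoint below is hypothetical.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder endpoint

def sm_activity():
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": "avg by (gpu) (DCGM_FI_PROF_SM_ACTIVE)"},
        timeout=10,
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        gpu = series["metric"].get("gpu", "?")
        _, value = series["value"]
        print(f"GPU {gpu}: SM active {float(value):.0%}")

if __name__ == "__main__":
    sm_activity()
```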
Trainy retweeted
roanak @roanakb
2. Minimize Downtime Disruptions: Traditional setups require manual intervention if a job fails. With H100 GPUs, hardware faults are quite frequent (~30% fault rate). Konduktor detects hardware issues on failure, resumes jobs on healthy GPUs, and alerts your provider with logs.
Trainy retweeted
roanak @roanakb
@TrainyAI's Konduktor platform is here to change that. 1. Maximize GPU Utilization: With Konduktor, engineers can queue up a large number of jobs of varying priorities on their GPU cluster. This means P0 workloads get run first, and your GPUs keep crunching numbers 24/7.
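An illustrative sketch of the priority-queueing behavior described above (lower number = higher priority, FIFO within a priority level). This is a toy model of the idea, not Konduktor's scheduler:

```python
# Toy priority job queue: P0 jobs always dequeue before P1/P2 work,
# so the highest-priority workloads run first and GPUs stay busy.
import heapq
import itertools

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def submit(self, priority: int, name: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self) -> str | None:
        return heapq.heappop(self._heap)[2] if self._heap else None

q = JobQueue()
q.submit(2, "hyperparam-sweep")
q.submit(0, "prod-finetune")   # P0: jumps the queue
q.submit(1, "eval-run")
assert q.next_job() == "prod-finetune"
```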