Trainy

61 posts


@TrainyAI

Building open source tools for distributed training.

Joined June 2023
30 Following · 63 Followers
Trainy retweeted
roanak @roanakb
NeptuneAI shuts down March 5th. @TrainyAI just launched Pluto on @ycombinator, a drop-in replacement so you don't lose years of experiment data. Swap one import. Dual-log to validate. Export your history. Open source. On Neptune's official transition hub. ycombinator.com/launches/PLM-p…
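The migration path in the launch above ("swap one import, dual-log to validate") suggests a Neptune-compatible API. Below is a minimal sketch of what dual-logging during the transition might look like, assuming a hypothetical `pluto` module that mirrors Neptune's run interface; this is illustrative, not verified against the actual release:

```python
# Hedged sketch of "swap one import, dual-log to validate".
# The `pluto` module name and its Neptune-compatible surface are
# assumptions based on the launch post, not verified API.
import neptune                 # existing logger, shutting down
import pluto as new_logger     # hypothetical drop-in replacement

old_run = neptune.init_run(project="team/project")
new_run = new_logger.init_run(project="team/project")

for step, loss in enumerate([0.9, 0.7, 0.5]):
    # Dual-log the same metrics to both backends to validate parity
    # before cutting over entirely.
    old_run["train/loss"].append(loss)
    new_run["train/loss"].append(loss)

old_run.stop()
new_run.stop()
```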
Trainy retweeted
roanak @roanakb
@TrainyAI's Konduktor platform helps bring the benefits of a leading research team to your GPU cluster. We provide a fault-tolerant scheduler, integrated observability, and more. Check out our docs: konduktor.readthedocs.io/en/latest/
Trainy retweeted
roanak @roanakb
This leads to significantly higher (>80%) GPU usage. Add in some fault tolerance to the infrastructure, and we see:
- No more manual restarts at 2am.
- ML engineers get to focus on their jobs, rather than becoming DevOps experts.
Trainy retweeted
roanak @roanakb
Top-tier AI research teams (Meta, OpenAI, etc.) have figured out the most efficient way to work with a cluster of GPUs. Instead of managing each GPU separately, they create pools of GPU nodes and let sophisticated schedulers manage GPU availability efficiently.
Trainy retweeted
roanak @roanakb
At @TrainyAI, we built a controller within Konduktor to monitor GPU node health and isolate unhealthy nodes. This way, if a job fails, no manual intervention is required. K8s does its magic of placing work only on healthy nodes, and we forward the relevant GPU/NCCL logs to your CSP. 🚀
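A minimal sketch of the isolation step described above, using the official Kubernetes Python client. The `GpuHealthy` node condition is a hypothetical stand-in for whatever health signal (DCGM, node-problem-detector, etc.) a real controller would consume; this is not Konduktor's actual code:

```python
# Sketch of a node-isolation loop with the Kubernetes Python client.
# The "GpuHealthy" condition name is hypothetical.
from kubernetes import client, config

def isolate_unhealthy_nodes():
    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        conditions = {c.type: c.status for c in (node.status.conditions or [])}
        # Treat a False "GpuHealthy" condition as a bad node (assumption).
        if conditions.get("GpuHealthy") == "False":
            # Cordon the node: the scheduler will place new pods elsewhere.
            v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
            print(f"cordoned {node.metadata.name}")

if __name__ == "__main__":
    isolate_unhealthy_nodes()
```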
Trainy retweeted
roanak @roanakb
ML engineers shouldn’t be wasting time debugging infrastructure — especially when H100s have a 25-30% fault rate. 🛠️ ML infrastructure should be able to handle bumps and bruises to the underlying hardware.
Trainy retweeted
roanak @roanakb
3/ One of the biggest value-adds of @TrainyAI's Konduktor platform is that we simplify this complexity. We abstract away network configurations, so you can launch multinode training with high-bandwidth networking across different clouds in the same way.
Trainy retweeted
roanak @roanakb
2/ At @TrainyAI, we've seen AI research teams lose over $10,000 trying to scale out due to misconfigured GPU fabrics. That's a costly mistake that can be avoided.
Trainy retweeted
roanak @roanakb
Setting up and validating GPU networking is a lot harder than you'd think. Here's why: 1/ GPU fabric technology varies a lot across cloud providers for the H100. For example, Google Cloud has TCP-X, while AWS uses EFA. Once you commit to one setup, it often locks you in.
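To make the fabric differences in this thread concrete: a hypothetical sketch of what "abstracting away network configuration" (point 3/ above) can look like, selecting NCCL-related environment variables per cloud so one launch path works on both EFA and TCP-X. The variable choices are illustrative assumptions, not Konduktor's actual settings:

```python
# Illustrative per-cloud NCCL setup (not Konduktor's implementation).
import os

NCCL_ENV = {
    "aws": {
        # EFA reaches NCCL via the aws-ofi-nccl plugin (libfabric).
        "FI_PROVIDER": "efa",
    },
    "gcp": {
        # TCP-X setups typically pin NCCL to the NICs dedicated to
        # GPU traffic; interface names here are placeholders.
        "NCCL_SOCKET_IFNAME": "eth1,eth2,eth3,eth4",
    },
}

def configure_nccl(cloud: str) -> None:
    """Apply per-cloud NCCL settings before launching a distributed job."""
    for key, value in NCCL_ENV.get(cloud, {}).items():
        os.environ[key] = value
```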
Trainy retweeted
roanak @roanakb
Chollet lays out the ARC-AGI benchmark, how it tests generalization abilities rather than memorization, and his thoughts on what kind of AI system will be necessary to improve on the SoTA. Watch here: youtube.com/watch?v=s7_Nlk…
Trainy retweeted
roanak @roanakb
3. Skill does not show intelligence. And displaying skill at any number of tasks does not show intelligence.
- This misguided view of intelligence is what causes our current form of benchmarking to be inadequate.
Trainy retweeted
roanak @roanakb
2. For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query that will break it.
- This ties into LLMs' inability to handle deviations from a pattern.
- Highlights the modern LLM's lack of robustness.
Trainy retweeted
roanak @roanakb
1. The core limitations of Transformer-based architectures have not changed in over 5 years.
- Inability to adapt to small deviations from memorized patterns.
- Weak, patchy generalization.
Trainy retweeted
roanak @roanakb
The latest Machine Learning Street Talk (MLST) episode, with François Chollet discussing inherent limitations of LLMs, was amazing. It was a breath of fresh air to hear some sound reasoning after all the usual Doomer/Acceleration talk on AGI. He makes some great points:
Trainy retweeted
roanak @roanakb
With the features above and more, AI teams using @TrainyAI's Konduktor platform get at least 2x the utilization out of their GPU cluster. Curious? Drop me a message or click here to check out our docs: konduktor.readthedocs.io/en/latest/inde….
Trainy retweeted
roanak @roanakb
3. Enhanced Observability: Our platform offers comprehensive dashboards that provide a clear view of cluster usage and performance. Metrics like SM Efficiency help you understand how effectively your GPUs are being used across different jobs and teams.
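SM Efficiency is an NVIDIA DCGM-style metric. Here is a minimal sketch of pulling it for a dashboard, assuming a dcgm-exporter plus Prometheus setup (a common stack, but not confirmed to be what Konduktor uses); `PROM_URL` is a placeholder:

```python
# Query per-GPU SM activity from Prometheus, assuming NVIDIA's
# dcgm-exporter is scraping the nodes (DCGM_FI_PROF_SM_ACTIVE is
# its SM utilization metric). The endpoint below is hypothetical.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder endpoint

def sm_activity():
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": "avg by (gpu) (DCGM_FI_PROF_SM_ACTIVE)"},
        timeout=10,
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        gpu = series["metric"].get("gpu", "?")
        _, value = series["value"]
        print(f"GPU {gpu}: SM active {float(value):.0%}")

if __name__ == "__main__":
    sm_activity()
```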
Trainy retweeted
roanak @roanakb
2. Minimize Downtime Disruptions: Traditional setups require manual intervention if a job fails. With H100 GPUs, hardware faults are quite frequent (~30% fault rate). Konduktor detects hardware issues on failure, resumes jobs on healthy GPUs, and alerts your provider with logs.
Trainy retweeted
roanak @roanakb
@TrainyAI's Konduktor platform is here to change that. 1. Maximize GPU Utilization: With Konduktor, engineers can queue up a large number of jobs of varying priorities on their GPU cluster. This means P0 workloads get run first, and your GPUs keep crunching numbers 24/7.
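An illustrative sketch of the priority-queueing behavior described above (lower number = higher priority, FIFO within a priority level). This is a toy model of the idea, not Konduktor's scheduler:

```python
# Toy priority job queue: P0 jobs always dequeue before P1/P2 work,
# so the highest-priority workloads run first and GPUs stay busy.
import heapq
import itertools

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def submit(self, priority: int, name: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self) -> str | None:
        return heapq.heappop(self._heap)[2] if self._heap else None

q = JobQueue()
q.submit(2, "hyperparam-sweep")
q.submit(0, "prod-finetune")   # P0: jumps the queue
q.submit(1, "eval-run")
assert q.next_job() == "prod-finetune"
```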