ZD1908

3.5K posts

@ZDi____

🇦🇷 25M; ML Text-to-Speech/Audio, C++/Qt / DMs open

Latent space · Joined June 2024
424 Following · 302 Followers
Pinned Tweet
ZD1908 @ZDi____
Releasing Brontes: a modified Wave U-Net architecture for audio super-resolution. This one is trained to operate on NeuCodec outputs. I'm releasing a general 30M-parameter checkpoint trained on a variety of speech. See links in replies. All trained on an MI300X, thanks to @HotAisle and @AIatAMD.
Replies 3 · Reposts 1 · Likes 8 · Views 1K
ZD1908 @ZDi____
If you're cold emailing people, you have to prepend [Not AI slop] to the subject line, and write the whole thing with your own two hands.
Replies 0 · Reposts 0 · Likes 0 · Views 22
ZD1908 @ZDi____
Hacker News is such a primitive platform it hurts. Even 4chan has autorefresh.
Replies 0 · Reposts 0 · Likes 0 · Views 23
ZD1908 @ZDi____
@Wolvan1 Bitwise Operator's shaking that voluminous derrière.
Replies 1 · Reposts 0 · Likes 0 · Views 10
ZD1908 @ZDi____
@ad0rnai
>terminally offline
>on X (formerly Twitter)
Do you also go to butcher shops looking for vegans?
Replies 1 · Reposts 0 · Likes 7 · Views 478
Lan @ad0rnai
I am hiring someone who is:
- terminally offline
- not in any group chats
- has covered the curriculum of the great books program
- idiosyncratic
- slightly off-putting
Replies 40 · Reposts 5 · Likes 377 · Views 14.1K
ZD1908 @ZDi____
All of this was iterated on from start to end on a single AMD Instinct MI300X over ~5 days, thanks to @HotAisle and @AIatAMD. Audio quality isn't the best, but there's only so much one can do with so few parameters. Scaling up will be key.
Replies 0 · Reposts 1 · Likes 1 · Views 254
ZD1908 @ZDi____
VITS EVOlution, my TTS model:
1. ~31M-parameter TTS model and speaker encoder in ONNX format; faster than real time on CPU
2. Natively outputs 48 kHz audio
3. Voice cloning, or voice blending: mix two or more speakers to make a new voice!
4. Apache 2.0 licensed, use anywhere without worry
Links in replies:
Replies 1 · Reposts 1 · Likes 4 · Views 120
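Voice blending (item 3 above) can be pictured as mixing speaker embeddings. The sketch below assumes, without confirmation from the VITS EVOlution release, that the speaker encoder emits fixed-size vectors and that a blend is a weighted average renormalized to unit length; `blend_speakers` and `l2_normalize` are hypothetical helpers, not the model's actual API.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend_speakers(embeddings: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Hypothetical voice blend: weighted mean of unit-length speaker
    embeddings, renormalized so the result is unit length too."""
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    mixed = [0.0] * dim
    for emb, weight in zip(embeddings, weights):
        # Normalize first so no input dominates just by its norm.
        unit = l2_normalize(emb)
        for i in range(dim):
            mixed[i] += (weight / total) * unit[i]
    return l2_normalize(mixed)


if __name__ == "__main__":
    speaker_a = [1.0, 0.0, 0.0]  # toy 3-D "embeddings"; real ones are larger
    speaker_b = [0.0, 1.0, 0.0]
    print(blend_speakers([speaker_a, speaker_b], [0.5, 0.5]))
```

With equal weights on two orthogonal toy embeddings, the blend points halfway between them (both nonzero components are about 0.707). Renormalizing keeps the blend at the same length as the inputs, which matters if the downstream model only ever saw unit-length embeddings during training (again an assumption).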
ZD1908 @ZDi____
My model is working and ready for release tomorrow but I came down with a cold today.
Replies 0 · Reposts 0 · Likes 0 · Views 72
alli @sonofalli
reporting to a middle-aged girl dad will change your life
Replies 70 · Reposts 279 · Likes 9.1K · Views 1.3M
ZD1908 @ZDi____
@giffmana Bigger models, plus a switch from mostly convnet-based architectures (usually U-Net: locally coherent, globally weak) to DiT (locally and globally strong), although newer conv architectures are starting to emerge. x.com/miru_why/statu…
miru @miru_why

Reviving ConvNeXt for Efficient Convolutional Diffusion Models
github.com/star-kwon/FCDM
arxiv.org/abs/2603.09408…
The authors propose an improved ConvNeXt-based diffusion model architecture that reportedly matches DiT-XL/2 quality with 7x fewer training steps.

Replies 0 · Reposts 0 · Likes 28 · Views 6.1K
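The "locally coherent, globally weak" point can be made concrete with receptive fields. A minimal sketch: the function below computes how many input positions one output position of a pure convolution stack can see, using the standard recurrence. The layer configuration and image width are hypothetical, and the sketch ignores the self-attention blocks that many diffusion U-Nets add at low resolutions.

```python
from typing import Sequence, Tuple


def receptive_field(layers: Sequence[Tuple[int, int]]) -> int:
    """Receptive field, in input positions, of a stack of convolutions.

    Each layer is (kernel_size, stride). Standard recurrence:
    field += (kernel_size - 1) * jump, where jump is the product of the
    strides of all earlier layers (the input-space spacing between
    neighbouring positions at that depth).
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    # Hypothetical U-Net encoder: 4 levels, each two 3-tap convs then a
    # stride-2 downsample (kernel 2). No attention blocks.
    encoder = [(3, 1), (3, 1), (2, 2)] * 4
    width = 256  # hypothetical image width in pixels
    print(f"conv encoder: {receptive_field(encoder)} of {width} pixels per axis")
    print(f"one self-attention layer over patches: all {width}")
```

For this hypothetical encoder the first line reports 76 of 256 pixels per axis: each output only sees a local window that grows with depth, whereas a single self-attention layer over DiT patch tokens connects every patch to every other.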
Lucas Beyer (bl16) @giffmana
I have a question about last year's image-generation progress; I wonder what y'all think. How did we go from all models consistently getting fingers wrong to all models consistently getting them right? This "flip" seems to have happened across basically all companies/models at about the same time. Even "random" non-frontier papers seem to get it right? Or do they just cherry-pick the figures?
Replies 84 · Reposts 15 · Likes 483 · Views 109K
ZD1908 @ZDi____
Using DistilHuBERT features as a speaker encoder also failed. Time to throw GPT 5.4 at the Resemblyzer repo and have it modernize the pipeline.
Replies 0 · Reposts 0 · Likes 0 · Views 54
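One common way to use frame-level self-supervised features such as DistilHuBERT's as a speaker embedding is to average them over time. A minimal sketch, assuming the features have already been extracted as a T x D list of frames; the extraction step, the layer choice, and the helper name `pool_utterance` are assumptions, not the author's pipeline.

```python
import math
from typing import List, Sequence


def pool_utterance(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool frame-level features (T frames x D dims) over time and
    L2-normalize, giving one fixed-size vector per utterance."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    num_frames = len(frames)
    mean = [sum(frame[i] for frame in frames) / num_frames for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("mean feature vector is all zeros")
    return [x / norm for x in mean]


if __name__ == "__main__":
    # Toy 3-frame, 2-dim "features"; real DistilHuBERT frames are much wider.
    print(pool_utterance([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]))
```

Mean pooling discards how the features vary over time; x-vector-style systems often append the per-dimension standard deviation as well (statistics pooling).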
ZD1908 retweeted
RoyalCities @RoyalCities
After months of work, today I’m releasing Foundation-1, a SOTA text-to-sample model built specifically for music production workflows. It may also be the most advanced AI sample generator currently available, open or closed.
• ~7 GB VRAM
• Entirely local
• 100% free 😁
Replies 81 · Reposts 148 · Likes 1.3K · Views 106.9K
ZD1908 @ZDi____
I should get into rocketry; I want to make guided missiles for delivering medical supplies.
Replies 0 · Reposts 0 · Likes 0 · Views 32
Alfauz @Alfauz19767861
My understanding of Peronism
Replies 102 · Reposts 201 · Likes 2.9K · Views 212K
ZD1908 @ZDi____
Well... I was going to release a zero-shot model today, but my speaker encoder sucks and collapses every voice into an audiobook reader, because it was trained on a small amount of data. I'm going to use DistilHuBERT features instead.
Replies 0 · Reposts 0 · Likes 1 · Views 91
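A quick way to check the collapse described above is to embed one clip per speaker and measure how similar the embeddings are to each other: if every voice maps to roughly the same point, the mean pairwise cosine similarity approaches 1.0. A minimal sketch with hypothetical helper names, assuming one embedding per distinct speaker:

```python
import itertools
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over every pair of embeddings.

    Feed one embedding per *different* speaker: values near 1.0 mean the
    encoder maps distinct voices to almost the same point (collapse).
    """
    pairs = list(itertools.combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    collapsed = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.05, 0.05]]
    print(mean_pairwise_cosine(healthy), mean_pairwise_cosine(collapsed))
```

What counts as "near 1.0" depends on the encoder; comparing against the same number from a known-good encoder on the same clips is more informative than a fixed threshold.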