粵語計算語言學基礎建設組 CanCLID
95 posts

@Can_CLID
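Dedicated to promoting written Cantonese and Jyutping, developing Cantonese NLP technology, and building Cantonese corpora and teaching resources. Contact email: [email protected] Cantonese Computational Linguistics Infrastructure Development Workgroup
Canton, Hong Kong, San Francisco · Joined January 2022
21 Following · 445 Followers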

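CanCLID's latest work, 粵語辭叢, a website that aggregates Cantonese dictionaries, is now officially online: jyutjyu.com
It currently covers 11 Cantonese dictionaries with more than 260,000 entries in total, and more will be added over time. Error reports and feedback are welcome. We hope it helps Cantonese teachers and students all over the world 🥳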

Dataset download page: huggingface.co/datasets/CanCL…
Project homepage: canclid.github.io/zoengjyutgaai/ Please share and spread the word.

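Another major update to the Zoeng Jyut Gaai (張悦楷) speech dataset: we have added 75.71 hours of data from The Deer and the Cauldron (《鹿鼎記》), bringing the total to 188.25 hours! As a showcase, we trained a Zoeng Jyut Gaai TTS system, and anyone with a Hugging Face account can play with it for free! huggingface.co/spaces/laubong…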
粵語計算語言學基礎建設組 CanCLID retweeted

😼 SMOL DATA ALERT! 😼 Announcing SMOL, a professionally-translated dataset for 115 very low-resource languages! Paper: arxiv.org/pdf/2502.12301
Huggingface: huggingface.co/datasets/googl…


The last subset of the Zoeng Jyut Gaai speech dataset, The Final Days of Mao Ze Dong (《走進毛澤東的黃昏歲月》), is now fully uploaded. Together with the earlier Romance of the Three Kingdoms and Water Margin subsets, we now have 112 hours of high-quality speech data in total!
huggingface.co/datasets/CanCL…

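The Zoeng Jyut Gaai dataset gets its biggest update yet: we have added 38.62 hours of Zoeng Jyut Gaai narrating Water Margin (《水滸傳》). Together with the existing Romance of the Three Kingdoms data, the total duration reaches 104.64 hours! The HF repository has also been officially renamed to CanCLID/zoengjyutgaai
huggingface.co/datasets/CanCL…
The homepage has also been updated with the latest statistics: canclid.github.io/zoengjyutgaai/
Please share and support us so that we can keep producing high-quality datasets!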
粵語計算語言學基礎建設組 CanCLID retweeted

I contributed a chapter titled "Ideologically Driven Divergence in Cantonese Vernacular Writing Practices" to J-F Dupré's forthcoming book "The Politics of Language in Hong Kong", releasing Dec 2024. It is part of a new book series on Hong Kong research.
routledge.com/9781032648453


As an example use case, this is a TTS (speech synthesis) model trained on this dataset. You can have anything you like read aloud in Uncle Gaai's (Zoeng Jyut Gaai's) voice!
huggingface.co/spaces/laubong…

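The Zoeng Jyut Gaai storytelling speech dataset is officially complete! It contains 66 hours of high-quality Cantonese speech data in total. Even if you are not building AI technology, you can download the webm audio and srt subtitles directly and simply listen to them as stories. The data can also be used for linguistic and literary research. Dataset homepage:
canclid.github.io/zoengjyutgaai/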

Free Cantonese subtitle (SRT) generator! More accurate than Subanana! Please share and spread the word!
huggingface.co/spaces/laubong…

@feilung_lau The steps are in the README: first, your computer needs to have Python installed; then run pip install -r requirements.txt; then run python cli.py input.mp3

The best Cantonese subtitle generator currently available: give it an audio file (.mp3, .wav, etc.) and it automatically produces an SRT file. Free and open source, and more accurate than Subanana! Outside contributions and feedback are welcome!
github.com/hon9kon9ize/yu…
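For readers unfamiliar with the SRT output mentioned above, here is a minimal Python sketch of how timed text segments can be written out in SRT form. It is not the generator's actual code: the function names `srt_timestamp` and `to_srt`, and the sample segments, are hypothetical and chosen only for illustration. It assumes each segment is a (start seconds, end seconds, text) tuple, as a speech recogniser might return.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as the contents of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: a 1-based counter, a time range line, then the subtitle text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Hypothetical segments, made up for this example.
example = [(0.0, 2.5, "大家好"), (2.5, 5.04, "歡迎收聽")]
print(to_srt(example))
```

Working in integer milliseconds before splitting into hours, minutes and seconds avoids floating-point drift in the timestamps.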

@si_pbc What are the language distribution and domain distribution of these 20 million hours of data? Is it English only or multilingual? Is it all podcasts from YouTube, or does it include music, background noise, etc. as well?

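This dataset is drawn from Romance of the Three Kingdoms (《三國演義》), the best-known work of Zoeng Jyut Gaai (張悦楷), the late storyteller who was the most famous in Guangzhou: zh.wikipedia.org/wiki/%E5%BC%A0… The dataset is not only useful for TTS; it can also serve as an ASR test set or as pre-training data for speech models, for example github.com/AlienKevin/can…. Our goal is to process all 157 episodes, more than 70 hours of recordings in total. Your support is much appreciated!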

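CanCLID's latest work makes its debut: Zoeng Jyut Gaai narrating Romance of the Three Kingdoms, currently the only free and open-source Cantonese TTS dataset on the internet: huggingface.co/datasets/laubo…
Work on this dataset has only just begun, so it currently has just 55 minutes of audio. If you would like to help speed up its expansion, please contact us!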




