粵語計算語言學基礎建設組 CanCLID

95 posts

@Can_CLID

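Dedicated to promoting written Cantonese and Jyutping, developing Cantonese NLP technology, and building Cantonese corpora and teaching resources. Contact email: [email protected] Cantonese Computational Linguistics Infrastructure Development Workgroup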

Canton, Hong Kong, San Francisco · Joined January 2022

21 Following · 445 Followers

Pinned Tweet

粵語計算語言學基礎建設組 CanCLID@Can_CLID·30 Apr

It has come to our attention that many researchers and developers are confused by the existence of both zh-hk (Hong Kong Chinese) and yue (Cantonese) on Common Voice, and don't know which one to use. Our short answer is: use yue, and do NOT use zh-hk. Reasons below.

粵語計算語言學基礎建設組 CanCLID@Can_CLID·24 Mar

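CanCLID's latest project, the Cantonese dictionary aggregator 粵語辭叢, is now live at jyutjyu.com. It currently covers 11 Cantonese dictionaries with more than 260,000 entries in total, and more will be added over time. Corrections and feedback are welcome. We hope it helps Cantonese teachers and students around the world 🥳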

粵語計算語言學基礎建設組 CanCLID@Can_CLID·1 Sep

Dataset download page: huggingface.co/datasets/CanCL… Project homepage: canclid.github.io/zoengjyutgaai/ Feel free to share and spread the word.

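粵語計算語言學基礎建設組 CanCLID@Can_CLID·1 Sep

The Zoeng Jyut Gaai speech dataset has had another major update: 75.71 hours of newly added 《鹿鼎記》 (The Deer and the Cauldron) data bring the total to 188.25 hours! As a demo, we trained a Zoeng Jyut Gaai TTS system that anyone with a Hugging Face account can try for free! huggingface.co/spaces/laubong…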

粵語計算語言學基礎建設組 CanCLID@Can_CLID·16 Jun

Our Zoeng Jyut Gaai speech dataset had 126k downloads last month😱🤩🥳 It is one of the top-100 most downloaded datasets on Hugging Face! We appreciate everyone's support, and more updates are on the way!

粵語計算語言學基礎建設組 CanCLID retweeted

iseeaswell꩜bʂky@iseeaswell·19 Feb

😼SMOL DATA ALERT! 😼Announcing SMOL, a professionally-translated dataset for 115 very low-resource languages! Paper: arxiv.org/pdf/2502.12301 Huggingface: huggingface.co/datasets/googl…

粵語計算語言學基礎建設組 CanCLID@Can_CLID·15 Feb

This dataset cost us about US$5,000. The money was spent on hiring annotators and buying tools, and we have run out of budget. If you are interested in sponsoring or donating to CanCLID to create more datasets, such as The Heaven Sword and Dragon Saber, please reach out to us!

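粵語計算語言學基礎建設組 CanCLID@Can_CLID·15 Feb

This dataset cost nearly US$5,000 in total, spent on hiring annotators and on building and buying tools. The project budget is now used up, so we are not adding new data for the time being. If you would like to sponsor or donate to CanCLID so that we can keep releasing more high-quality datasets, for example adding 《倚天屠龍記》 (The Heaven Sword and Dragon Saber) and 《鹿鼎記》 (The Deer and the Cauldron), please DM us!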

粵語計算語言學基礎建設組 CanCLID@Can_CLID·15 Feb

The last subset of the Zoeng Jyut Gaai speech dataset, 《走進毛澤東的黃昏歲月》 (The Final Days of Mao Ze Dong), is now fully uploaded. Together with the earlier Romance of the Three Kingdoms and Water Margin subsets, we now have 112 hours of high-quality speech data! huggingface.co/datasets/CanCL…

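粵語計算語言學基礎建設組 CanCLID@Can_CLID·20 Jan

The Zoeng Jyut Gaai dataset gets its biggest update yet: 38.62 hours of Zoeng Jyut Gaai narrating 《水滸傳》 (Water Margin) have been added. Together with the existing Romance of the Three Kingdoms data, the total duration reaches 104.64 hours! The HF repository has also been officially renamed CanCLID/zoengjyutgaai huggingface.co/datasets/CanCL… The homepage now includes the latest statistics canclid.github.io/zoengjyutgaai/ Please share and support us so that we can keep producing high-quality datasets!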

粵語計算語言學基礎建設組 CanCLID retweeted

Chaakming Lau@chaakming·31 Jul

I contributed a chapter titled "Ideologically Driven Divergence in Cantonese Vernacular Writing Practices" to J-F Dupré's forthcoming book "The Politics of Language in Hong Kong", releasing Dec 2024. It is part of a new book series on Hong Kong research. routledge.com/9781032648453

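粵語計算語言學基礎建設組 CanCLID@Can_CLID·20 Nov

The Hugging Face repository contains: 1. all source audio as webm 2. the matching srt subtitles for each episode 3. wav files segmented by subtitle and resampled, ready to use directly as training data 4. metadata.csv, the master data file compiled from the subtitles. If you don't know how to download with git or Hugging Face, feel free to leave a comment and ask.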
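
As a rough illustration of how subtitle timings could drive segmentation like the one described in this tweet, here is a minimal Python sketch that parses standard SRT text into (start, end, text) cues. It is not CanCLID's actual pipeline; the names `Cue` and `parse_srt` are hypothetical, and the actual audio cutting and resampling are left out.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One subtitle cue (hypothetical helper type, not from the dataset repo)."""
    start: float  # seconds from the start of the episode
    end: float    # seconds from the start of the episode
    text: str


# Standard SRT timestamps look like 00:01:02,345 (some tools write a dot).
_STAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(parts) -> float:
    hours, minutes, seconds, millis = (int(p) for p in parts)
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues: List[Cue] = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.strip().splitlines()
        timing_index = next(
            (i for i, line in enumerate(lines) if "-->" in line), None
        )
        if timing_index is None:
            continue
        stamps = _STAMP.findall(lines[timing_index])
        if len(stamps) < 2:
            continue
        text = " ".join(lines[timing_index + 1:]).strip()
        cues.append(Cue(_to_seconds(stamps[0]), _to_seconds(stamps[1]), text))
    return cues


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,500\nfirst line\n\n"
        "2\n00:00:03,500 --> 00:00:06,250\nsecond line\n"
    )
    for cue in parse_srt(sample):
        print(f"{cue.start:.3f}s to {cue.end:.3f}s: {cue.text}")
```

Each cue's start and end could then be handed to an audio tool to cut the matching clip from the source file.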

粵語計算語言學基礎建設組 CanCLID@Can_CLID·20 Nov

As an example use case, this is a TTS (speech synthesis) demo trained on this dataset. You can have anything you like read aloud in Zoeng Jyut Gaai's voice! huggingface.co/spaces/laubong…

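粵語計算語言學基礎建設組 CanCLID@Can_CLID·20 Nov

The Zoeng Jyut Gaai storytelling speech dataset is officially complete! It has 66 hours of high-quality Cantonese speech data in total. Even if you are not building AI technology, you can download the webm files and srt subtitles and simply listen to them as stories. The data can also be used for linguistic and literary research. Dataset homepage: canclid.github.io/zoengjyutgaai/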

粵語計算語言學基礎建設組 CanCLID@Can_CLID·10 Nov

Free Cantonese subtitle (SRT) generator! More accurate than Subanana! Please share and spread the word! huggingface.co/spaces/laubong…

粵語計算語言學基礎建設組 CanCLID@Can_CLID·9 Nov

@feilung_lau The steps are in the README: first make sure Python is installed on your computer, then run pip install -r requirements.txt, then run python cli.py input.mp3

劉飛龍@feilung_lau·9 Nov

@Can_CLID How do I run this? I downloaded the project as an archive, but I don't know how to run it.

粵語計算語言學基礎建設組 CanCLID@Can_CLID·9 Nov

The best Cantonese subtitle generator available right now: give it audio (.mp3, .wav, etc.) and it automatically outputs an SRT file. Free and open source, and more accurate than Subanana! Outside contributions and feedback are welcome! github.com/hon9kon9ize/yu…
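
For readers unfamiliar with the SRT output mentioned in this tweet, the following Python sketch shows how timed transcript segments could be serialized into SRT text. It only illustrates the file format and is not the generator's own code; `format_timestamp` and `to_srt` are hypothetical names, and the segments are assumed to arrive as (start_seconds, end_seconds, text) tuples from some recognizer.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Serialize (start, end, text) segments as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "hello"), (1.25, 2.0, "world")]))
```

The result can be written to a `.srt` file with UTF-8 encoding so that Cantonese characters survive intact.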

粵語計算語言學基礎建設組 CanCLID@Can_CLID·4 Nov

@si_pbc What are the language distribution and domain distribution of these 20 million hours of data? Is it English only or multilingual? Is it all podcasts from YouTube, or does it include music, background noise, etc. as well?

Standard Intelligence@si_pbc·4 Nov

Hertz-dev is an 8.5B parameter transformer trained on 20 million unique hours of high-quality audio data. We’ve released checkpoints and code for both mono and full-duplex generation on our website under the Apache license.

Standard Intelligence@si_pbc·4 Nov

At Standard Intelligence we’ve been researching scalable cross-modality learning. We’re excited to share some early results in the form of 𝗵𝗲𝗿𝘁𝘇-𝗱𝗲𝘃, an open-source, first-of-its-kind base model for full-duplex conversational audio. 1/

粵語計算語言學基礎建設組 CanCLID@Can_CLID·19 Sep

commonvoice.mozilla.org/yue/datasets

粵語計算語言學基礎建設組 CanCLID@Can_CLID·19 Sep

Common Voice 19.0 has been released, and Cantonese now has 209 validated hours of recordings! Thank you all for your support! This growth in data will soon show up in downstream speech applications, and we look forward to more high-quality Cantonese speech models!

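粵語計算語言學基礎建設組 CanCLID@Can_CLID·13 Aug

This dataset is drawn from 《三國演義》 (Romance of the Three Kingdoms), the best-known work of Zoeng Jyut Gaai, the late storyteller who was the most famous in Guangzhou zh.wikipedia.org/wiki/%E5%BC%A0… Besides TTS, it can also serve as an ASR test set or as pretraining data for speech models, for example github.com/AlienKevin/can…. Our goal is to process all 157 episodes, more than 70 hours of recordings in total. Please support us!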

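粵語計算語言學基礎建設組 CanCLID@Can_CLID·13 Aug

CanCLID's latest project, currently the only free and open-source Cantonese TTS dataset on the internet, makes its debut: Zoeng Jyut Gaai narrating Romance of the Three Kingdoms huggingface.co/datasets/laubo… Work on this dataset has only just started, so it currently has just 55 minutes. If you want to help speed up its expansion, please contact us!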
