粵語計算語言學基礎建設組 CanCLID

95 posts

@Can_CLID

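Dedicated to promoting written Cantonese and Jyutping, developing Cantonese NLP technology, and building Cantonese corpora and teaching resources. Contact email: [email protected] Cantonese Computational Linguistics Infrastructure Development Workgroup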

Canton, Hong Kong, San Francisco · Joined January 2022
21 Following · 445 Followers
Pinned Tweet
粵語計算語言學基礎建設組 CanCLID
It has come to our attention that many researchers and developers are confused by the existence of both zh-hk (Hong Kong Chinese) and yue (Cantonese) on Common Voice, and don't know which one to use. Our short answer is: use yue, and do NOT use zh-hk. Reasons below.
2 replies · 7 retweets · 24 likes · 2.3K views
粵語計算語言學基礎建設組 CanCLID
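CanCLID's latest work, the Cantonese dictionary aggregator site 粵語辭叢, is now officially live at jyutjyu.com. It currently includes 11 Cantonese dictionaries with more than 260,000 entries in total, and more will be added over time. Corrections and feedback are welcome. We hope it helps Cantonese teachers and students around the world 🥳
0 replies · 8 retweets · 24 likes · 778 views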
粵語計算語言學基礎建設組 CanCLID
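The Zoeng Jyut Gaai speech dataset has had another major update: we added 75.71 hours of data from The Deer and the Cauldron (《鹿鼎記》), bringing the total to 188.25 hours! As a demonstration, we trained a Zoeng Jyut Gaai TTS system, which anyone with a Hugging Face account can try for free: huggingface.co/spaces/laubong…
1 reply · 0 retweets · 6 likes · 232 views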
粵語計算語言學基礎建設組 CanCLID
Our Zoeng Jyut Gaai speech dataset had 126k downloads last month 😱🤩🥳 That makes it one of the top 100 most downloaded datasets on Hugging Face! We appreciate everyone's support, and more updates are on the way!
0 replies · 2 retweets · 10 likes · 778 views
粵語計算語言學基礎建設組 CanCLID
This dataset cost us about US$5,000. The money was spent on hiring annotators and buying tools, and we have run out of budget. If you are interested in sponsoring or donating to CanCLID to create more datasets, such as The Heaven Sword and Dragon Saber, please reach out to us!
0 replies · 0 retweets · 1 like · 186 views
粵語計算語言學基礎建設組 CanCLID
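This dataset cost close to US$5,000 in total, spent on hiring annotators and on building and buying tools. The project budget is now used up, so we are not adding new data for the time being. If you would like to sponsor or donate to CanCLID so that we can keep producing high-quality datasets, for example by adding The Heaven Sword and Dragon Saber (《倚天屠龍記》) or The Deer and the Cauldron (《鹿鼎記》), please send us a direct message!
1 reply · 0 retweets · 1 like · 227 views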
粵語計算語言學基礎建設組 CanCLID
The last subset of the Zoeng Jyut Gaai speech dataset, The Final Days of Mao Ze Dong (《走進毛澤東的黃昏歲月》), is now fully uploaded. Together with the earlier Romance of the Three Kingdoms and Water Margin subsets, we now have 112 hours of high-quality speech data! huggingface.co/datasets/CanCL…
1 reply · 1 retweet · 16 likes · 567 views
粵語計算語言學基礎建設組 CanCLID retweeted
Chaakming Lau @chaakming
I contributed a chapter titled "Ideologically Driven Divergence in Cantonese Vernacular Writing Practices" to J-F Dupré's forthcoming book "The Politics of Language in Hong Kong", releasing Dec 2024. It is part of a new book series on Hong Kong research. routledge.com/9781032648453
3 replies · 13 retweets · 43 likes · 3.4K views
粵語計算語言學基礎建設組 CanCLID
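The Hugging Face repository contains: 1. all source audio as webm; 2. the matching srt subtitles for each episode; 3. wav files segmented by subtitle and then resampled, ready to use directly as training data; 4. metadata.csv, the overall data file compiled from the subtitles. If you are unsure how to download with git or Hugging Face, feel free to leave a comment and ask.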
0 replies · 0 retweets · 0 likes · 175 views
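As a rough illustration of how per-episode srt subtitles can drive audio segmentation, the sketch below parses SRT cues and converts each cue's timing into sample offsets. It is not CanCLID's actual pipeline: the names `parse_srt` and `sample_range` are hypothetical, and the 16 kHz sample rate in the usage note is an assumption, since the post does not state the resampling rate.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # cue start, in seconds
    end: float    # cue end, in seconds
    text: str     # subtitle text, joined onto one line


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert hour/minute/second/millisecond fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append(Cue(_seconds(*fields[:4]), _seconds(*fields[4:]), text))
                break
    return cues


def sample_range(cue: Cue, sample_rate: int) -> Tuple[int, int]:
    """Return (first_sample, end_sample) for slicing a decoded waveform."""
    return round(cue.start * sample_rate), round(cue.end * sample_rate)
```

Given a waveform decoded at an assumed 16 kHz, `sample_range(cue, 16000)` gives the slice boundaries for one training clip, and `cue.text` is its transcript.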
粵語計算語言學基礎建設組 CanCLID
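The Zoeng Jyut Gaai storytelling speech dataset is officially released! It contains 66 hours of high-quality Cantonese speech data. Even if you are not building AI technology, you can download the webm audio and srt subtitles and simply listen to them as stories. The data can also be used for linguistic and literary research. Dataset homepage: canclid.github.io/zoengjyutgaai/
1 reply · 3 retweets · 27 likes · 676 views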
劉飛龍 @feilung_lau
@Can_CLID How do I run this? I downloaded the project as an archive, but I don't know how to run it.
1 reply · 0 retweets · 0 likes · 95 views
粵語計算語言學基礎建設組 CanCLID
The best Cantonese subtitle generator currently available: give it audio (.mp3, .wav, etc.) and it automatically produces an SRT file. It is free, open source, and more accurate than Subanana! Outside contributions and feedback are welcome! github.com/hon9kon9ize/yu…
1 reply · 6 retweets · 40 likes · 722 views
粵語計算語言學基礎建設組 CanCLID
@si_pbc What are the language and domain distributions of these 20 million hours of data? Is it English-only or multilingual? Is it all podcasts from YouTube, or does it also include music, background noise, etc.?
0 replies · 0 retweets · 1 like · 91 views
Standard Intelligence @si_pbc
Hertz-dev is an 8.5B parameter transformer trained on 20 million unique hours of high-quality audio data. We’ve released checkpoints and code for both mono and full-duplex generation on our website under the Apache license.
2 replies · 2 retweets · 77 likes · 6.5K views
Standard Intelligence @si_pbc
At Standard Intelligence we’ve been researching scalable cross-modality learning. We’re excited to share some early results in the form of 𝗵𝗲𝗿𝘁𝘇-𝗱𝗲𝘃, an open-source, first-of-its-kind base model for full-duplex conversational audio. 1/
52 replies · 132 retweets · 830 likes · 177.7K views
粵語計算語言學基礎建設組 CanCLID
Common Voice 19.0 is released, and Cantonese now has 209 validated hours of recordings! Thank you all for your support! This growth in data will soon show up in downstream speech applications, and we look forward to more high-quality Cantonese speech models.
1 reply · 7 retweets · 25 likes · 777 views
粵語計算語言學基礎建設組 CanCLID
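CanCLID's latest work makes its debut: Zoeng Jyut Gaai narrating Romance of the Three Kingdoms, currently the only free and open-source Cantonese TTS dataset on the web: huggingface.co/datasets/laubo… The dataset has only just been started and currently has just 55 minutes of audio. If you want to help speed up its expansion, please contact us!
1 reply · 5 retweets · 16 likes · 2.6K views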