Masaki Waga (@MasWag) - Twitter Profile

Pinned Tweet

Masaki Waga@MasWag·Jan 1

The timing probably isn't great, but I wrote a start-of-2024 post. The only part that really matters is that I will mostly be off Twitter (at least for the time being). maswag.github.io/blog/posts/202…

Masaki Waga retweeted

ヤバイテックトーキョー@技術書典20 さ15@YabaitechTokyo·Apr 4

We had Zundamon promote our booth at this event! (We are using nyanko3141592's template 👉 github.com/nyanko3141592/…)

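Masaki Waga retweeted

ヤバイテックトーキョー@技術書典20 さ15@YabaitechTokyo·Apr 3

The manuscript of our new issue, yabaitech.tokyo vol.8, is almost finished. This time it has three features: building your own proof assistant, model checking, and memory models! #技術書典 #技術書典20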

Masaki Waga@MasWag·Mar 27

It looks like the STTT special issue, my last job as a track chair of QEST+FORMATS'24, has now been safely published. doi.org/10.1007/s10009…

Masaki Waga@MasWag·Mar 12

RT> News of an award for our SoftMatcha 2.

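Masaki Waga retweeted

E869120@e869120·Mar 12

My first-author paper, SoftMatcha 2, won an Outstanding Paper Award (top 16 of 797 papers) at the Association for Natural Language Processing's annual meeting, #NLP2026! This was my first research on AI and natural language processing, and I am truly grateful that it was rated so highly.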

Masaki Waga retweeted

Takuya Akiba@iwiwi·Feb 19

ITmedia AI＋ wrote an article about SoftMatcha 2. Thank you! itmedia.co.jp/aiplus/article…

ITmedia AI＋@itm_aiplus

"Fuzzy" search system "SoftMatcha 2" developed by the University of Tokyo, Kyoto University, Sakana AI, and others: fast search over ever-growing AI training data itmedia.co.jp/aiplus/article…

Masaki Waga retweeted

Sakana AI@SakanaAILabs·Feb 12

Introducing SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Pre-Training Corpora softmatcha.github.io/v2/

What lies within a trillion-scale pre-training corpus? Can you truly guarantee your benchmarks are uncontaminated simply because there are no exact string matches?

Alongside several research institutions in Japan, Sakana AI is proud to have collaborated in the development of SoftMatcha 2, an ultra-fast and flexible search tool that enables search over trillion-scale natural language corpora in under 0.3 seconds, even while handling semantic variations (substitution, insertion, and deletion). No existing tool meets all these criteria, including infini-gram-mini (EMNLP’25 Best Paper) or the original SoftMatcha (ICLR’25).

Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning.

As a practical application, we demonstrate that SoftMatcha 2 identifies potential benchmark contamination in pre-training corpora that existing exact-match approaches miss. You can try searching through a 100B-scale corpus via our online demo. The system remains blazingly fast even on trillion-token corpora, so we encourage you to host it yourself for larger scales.

Demo: …-website-ap-northeast-1.amazonaws.com
Paper: arxiv.org/abs/2602.10908
Code: github.com/softmatcha/sof…

This work is a collaboration with researchers from the University of Tokyo, NII, Kyoto University, SOKENDAI, NINJAL, Tohoku University, and RIKEN.
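As a rough illustration of the two ideas named in the announcement (exact lookup on a suffix array, and pruning relaxed queries against the corpus), here is a toy Python sketch. It is not SoftMatcha 2's implementation: the function names, the naive in-memory suffix array, and the hand-written `substitutes` table (standing in for embedding-based similarity) are all assumptions made for illustration, and only substitution is handled, not insertion or deletion.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction for illustration only; real systems use
    far more efficient (and, per the announcement, disk-aware) designs.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    """Binary search over the suffix array.

    upper=False: first index whose suffix prefix is >= pattern.
    upper=True:  first index whose suffix prefix is >  pattern.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, sa, pattern):
    """Count exact occurrences of a token sequence in the corpus."""
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def soft_search(tokens, sa, query, substitutes):
    """Expand the query one token at a time with hypothetical substitutes,
    dropping any partial variant that never occurs in the corpus, so the
    candidate set does not grow exponentially with query length.
    """
    prefixes = [[]]
    for word in query:
        candidates = [word] + substitutes.get(word, [])
        prefixes = [
            p + [c]
            for p in prefixes
            for c in candidates
            if count_occurrences(tokens, sa, p + [c]) > 0
        ]
    return {" ".join(p): count_occurrences(tokens, sa, p) for p in prefixes}
```

For example, with the corpus `"the cat sat on the mat and the dog sat on the rug".split()` and `substitutes = {"cat": ["dog", "bird"]}`, the query `["the", "cat", "sat"]` keeps the variants "the cat sat" and "the dog sat", while "the bird" is pruned as soon as it fails to occur in the corpus.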

Masaki Waga retweeted

Takuya Akiba@iwiwi·Feb 12

We've launched SoftMatcha 2, developed with my friends in academia. It's a blazing-fast search tool for trillion-token pre-training datasets that supports semantic variants. We showcase its use in detecting benchmark contamination in our paper.

🌐 Demo: …-website-ap-northeast-1.amazonaws.com
📄 Paper: arxiv.org/abs/2602.10908
💻 Code: github.com/softmatcha/sof…
🍵 Project: softmatcha.github.io/v2/

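Masaki Waga@MasWag·Feb 12

RT> This is the episode where our ICLR'25 work comes back stronger. Personally, I see it as the second installment, this time in fast string matching, of my long-running research direction of making the qualitative/Boolean world quantitative.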

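Masaki Waga retweeted

E869120@e869120·Feb 12

[Announcement] We have released SoftMatcha 2, a tool for searching usage examples, including similar words, in very large corpora! It searches about 260 billion words, among the largest for a search tool in Japan, in just 0.1 seconds. It supports 7 languages, including English and Japanese. Please give it a try! softmatcha.github.io/v2/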

Masaki Waga retweeted

sho_yokoi@sho_yokoi·Feb 12

🍦 SoftMatcha 2 project page: softmatcha.github.io/v2/ 🗣️ We will also present it this weekend, on 2/14, at #言語学フェス, and on 3/10 at #NLP2026. Come and visit. (言語学フェス) sites.google.com/view/lingfes20… soft-monarch-ccb.notion.site/A-12-2e50d922f… (NLP) anlp.jp/nlp2026/ #Q1-5 anlp.jp/proceedings/an…

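Masaki Waga retweeted

sho_yokoi@sho_yokoi·Feb 12

We somehow ended up with a tool that can search usage examples in a trillion-word-scale corpus on the order of 0.1 seconds. It also supports semantic substitution, insertion, and deletion. With two serious pros on board, the world-renowned Takuya Akiba and E869120, who took a historic first world 2nd place at ICPC, it somehow runs at a size I thought could never work. Give it a try.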

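Masaki Waga retweeted

Takuya Akiba@iwiwi·Feb 12

I took part in the development of SoftMatcha 2, which can search huge LLM pre-training data blazingly fast. We have now released the demo, paper, source code, and more, so please try it! softmatcha.github.io/v2/

It searches trillion-token-scale data in the 0.1-second range while supporting substitution, insertion, and deletion based on semantic similarity, which is a pretty crazy level of performance. I believe it greatly outperforms all existing tools, including infini-gram-mini, the EMNLP'25 Best Paper. It achieves fast search by using a disk-aware suffix array with a data layout specialized for this use, while pruning the substitution, insertion, and deletion candidates, which would otherwise grow exponentially, based on the actual data.

As a case study of what searching pre-training data at this scale makes possible, the paper examines benchmark contamination. There also appeared to be cases of contamination that exact search alone, as in infini-gram-mini, cannot find. The demo currently lets you try searching data on the scale of several hundred billion tokens. The code is also public, so you can try larger-scale cases by hosting it yourself.

🌐 Demo: …-website-ap-northeast-1.amazonaws.com
📄 Paper: arxiv.org/abs/2602.10908
💻 Code: github.com/softmatcha/sof…

Working with the SoftMatcha team, starting with the young talent @e869120, was very stimulating and I learned a lot. It was fun! Thank you! @shiatsumat @go2oo2 @ksuenaga @MasWag @sho_yokoi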

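Masaki Waga retweeted

田中一敏 Kazutoshi TANAKA@sports_robots·Dec 31

On the #クラウドファンディング (crowdfunding) platform CAMPFIRE, we have started seeking support for PubListAuto, a service that makes managing researchers' publication records easier. It comprehensively gathers information on your own papers and other presentations from the web, whether domestic or international, large or small, and makes it usable. We appreciate your support! camp-fire.jp/projects/90761…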

Masaki Waga@MasWag·Dec 17

Also, on the same day, 12/15, I gave a talk roughly corresponding to this at QuantFormal, a workshop co-located with FSTTCS (quantformal-2025.vercel.app). The slides are also public.

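Masaki Waga@MasWag·Dec 15

The tutorial paper on black-box checking, whose first draft I roughly wrote about a year ago, is now publicly available, so let me advertise it. It is actually the first paper I have written in Japanese. doi.org/10.11509/iscie…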