Xiaoliu.x

212 posts

@xiaolGo

ex-algo engineer → architect → founder → researcher → ? My reading list: https://t.co/raW6A7xfMS https://t.co/b0JOLNr4bE

Joined January 2010

19 Following, 127 Followers

Pinned Tweet

Xiaoliu.x@xiaolGo·6 Dec

Sometimes dreaming is a thinking game, openreview.net/forum?id=HHsD9…

Xiaoliu.x@xiaolGo·20h

Current LLMs are trapped in the dilemma of incremental innovations.

Xiaoliu.x@xiaolGo·6d

@keonwkim For reference, you can check this JEPA variant: github.com/xiaol/leworldm…

Keon Kim@keonwkim·6d

I created minimal one-file implementations (160 LOC) of the JEPA family (ijepa, vjepa, vjepa2, cjepa) for educational purposes. Making things minimal and removing everything needed for scaling the algorithm has always helped my understanding, so I stripped everything but the algorithm parts. What's left is 160-200 lines of code that distill the essence of the mathematics. It is very easy to compare the math in the paper with the code and see how it can be implemented in PyTorch. I added [algo]_tutorial.md files to help with understanding. github.com/keon/jepa

Xiaoliu.x@xiaolGo·6d

@iScienceLuvr Yes, you can diffuse the embeddings, and of course you can diffuse attention and state too. Check my early work: arxiv.org/abs/2601.19221

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·12 May

I'm a simple man, I see a Kaiming He paper, I click. ELF: Embedded Language Flows This is very interesting, getting continuous diffusion models working for text! "Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network." @sedielem you might like this one!

OpenMOSE@_m0se_·9 May

RWKV-Gemma-5B-E2B-Preview-0.1(maybe 0.0.1??) it's "Architecture PoC" for SWA + RWKV-7 + TICA triple-hybrid on Gemma-4. End-to-end works, Passkey holds at ~90k — that's all I'm claiming for v0.1. Real target is the 26B MoE, back to that now. huggingface.co/OpenMOSE/RWKV-…

Xiaoliu.x@xiaolGo·10 May

@_m0se_ For a 90K context length, does this capability come from SFT on longer context training data?

Xiaoliu.x@xiaolGo·8 May

That's something!

antirez@antirez

Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other contributors. Thanks!

Xiaoliu.x retweeted

OpenMOSE@_m0se_·4 May

RWKV Linear Gemma + Search: it is starting to be able to generate a little.

Xiaoliu.x retweeted

BlinkDL@BlinkDL_AI·25 Apr

RWKV-LM now trains 40% faster and with 40% less VRAM 🙂 further optimizations WIP github.com/BlinkDL/RWKV-LM

Xiaoliu.x@xiaolGo·25 Apr

Doing experiments has never been this easy

Xiaoliu.x@xiaolGo·22 Apr

@mytechceoo @albertadevs @getaxal The voice echoes in the soul.

Jason@mytechceoo·21 Apr

CEO obsessed with token maxxing

Xiaoliu.x@xiaolGo·20 Apr

I'll release a post-training dataset recipe. Usually, you need standard datasets along with new data tailored to what you want the model to learn.

OpenMOSE@_m0se_

Hello world! RWKV Gemma4 E2B RWKV hxa07i + Tiny Infused Causal Attention Prime RWKV L7 + Efficient RWKV L28 Headsize 512 -> 128 Compression

Xiaoliu.x retweeted

BlinkDL@BlinkDL_AI·19 Apr

RWKV-7 G1f is here (13B/7B/3B/1B) and G1g in May. p.s. Gemma 4 is great at "uncheatable eval" confirming its effectiveness 🙂 pity there's no Qwen3.5 27B base

BlinkDL@BlinkDL_AI

RWKV-7 G1e is here (13B/7B/3B/1B). Although Qwen 3.5 is strong, we are improving every month too 🙂 G1f in April. (G1d models all released too).

Xiaoliu.x@xiaolGo·19 Apr

You'll find a way if you're short on GPU power.

DAIR.AI@dair_ai

NEW paper from Apple. Interesting idea: "Attention to Mamba". The paper introduces a two-stage recipe for cross-architecture distillation from Transformers into Mamba. Naive distillation collapses teacher performance. Their trick: first distill the transformer into a linearized-attention student using a kernel adaptation, then transfer that student into a pure Mamba with no attention blocks. On a 1B model trained on 10B tokens, the Mamba student hits 14.11 perplexity against a 13.86 Pythia-1B teacher, nearly matching quality at linear-time inference cost. If you can reliably convert trained transformers into state-space models without retraining from scratch, the entire open-weights ecosystem becomes cheaper to serve at long context. This is the kind of quiet infrastructure work that decides which architectures actually get deployed in agent stacks. Paper: arxiv.org/abs/2604.14191 Learn to build effective AI agents in our academy: academy.dair.ai

Xiaoliu.x@xiaolGo·13 Apr

@_m0se_ try this one arxiv.org/abs/2511.10643

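OpenMOSE@_m0se_·12 Apr

Today's finding: KL divergence "lies". When only attention is tuned with the MLPs frozen, a bypass network gets built and KL drops only in appearance, while the alignment of each hidden state gets worse. From here on this is speculation: this causes each layer's function to collapse, and multi-tracking ability degrades sharply in the latter half of training.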
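
To make the gap between the two metrics concrete, here is a minimal toy sketch in NumPy. It is not OpenMOSE's setup: the "teacher" and "student" are hypothetical single linear heads, and the student's hidden states are just a fixed rotation of the teacher's that the student's output head undoes. It only shows that output KL can be near zero while hidden states are poorly aligned; it does not reproduce the training dynamics described in the post.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_divergence(p_logits, q_logits):
    """Mean KL(p || q) over a batch of logit rows."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two batches of vectors."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
d, vocab, batch = 16, 32, 64

# Hypothetical teacher: hidden states and an output head.
h_teacher = rng.normal(size=(batch, d))
w_teacher = rng.normal(size=(d, vocab))

# Hypothetical student: hidden states are rotated by an orthogonal matrix,
# and the output head applies the inverse rotation, so logits are unchanged.
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
h_student = h_teacher @ rotation
w_student = rotation.T @ w_teacher

kl = kl_divergence(h_teacher @ w_teacher, h_student @ w_student)
cos = mean_cosine(h_teacher, h_student)
print(f"output KL: {kl:.2e}, hidden-state cosine: {cos:.3f}")
```

In this toy case the output distributions match up to floating-point error while the hidden states point in largely different directions, which is why a per-layer alignment metric is worth tracking alongside KL.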

Xiaoliu.x@xiaolGo·9 Apr

@alexandr_wang any incoming "codex"?

Alexandr Wang@alexandr_wang·8 Apr

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

Xiaoliu.x@xiaolGo·7 Apr

In every game, you need a cash cow to build your strategy.

Xiaoliu.x@xiaolGo·4 Apr

@karpathy But I'm too lazy to guide the LLM to digest information in a standard way. What you proposed bridges the gap between me and the LLM. It's a niche guideline, really inspiring!

Xiaoliu.x@xiaolGo·4 Apr

I have read thousands of papers using this kind of technology with Manim or podcasts. It's interesting how everything stays maintained within one project. The interactions sometimes make research feel like a gacha game, where insights can spark gracefully. You can also easily create quizzes and image overviews.

Andrej Karpathy@karpathy·2 Apr

LLM Knowledge Bases
Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:
Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.
IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).
Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.
Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.
Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searchers), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.
Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.
Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.
TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
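
The "small and naive search engine over the wiki" mentioned above could look roughly like the following. This is a minimal sketch under assumptions, not the author's tool: `search_wiki`, `tokenize`, the term-count scoring, and the demo file names are all hypothetical, chosen only to show keyword search over a directory of .md files.

```python
from pathlib import Path
import re
import tempfile

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def search_wiki(root, query, top_k=5):
    """Rank .md files under root by how many words match a query term."""
    terms = set(tokenize(query))
    scored = []
    for path in sorted(Path(root).rglob("*.md")):
        words = tokenize(path.read_text(encoding="utf-8"))
        score = sum(1 for word in words if word in terms)
        if score > 0:
            scored.append((score, path.relative_to(root).as_posix()))
    # Highest score first; ties broken alphabetically by path.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]

# Demo on a throwaway wiki with made-up articles.
with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp)
    (wiki / "concepts").mkdir()
    (wiki / "concepts" / "jepa.md").write_text(
        "# JEPA\nJEPA predicts embeddings. JEPA variants differ in masking.\n",
        encoding="utf-8",
    )
    (wiki / "concepts" / "rwkv.md").write_text(
        "# RWKV\nRWKV is a linear-time recurrent architecture.\n",
        encoding="utf-8",
    )
    (wiki / "index.md").write_text(
        "# Index\nSee the JEPA and RWKV articles.\n", encoding="utf-8"
    )
    results = search_wiki(wiki, "jepa masking")

for score, name in results:
    print(score, name)
```

Exposing a function like this through a small CLI is one way an LLM agent could call it as a tool; a real setup would likely want better ranking than raw term counts.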

Xiaoliu.x retweeted

OpenMOSE@_m0se_·3 Mar

Sometime around next month, I plan to convert the GDN layers and Attn layers of Qwen3.5 9B to RWKV 07E-NoPE, and play around with a Prefix State-tuning implementation and MRSS.

Xiaoliu.x@xiaolGo·27 Feb

@AnalemmaAI

QME

Analemma@AnalemmaAI·23 Feb

🎉 After 228.5 hrs of continuous operation, FARS has completed its 100th research paper at T+228:28:33. 🚀 During this public deployment experiment, FARS consumed 11.4B tokens at a total cost of $104K and generated 244 hypotheses along the way. On average, each paper took approximately 2 hrs 17 mins to produce, consumed around 114M tokens, and cost about $1,040. 📝 We’ve used paperreview.ai (developed by the Stanford ML Group) to conduct AI-based reviews of the completed papers. The 100 papers received scores ranging from 3.0 to 6.3, with an average score of 5.05. Scores were heavily concentrated around 5.2 (the most frequent score, approximately 57 papers). A small number of papers scored between 3.0–4.5, while only a few exceeded 6.0. Research proposals, experimental code, final papers, and AI review results for all completed projects are now available on our website (analemma.ai/fars). This deployment focused on producing short-form papers and was not specifically optimized for traditional conference-style evaluation frameworks such as ICLR. Therefore, assessments from AI reviewers based on existing academic conference standards (including the Stanford Agentic Reviewer) should be considered for reference only. We’re also conducting an independent human quality assessment and will share the findings once the evaluation is complete. 📃 FARS Research Runs: analemma.ai/fars 📦 FARS GitLab: gitlab.com/fars-a #AI #LLM #research
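
The per-paper averages quoted above follow directly from the totals. As a quick sanity check, this short sketch recomputes them from the figures in the post (228.5 hours, 11.4B tokens, $104K, 100 papers); the variable names are illustrative only.

```python
# Totals as stated in the post.
hours_total = 228.5
tokens_total = 11.4e9
cost_total = 104_000
papers = 100

# Per-paper averages.
hours_per_paper = hours_total / papers
whole_hours = int(hours_per_paper)
minutes = round((hours_per_paper - whole_hours) * 60)
tokens_per_paper = tokens_total / papers
cost_per_paper = cost_total / papers

print(f"time per paper: {whole_hours} h {minutes} min")
print(f"tokens per paper: {tokens_per_paper / 1e6:.0f}M")
print(f"cost per paper: ${cost_per_paper:,.0f}")
```

The results agree with the stated averages of about 2 hrs 17 mins, 114M tokens, and $1,040 per paper.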
