Lingjun_C

12 posts

@DDDDDomain

RA @ NUS; PKU 2026

Joined September 2024
43 Following · 13 Followers
Lingjun_C reposted
LobeHub @lobehub ·
Introducing LobeHub: agent teammates that grow with you. LobeHub is the ultimate space for work and life: to find, build, and collaborate with agent teammates that grow with you. We’re building the world’s first and largest human–agent co-evolving network.

Two years ago, we built LobeChat, an open-source interface for using different AI models. Today, LobeChat has 70k+ GitHub stars and serves 6M+ users worldwide. Fully unlocking the power of models has always been a shared mission between us and the community.

We started with interaction — a fundamentally new, agent-first experience. Agents are no longer passive tools invoked in a single conversation; they should be proactive, always-on units of work. Treating agents as the minimal atomic unit is also the core of our agent-harness infrastructure.

Today’s agents are mostly one-off executors. Even with memory, it’s often global — and hallucinates. We build long-term agent teammates that evolve with users. Each agent has its own dedicated memory space, editable by users, allowing humans and agents to co-evolve over time. This, in turn, lets us design clearer rewards for reinforcement learning and create cleaner environments for continual learning.

Agent teammates can work in groups. Through a multi-agent system, agent groups operate faster, more cost-effectively, and go beyond what single-agent systems can achieve. For example, a single agent often requires heavy user involvement to proceed step by step, whereas LobeHub can execute the same work from a single instruction, with a supervisor orchestrating agents that run in parallel or debate to produce better results.

We are building the collaboration network among agent teammates — and between humans and agent teammates as well. Ease of use matters; AI intelligence and shared human intelligence are equally important.

With simple instructions and tool selection, you can effortlessly build and team up with agent coworkers to deliver complex, systematic work — even assembling a quant team to execute trades. Through the LobeHub community, anyone can discover, reuse, and remix agents and agent groups, customizing them to fit their own workflows, preferences, and needs.

Last but not least, our vision started with LobeChat: multi-model support is the most efficient approach for users. We believe different models excel in different scenarios. By routing across multiple models, LobeHub improves cost efficiency and unlocks capabilities that a single-model setup cannot easily support.
82 replies · 69 reposts · 322 likes · 183.8K views
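The supervisor pattern described above (one instruction fanned out to agents that run in parallel, with a supervisor merging the results) can be sketched as follows. The agent stubs and function names here are hypothetical stand-ins; LobeHub's actual agent API is not shown in this thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an agent teammate; illustrative only.
def run_agent(name: str, task: str) -> str:
    return f"{name}: draft answer for {task!r}"

def supervise(task: str, agents: list[str]) -> str:
    """Fan a single instruction out to several agents in parallel, then
    merge their drafts. A real supervisor would rank, debate, or revise
    the drafts instead of simply concatenating them."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        drafts = list(pool.map(lambda name: run_agent(name, task), agents))
    return "\n".join(drafts)

print(supervise("summarize Q3 metrics", ["analyst", "critic", "writer"]))
```

The key point of the sketch is that the user issues one instruction while the parallelism and aggregation are hidden behind the supervisor.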
Lingjun_C reposted
Eval Sys @EvalSysOrg ·
MCPMark Leaderboard Update 🚀
🌟 DeepSeek-V3.2-thinking jumps to the #1 spot among open-source models — and we’re honored to see MCPMark cited in the @deepseek_ai technical report.
⚡️ Gemini 3 Pro High @GoogleDeepMind now leads with the highest pass@1 and pass@4 success rates.
This update brings two newly released models onto the leaderboard: Gemini 3 and DeepSeek-V3.2.
[2 images]
2 replies · 8 reposts · 12 likes · 1.4K views
Lingjun_C reposted
Jiawei Gu @Kuvvius ·
🚨Sensational title alert: we may have cracked the code to true multimodal reasoning. Meet ThinkMorph — thinking in modalities, not just with them. And what we found was... unexpected. 👀 Emergent intelligence, strong gains, and …🫣 🧵 arxiv.org/abs/2510.27492 (1/16)
[image]
27 replies · 65 reposts · 316 likes · 68.7K views
Lingjun_C reposted
Jinjie Ni @NiJinjie ·
More repeats = more intelligence 🧬
We scaled up the crossover runs to 1.5 trillion tokens, with 10B unique. The result? 😵 A clear crossover — and a strong 1.7B coder — without any fancy tricks.
We wrote a full paper on when and how diffusion language models surpass AR models, with 360° in-depth insights.
Paper (main URL): jinjieni.github.io/dlms-are-super…
Paper (backup URL): gitee.com/JinjieNi/dlms-…
GitHub: github.com/JinjieNi/dlms-…
🧵 1/7
[image]
Quoted: Jinjie Ni @NiJinjie

Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.
Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks.
> No saturation: more repeats = more gains.
🚨 "x.openreview.net" We also dissected the serious methodological flaws in our parallel work “Diffusion Beats Autoregressive in Data-Constrained Settings” — let’s raise the bar for open review!
🔗 Blog & details: jinjieni.notion.site/Diffusion-Lang…
18 🧵s ahead:

6 replies · 36 reposts · 200 likes · 32K views
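As a quick sanity check on the repetition figures in this thread: 1.5T total tokens over 10B unique implies 150 passes over the data, and the quoted run's "480B tokens, 480 epochs" implies roughly 1B unique tokens (an inference from those two numbers, not stated explicitly).

```python
def epochs(total_tokens: float, unique_tokens: float) -> float:
    """Number of passes over the unique data implied by a token budget."""
    return total_tokens / unique_tokens

# 1.5T total over 10B unique -> 150 passes over the unique set
print(epochs(1.5e12, 10e9))  # 150.0
# 480B total over ~1B unique matches the quoted "480 epochs"
print(epochs(480e9, 1e9))    # 480.0
```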
Lingjun_C reposted
Michael Qizhe Shieh @michaelqshieh ·
Your agent can call tools; can it close the loop? We stress-tested MCP with 127 CRUD-heavy tasks across 5 MCPs and >30 models, using a minimal but general MCPMark-Agent for fair comparison.
📄 Paper: arxiv.org/pdf/2509.24002
🌐 Website: mcpmark.ai
💻 Code: github.com/eval-sys/mcpma…
🤗 Daily Papers: huggingface.co/papers/2509.24…
GPT-5 reaches 52.56% pass@1 and 33.86% pass^4, yet widely regarded strong models such as claude-sonnet-4 and o3 remain below 30% pass@1 and 15% pass^4. The newest claude-sonnet-4.5 improves to 32.1% pass@1 and 16.5% pass^4 — just crossing the 30% line.
The full report dives into data distributions, failure modes, and case studies (PASS vs. FAIL), plus a trajectory explorer to debug agents yourself.
👉 Our leaderboard already tracks models and MCP servers, and will soon support agent submissions — we welcome the community to submit results!
Key insights in thread ⬇️
[image]
2 replies · 21 reposts · 57 likes · 11.8K views
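The pass@1 and pass^4 numbers above measure different things: pass@k is the chance that at least one of k attempts succeeds, while pass^k is the chance that all k succeed. Both can be estimated from n runs per task with c observed successes; this is a standard sketch (the paper's exact aggregation may differ).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of
    k sampled attempts succeeds, given c successes over n runs."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of pass^k: probability that ALL k sampled attempts succeed."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# a task attempted n=4 times with c=2 successes
print(pass_at_k(4, 2, 1))   # 0.5
print(pass_hat_k(4, 2, 2))  # 1/6 ≈ 0.1667
```

Because pass^k multiplies reliability requirements across runs, it is always at most pass@1, which is why the pass^4 scores above sit well below the pass@1 scores.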
Lingjun_C reposted
Qwen @Alibaba_Qwen ·
🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here!
🔹 80B params, but only 3B activated per token → 10× cheaper training, 10× faster inference than Qwen3-32B (esp. at 32K+ context!)
🔹 Hybrid architecture: Gated DeltaNet + Gated Attention → best of speed & recall
🔹 Ultra-sparse MoE: 512 experts, 10 routed + 1 shared
🔹 Multi-Token Prediction → turbo-charged speculative decoding
🔹 Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning & long-context
🧠 Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.
🧠 Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.
Try it now: chat.qwen.ai
Blog: qwen.ai/blog?id=4074cc…
Hugging Face: huggingface.co/collections/Qw…
ModelScope: modelscope.cn/collections/Qw…
Kaggle: kaggle.com/models/qwen-lm…
Alibaba Cloud API: alibabacloud.com/help/en/model-…
[image]
173 replies · 686 reposts · 4.1K likes · 929.1K views
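The "ultra-sparse MoE: 512 experts, 10 routed" line boils down to top-k gating: each token scores all experts and activates only its top 10. A minimal NumPy sketch with hypothetical shapes; the always-on shared expert, the actual expert computation, and load balancing are omitted, and Qwen3-Next's real gating may differ.

```python
import numpy as np

def sparse_moe_route(x, w_router, num_routed=10):
    """Pick each token's top `num_routed` experts and softmax-normalize
    their gate weights over just those chosen experts."""
    logits = x @ w_router                               # [tokens, num_experts]
    top = np.argsort(logits, axis=-1)[:, -num_routed:]  # top-k expert indices
    chosen = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(chosen - chosen.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the k chosen
    return top, weights

# hypothetical shapes: 4 tokens, hidden size 64, 512 experts
rng = np.random.default_rng(0)
idx, w = sparse_moe_route(rng.normal(size=(4, 64)), rng.normal(size=(64, 512)))
print(idx.shape, w.shape)  # (4, 10) (4, 10)
```

This is why only ~3B of the 80B parameters are active per token: each token touches 10 routed experts plus 1 shared expert instead of all 512.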
Lingjun_C reposted
Eval Sys @EvalSysOrg ·
MCPMark Leaderboard Update 🚀
🌟 Qwen-3-Coder takes the #1 spot among open-source models, with an impressive per-run cost of just $36.46.
⚡️ Grok-Code-Fast-1 delivers the lowest per-run cost ($16.08) and the fastest average agent time (156.63 s) across the top 10 models.
Kimi-K2-0905 outperforms the earlier Kimi-K2 in success rate, though at nearly double the per-run cost and average agent time.
Notably, Qwen-3-Coder achieves a success rate close to o3, but at roughly one-third the per-run cost — offering the community a highly cost-effective option for MCP tool-use applications.
This update introduces three newly released models to the leaderboard: Qwen-3-Max, Grok-Code-Fast-1, and Kimi-K2-0905.
[3 images]
5 replies · 21 reposts · 133 likes · 94.5K views
Lingjun_C @DDDDDomain ·
🚀🚀 Just launched MCPMark, a challenging MCP benchmark I participated in. Its filesystem section includes ops on files, structure exploration, reasoning, and multi-skill tasks. Most models show clear room for improvement, while the GPT series excels at precise text manipulation.
Quoted: Michael Qizhe Shieh @michaelqshieh

Introducing MCPMark, a collaboration with @EvalSysOrg and @lobehub! We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 takes the current lead with a pass@1 of 46.96%, while the other models fall in the 10–30% range.
- Diverse test cases on Notion, GitHub, Filesystem, Playwright (browser), and Postgres.
9 🧵s ahead

0 replies · 2 reposts · 7 likes · 800 views
Lingjun_C reposted
Michael Qizhe Shieh @michaelqshieh ·
Introducing MCPMark, a collaboration with @EvalSysOrg and @lobehub! We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 takes the current lead with a pass@1 of 46.96%, while the other models fall in the 10–30% range.
- Diverse test cases on Notion, GitHub, Filesystem, Playwright (browser), and Postgres.
9 🧵s ahead
[image]
4 replies · 50 reposts · 169 likes · 160.3K views
Lingjun_C reposted
Michael Qizhe Shieh @michaelqshieh ·
To me, diffusion LMs work because they remove unnecessary inductive biases. The left-to-right inductive bias is natural for humans but is unlikely to be natural for AI. Removing it gives our models more capacity, much as Transformers have greater capacity than LSTMs. Our experimental results show diffusion outperforming autoregressive models by large margins. We might enter a new paradigm if this trend holds at larger model scales. 🎅
Quoted: Jinjie Ni @NiJinjie

Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.
Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks.
> No saturation: more repeats = more gains.
🚨 "x.openreview.net" We also dissected the serious methodological flaws in our parallel work “Diffusion Beats Autoregressive in Data-Constrained Settings” — let’s raise the bar for open review!
🔗 Blog & details: jinjieni.notion.site/Diffusion-Lang…
18 🧵s ahead:

12 replies · 24 reposts · 251 likes · 44.2K views