Junyang Lin

3.2K posts

@JustinLin610

❤️ 🍵 ☕️ 🍷 🥃 🍺

Joined December 2015

2K Following · 87.1K Followers

Junyang Lin@JustinLin610·24 Apr

why preview

DeepSeek@deepseek_ai

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/De… 🤗 Open Weights: huggingface.co/collections/de… 1/n

343

81.3K

Junyang Lin@JustinLin610·22 Apr

@ysu_nlp @NeoCognition cong to u yu! a brilliant startup idea about specialized agents

Yu Su@ysu_nlp·21 Apr

Introducing @NeoCognition, the agent lab for specialized intelligence. Everyone needs experts, but human expertise does not scale. Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.

134

875

174.6K

Junyang Lin@JustinLin610·13 Apr

i do like this passage, and here are some thoughts: 1. critical thinking is essential in the era of agents. i still remember that many years ago when i studied the lesson of critical thinking, i learned that keeping debating with yourself by listing out reasons can really deepen your thinking. today, critical thinking becomes humans debating with agents, so that they can think more deeply together and analyze problems in a more comprehensive way. 2. designing a healthy and well-structured organization and system is essential for creation and building. with systematic support and efficient tooling, humans can work exponentially more effectively together with agents. that gives people more time to take care of their physical and mental health, while also exploring new opportunities. 3. new era often favors newbies, because they have less past experience and therefore less fear of current difficulties. what oldbies should really think about is which parts of their experience are actually worth leveraging. from my perspective, we should think more carefully about which experiences are truly aligned with first principles. but anyway, ai first is super, super exciting!

Peter Pang@intuitiveml

x.com/i/article/2043…

245

55.4K

Junyang Lin@JustinLin610·12 Apr

we need agent evals that are really consistent with real world usages. otherwise people are optimizing foundation models for the wrong direction. the problem of targeting is even bigger than benchmaxxing.

239

27.5K

Junyang Lin reposted

Dawn Song@dawnsongtweets·10 Apr

x.com/MogicianTony/s… 🧵 1/ Our agent Terminator-1 scored ~100% on 8 major AI agent benchmarks, e.g., SWE-bench Verified & Pro, Terminal-Bench, beating Claude Mythos. It solved 0 tasks. Benchmarks are the field's shared language for measuring AI progress. Our new work shows that language is broken. Here’s how.

Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

334

89.2K

Junyang Lin@JustinLin610·8 Apr

@elonmusk beat mythos?

11.1K

Elon Musk@elonmusk·8 Apr

SpaceXAI Colossus 2 now has 7 models in training: - Imagine V2 - 2 variants of 1T - 2 variants of 1.5T - 6T - 10T Some catching up to do.

6.7K

8.1K

68.7K

28.3M

Junyang Lin@JustinLin610·8 Apr

happy horse is insanely happy

Chetaslua@chetaslua

🚨 Happy Horse First Output This model beats seedance 2 on artificial analysis for more information check quoted tweet

24.9K

Junyang Lin@JustinLin610·8 Apr

unbelievable...

Anthropic@AnthropicAI

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing

166

35.2K

Junyang Lin@JustinLin610·7 Apr

@Presidentlin @Zai_org i am enjoying the culture and sightseeing of my country so much

1.5K

Lincoln 🇿🇦@Presidentlin·7 Apr

@JustinLin610 @Zai_org Still wild seeing you on the streets, brother. How is life treating you? It seems like you are on a small vacation.

1.5K

Z.ai@Zai_org·7 Apr

Introducing GLM-5.1: The Next Level of Open Source - Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo. - Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations. Blog: z.ai/blog/glm-5.1 Weights: huggingface.co/zai-org/GLM-5.1 API: docs.z.ai/guides/llm/glm… Coding Plan: z.ai/subscribe Coming to chat.z.ai in the next few days.

549

1.3K

10.9K

4.3M

Junyang Lin@JustinLin610·7 Apr

tokenmaxxing vs. ironmaxxing lol. it should be an era where results matter but it seems not.

14.7K

Junyang Lin@JustinLin610·5 Apr

150

14.8K

Junyang Lin@JustinLin610·3 Apr

mountain climbing is so funny

108

17.6K

Junyang Lin@JustinLin610·1 Apr

@percyliang Congratulations!

2.6K

Percy Liang@percyliang·1 Apr

Academic titles are funny. After 14 years, I finally have the official title that people might have always assumed I had.

1.3K

115.3K

Junyang Lin reposted

ollama@ollama·31 Mar

Ollama is now updated to run the fastest on Apple silicon, powered by MLX, Apple's machine learning framework. This change unlocks much faster performance to accelerate demanding work on macOS: - Personal assistants like OpenClaw - Coding agents like Claude Code, OpenCode, or Codex

292

733

5.8K

774.3K

Junyang Lin@JustinLin610·31 Mar

model+harness is now over model only. agent perf can be significantly influenced by the design and quality of harness. i do believe this is a right direction, nice work!

Yoonho Lee@yoonholeee

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end

578

78.5K

Junyang Lin@JustinLin610·27 Mar

@Yuchenj_UW capybara? seriously?

443

51.5K

Yuchen Jin@Yuchenj_UW·27 Mar

Anthropic’s new model, Capybara: “Compared to Claude Opus 4.6, Capybara achieves dramatically higher scores in software coding, academic reasoning, and cybersecurity.” According to Dario's previous interview, it might be a 10T-parameter model that cost $10 billion to train.

215

196

3.5K

623.8K

Junyang Lin@JustinLin610·26 Mar

@YouJiacheng mdga

You Jiacheng@YouJiacheng·25 Mar

Dense model won again... (27B dense beat 397B-A17B MoE).

stevibe@stevibe

Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I included Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled too. Only two models went all green: the 27B dense and the distilled 27B. The 397B? Failed two tests. The 122B? Failed one. The 35B? Failed two. The timed-out results — mostly on the smaller models, are cases where the model got stuck in a loop, repeating the same tool call until it hit the 30-second limit. The test that exposed the most models: "Search for Iceland's population, then calculate 2% of it." Simple, but 35B, 122B, and 397B all used a rounded number from memory instead of the actual search result. They didn't trust their own tool output. Small models hallucinate data. Big models ignore data. The 27B just threaded it through.

148

38.2K

Junyang Lin@JustinLin610·26 Mar

x.com/i/article/2037…

595

858.3K

Junyang Lin reposted