忍者

2.6K posts

忍者

@byteprobe

the synergy between systematic data transformations and matrix computations.

now Katılım Ocak 2022

0 Takip Edilen29 Takipçiler

Sabitlenmiş Tweet

忍者@byteprobe·26 May

make it work. make it legible. make it reliable. then make it interesting.

English

John Jumper@JohnJumperSci·1d

A bit of news: After nearly 9 years, I have decided to leave Google DeepMind and join Anthropic (after taking some time to recharge). I am incredibly grateful for my time at GDM. @demishassabis took a real chance letting me lead the AlphaFold team just six months after finishing my PhD, and the entire GDM team taught me so much about how to do great science. GDM is a special place, and I’ll still be excited to hear about what amazing things they discover next.

English

576

893

13.4K

5.4M

忍者@byteprobe·13h

@JohnJumperSci @demishassabis wow, congratulations! excited to see what this next chapter brings. and congrats to the entire anthropic team as well.

English

530

忍者@byteprobe·13h

what is the best model of reality we can construct today, knowing that tomorrow may reveal a better one?

English

忍者 retweetledi

OpenAI@OpenAI·2d

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure. That’s the idea behind our new research on training models to be broadly and persistently beneficial. alignment.openai.com/beneficial-rl/

English

197

250

2.7K

318K

忍者 retweetledi

Cursor@cursor_ai·2d

Introducing /automate, a skill for agents to set up automations for you. Describe your task in plain language. Cursor configures the triggers, instructions, and tools.

English

118

264

3.3K

237K

忍者@byteprobe·1d

👀

jietang@jietang

@elonmusk @teortaxesTex won’t take that long

ART

忍者 retweetledi

Artificial Analysis@ArtificialAnlys·1d

Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files. We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently-released GLM 5.2 (max, 1266) from @Zai_org. Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40. AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario has been released via @huggingface as a representation of scenario structure, submission, and grading (AA-Briefcase Lite). This does not count toward official AA-Briefcase results, and is demonstrative only. Key elements of AA-Briefcase: ➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups ➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradiction, testing whether models can navigate the ambiguity of real-world knowledge work ➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor ➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work Key results: ➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: This is followed by Claude Opus 4.8 (1356) with the next-best non-Anthropic model, GLM-5.2 (max), ~90 points back at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase ➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost ➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria ➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models More details below in thread ⬇️

English

805

190.5K

忍者 retweetledi

OpenAI Developers@OpenAIDevs·2d

Show Codex a workflow once. Reuse it as a skill. Record & Replay lets you show Codex a recurring task, like filing an expense report or submitting a time-off request. Codex turns that demo into an inspectable, editable skill. You control when recording starts and stops.

English

472

1.2K

12.8K

4.1M

Factory@FactoryAI·3d

Introducing AutoWiki in Factory.

Română

475

98.3K

忍者@byteprobe·2d

we live in the future.

Midjourney@midjourney

Announcing a new division of Midjourney called "Midjourney Medical"

English

Midjourney@midjourney·2d

A technical dive inside our new "Midjourney Scanner"

Français

1.2K

3.1K

27.4K

10.3M

忍者@byteprobe·2d

@midjourney this is so cool!

English

Midjourney@midjourney·2d

Announcing a new division of Midjourney called "Midjourney Medical"

English

2.5K

5.5K

39.3K

17.7M

忍者@byteprobe·2d

@FactoryAI thank you.

English

忍者 retweetledi

Cursor@cursor_ai·4d

We're launching code storage and git hosting. Origin gives teams and agents a place to host, review, and collaborate on code. Available this fall. Join the waitlist. cursor.com/origin-waitlist

English

580

1.3K

14.7K

5.6M

忍者@byteprobe·2d

@midjourney wow! just wow!

English

忍者 retweetledi

Noam Shazeer@NoamShazeer·2d

I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there. It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to work with all of you.

English

970

862

16.1K

9.3M

忍者 retweetledi

Charlie Marsh@charliermarsh·3d

At OpenAI, we're continuing to bet on Rust as the future of systems programming. I'm proud to announce that we're making a $600,000 commitment to the Rust Foundation, which combines our Platinum membership with additional support for maintainer efforts across the Rust ecosystem.

English

143

253

4.8K

622.2K

忍者 retweetledi

OpenAI@OpenAI·3d

Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. Developed with 173 scientists from biotechnology and pharmaceutical research, LifeSciBench includes 750 expert-authored tasks across seven biological research workflows. openai.com/index/introduc…

English

213

318

3.1K

577.3K

忍者 retweetledi

Artificial Analysis@ArtificialAnlys·3d

Z ai’s GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index scoring 51 and it sits on the Pareto frontier of Intelligence vs Cost per Task @Zai_org’s GLM-5.2 is the same size as GLM-5.1 (744B total / 40B active parameters) but scores 11 points higher on the Intelligence Index v4.1, placing ahead of MiniMax-M3 (44) and DeepSeek V4 Pro (max, 44). On the first-party API it is priced in line with GLM-5.1 at $1.4/$4.4/$0.26 per 1M input/output/cache hit tokens Key results: ➤ GLM-5.2 is the leading open weights model on the Intelligence Index v4.1. At 51, it leads MiniMax-M3 (44), DeepSeek V4 Pro (max, 44) and Kimi K2.6 (43) ➤ Improvements across most evaluations, particularly scientific reasoning: GLM-5.2 gains over GLM-5.1 on most evaluations, led by scientific reasoning on CritPt (+16 points to 21%) and HLE (+12 points to 40%), alongside AA-LCR (+9 points to 71%), tau3 banking (+15 points to 27%) and SciCode (+7 points to 50%). TerminalBench v2.1 also improves (+16 points to 78%) and GPQA Diamond gains 3 points to 89% ➤ Leading open weights model on GDPval-AA v2 and competitive with proprietary models: GLM-5.2 scores 1524 on GDPval-AA v2, ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro (max, 1328). This impressive result places GLM-5.2 in-line with proprietary models including GPT-5.5 (xhigh reasoning). GDPval-AA v2 builds on the original GDPval-AA by baselining Elo to human performance at 1000, introducing a rotating panel of frontier-model judges, and raising the turn limit from 100 to 250 for longer-horizon agent trajectories ➤ GLM-5.2 uses more output tokens per task than other leading open weights models: the model uses 43k output tokens per Intelligence Index task, up from GLM-5.1 (26k) and above MiniMax-M3 (24k), Kimi K2.6 (35k) and DeepSeek V4 Pro (max, 37k) ➤ On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05) Additional Model Details: ➤ License: MIT ➤ Size: 744B total parameters, 40B active parameters, equivalent to GLM-5.1 ➤ Context window: 1M tokens, up from 200K on GLM-5.1 ➤ Pricing: $1.4/$0.26/$4.4 per 1M input/cache hit/output tokens ➤ Availability: Alongside Z ai's first-party API, GLM-5.2 is available across third-party providers including @DeepInfra, @novita_labs, @nebiusai, @parasailnetwork , @SiliconFlowAI , @gmi_cloud , @Baseten and @FireworksAI_HQ