Artificially Intelligent

1.3K posts

@ArtiIntelligent

Insanity is doing the same thing over and over and expecting different results...

the Milky Way · Joined February 2025

6.9K Following · 315 Followers

Artificially Intelligent@ArtiIntelligent·1h

@kermankohli the sparks, they put out a lot of heat! the fans will blow out air against your wall

kerman kohli@kermankohli·1h

@ArtiIntelligent the switch? ideally i’d like to rack mount all of this but for reasons i have make-shift infra rn. i promise i’m usually much better with this stuff

kerman kohli@kermankohli·1d

why see the leaning tower of pisa when you can see the leaning power of dgx sparks

229

13.9K

Artificially Intelligent@ArtiIntelligent·4h

@ArkadiiBessonov need a version for nvFP4 :)

Arkadii@ArkadiiBessonov·1d

Full write-up — every recipe, every matmul, drawn out: arkadii.be/blog/fp8-quant…

18.5K

Artificially Intelligent reposted

Arkadii@ArkadiiBessonov·1d

Three main ways to do FP8 in LLM pretraining, and they differ in mainly one thing: how the scale is attached. per-tensor vs blockwise vs MXFP8.

Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, activations, gradients). Each role wants its own scale layout, and that's where all the complexity lives.

The three recipes differ in how the scale is attached (granularity, dtype, layout):
- Per-tensor: one scale for the whole tensor. Simplest, least robust to outliers.
- Blockwise: 1×128 / 128×128 tiles, FP32 scales. The DeepSeek-V3 style.
- MXFP8: 1×32 blocks + E8M0 scale. Native on Blackwell.

One rule ties it all together: the scale must stay constant along the matmul's contracted dimension. That single constraint derives every tile geometry above; nothing here is arbitrary.

I drew every layout out, per recipe and per matmul, so the geometry is concrete instead of hand-wavy. Full walkthrough in my blogpost (link in comments)!
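The three scale granularities named in the post can be compared on a toy matrix. The sketch below is only an illustration under stated assumptions, not the author's code or any library's API: the helper names are hypothetical, the last axis is assumed to be the contracted dimension, NumPy has no FP8 dtype so only the scales are computed, and the E8M0 scale is approximated by rounding the exponent up to the next power of two so that scaled values stay within the E4M3 range.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_scale(x):
    # One FP32 scale for the whole tensor.
    return np.array(np.abs(x).max() / E4M3_MAX, dtype=np.float32)

def blockwise_scale(x, block=128):
    # One FP32 scale per 1 x `block` tile along the last axis
    # (assumed here to be the contracted dimension).
    rows, cols = x.shape
    tiles = np.abs(x).reshape(rows, cols // block, block)
    return (tiles.max(axis=-1) / E4M3_MAX).astype(np.float32)

def mx_scale(x, block=32):
    # One power-of-two scale per 1 x 32 block, imitating an exponent-only
    # (E8M0-style) scale. Assumption: round the exponent up so that the
    # scaled values never exceed E4M3_MAX.
    rows, cols = x.shape
    tiles = np.abs(x).reshape(rows, cols // block, block)
    amax = np.maximum(tiles.max(axis=-1), np.finfo(np.float32).tiny)
    exponent = np.ceil(np.log2(amax / E4M3_MAX)).astype(np.int32)
    return np.ldexp(np.float32(1.0), exponent).astype(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)).astype(np.float32)
x[0, 3] = 1000.0  # a single outlier in the first row

s_tensor = per_tensor_scale(x)  # shape ()
s_block = blockwise_scale(x)    # shape (4, 2)
s_mx = mx_scale(x)              # shape (4, 8)
```

With the per-tensor recipe the single outlier sets the scale for every element, while in the blockwise and 1×32 layouts only the tile that contains the outlier gets a large scale. That is one way to read the post's remark that per-tensor scaling is the least robust to outliers.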

146

30.6K

Artificially Intelligent@ArtiIntelligent·4h

@VukRosic99 can we download the PDF?

Vuk Rosić 武克@VukRosic99·23h

FlashAttention-4 just changed the game!

The problem: Blackwell scaled the matrix-multiply units way up, but the units that move shared memory and compute exponentials barely moved. So the old attention kernel now spends its time waiting on the parts that didn't get faster.

FlashAttention-4 rebalances around that with 3 tricks:
1. Overlap the matmul and the softmax so neither waits.
2. Compute the exponential in software, not on the slow dedicated unit.
3. Skip the rescaling you don't need.

I made a short visual breakdown - one diagram per trick. Swipe through. 👇

paper - arxiv.org/abs/2603.05451

Today's live: build an LLM from one prompt, then setup an autonomous research loop. Join 👉 skool.com/become-ai-rese…
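The third trick, skipping unneeded rescaling, can be illustrated with a block-wise (online) softmax in plain NumPy. This is a rough sketch, not the FlashAttention-4 kernel: the function name and block size are assumptions, and it skips the rescale only when the running maximum is unchanged, which keeps the result exact. The actual kernel may use a different skip condition.

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=4):
    """Softmax(scores) @ values, computed block by block (hypothetical helper)."""
    m = -np.inf                        # running maximum of the scores seen so far
    l = 0.0                            # running softmax denominator
    acc = np.zeros(values.shape[1])    # running sum of exp(score - m) * value
    rescales = 0
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        if m_new > m:
            # The running max changed, so earlier partial results must be rescaled.
            correction = np.exp(m - m_new)  # exp(-inf) == 0 on the first block
            l *= correction
            acc *= correction
            m = m_new
            rescales += 1
        # Otherwise the rescale is skipped: the correction factor would be exactly 1.
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ v
    return acc / l, rescales

rng = np.random.default_rng(0)
scores = rng.standard_normal(16)
scores[0] = 10.0                       # the largest score sits in the first block
values = rng.standard_normal((16, 3))
out, rescales = online_softmax_weighted_sum(scores, values)
```

Because the largest score appears in the first block, the running maximum never changes afterwards and only one rescale is performed across the four blocks.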

114

7.6K

Artificially Intelligent@ArtiIntelligent·4h

@MiaAI_lab @NVIDIAAI @rafaelcaricio why only 256k context? the prior version was running at 1M, with 4 sequences...

Mia@MiaAI_lab·5h

DeepSeek v4 Flash DSpark running on 2x @NVIDIAAI DGX Sparks at 60 tok/s. ~50% improvement from the previous recipe! Context set to 256k conservatively — ~3 concurrent sessions. Thanks to @rafaelcaricio for making this happen 👇 github.com/MiaAI-Lab/Deep…

153

8.3K

Artificially Intelligent@ArtiIntelligent·6h

@kermankohli that will get super-hot, bad idea!

kerman kohli@kermankohli·7h

@ArtiIntelligent still getting it setup. waiting on qsfp28 cables! 16 port 100g switch.

Artificially Intelligent reposted

Dmytro Dzhulgakov@dzhulgakov·1d

DSpark from @deepseek_ai ingeniously integrates many speculative decoding ideas to achieve 1.5x to 5x higher throughput in a real production system. Let's understand it with 10 ideas, starting from the very basics 🧵

793

211.7K

Artificially Intelligent reposted

Bill Ackman@BillAckman·14h

Imagine your family worked for a generation to save enough money to buy a brownstone occupied with rent stabilized tenants on the Upper West Side. The family financed the purchase with a mortgage from a bank based on the premise that rents and cash flow would at least keep pace with inflation so you could pay interest and principal on the mortgage and hopefully have some cash flow left as a return on your investment. While you had rent stabilized tenants, you were led to believe that the NYC Rent Guidelines Board would be required to adjudicate rental increases each year by taking a measure of the inflation of costs to own and operate a building and setting rental increases appropriately. You believed the RGB would do its job as the board is comprised of two representatives each for landlords and tenants and five independent representatives that represent the general public. Now, a new mayor @NYCMayor Mamdani is elected on the promise of freezing rents. There are about two million rent stabilized renters that benefit if rents are frozen so by promising frozen rents the new candidate for mayor buys votes and wins the election. The new mayor achieves his objective by stacking the RGB with directors who do not follow their obligations and simply vote for a rent freeze as a preordained conclusion as evidenced by the statements of an RGB director who resigned in protest for this very reason. Meanwhile, inflation in NYC is rampant in utilities, real estate taxes, insurance, repairs and maintenance, etc. and now your rents are frozen. Real estate is a high operating leverage business which means that frozen rents and inflating expenses will cause property cash flows to plummet and your after debt service cash flow to go negative. I expect therefore there will be hundreds if not thousands of small NYC property owners who are now or will shortly be underwater on their mortgages, and without any cash flow to maintain their assets. 
If you remember the images of the South Bronx burning in the mid 1970s, you can viscerally understand what is happening to small NYC real estate owners. While the rent freeze appears to be short-term good news (long term it will lead to poorly maintained apartments) for 2 million NYC renters, it is bad news for the 2 million or more renters in the 1 million market rate apartments in the City because a landlord-hostile market is not likely to add meaningfully more supply and market rents will likely continue to escalate at a high rate. All of this seems quite unfair and wrong unless I am missing something? Why am I wrong? For disclosure: I do not own any NYC rental apartments.

Paula Pant@AffordAnything

There's no freeze on property tax. There's no freeze on the wages paid to landscapers, plumbers, electricians, drywallers, flooring installers. There's no freeze on the cost of lumber, copper, baseboard, quarter rounds, flashing, siding, window treatments. There's no freeze on the wages paid to janitors or porters. There's no freeze on utilities -- on electric, gas, water, sewer (building-paid utilities in hallways, lobbies, maintenance corridors; most buildings pay water and sewer for tenants). There are currently 57,421 units sitting vacant in NYC because it's more cost-effective to leave them empty than it is to rent them out. If you're wondering: "How could that be possible? Wouldn't making anything be better than making nothing?" -- the answer is no, because of the 2019 Housing Stability and Tenant Protection Act. The HSTPA mandated a certain level of renovation for a vacant unit, but did not allow landlords to raise the rent enough to be able to recoup those costs. If a long-term tenant moves out after decades, the apartment often requires $50,000 to $100,000 in lead abatement, new wiring, plumbing, and structural renovations, but the law heavily restricts how much of that cost can be passed to the next tenant. The HSTPA eliminated the "vacancy bonus" (which allowed automatic 20% rent increases when a tenant left) and heavily capped Individual Apartment Improvements (IAIs). This means landlords who want a renovation loan would be rejected by a bank, because the landlord would not be able to show that they could repay that loan. Landlords who pay out-of-pocket would end up losing money, underperforming even what they could get by putting their money in a U.S. Treasury or gov't bond. Therefore, it's more cost-effective to just leave the unit vacant. That's why we have 57,421 vacant units across New York right now. That number is about to get much worse.

2.1K

2.5K

15.9K

2.1M

Artificially Intelligent@ArtiIntelligent·1d

@Srasgon technically yes, the purchasing power has changed! Previously what you could purchase for $50 now costs ~$75 ;)

1.3K

Stacy Rasgon@Srasgon·1d

This memory shit is out of control, Apple now charging $75 for a $50 gift card

582

38.5K

Artificially Intelligent reposted

emozilla@theemozilla·2d

I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened.

107

4.2K

1.3M

Artificially Intelligent reposted

Lilian Weng@lilianweng·3d

A super long overdue (3+ years?) post on scaling laws. Compute is expensive. Scaling laws are a way to help us reason about the optimal compute allocation between data and model size before committing to a large run. The post covers what scaling laws predict, how compute-optimal allocation works, why Kaplan et al. and Chinchilla disagree, and how data limits + fitting details make extrapolation tricky. lilianweng.github.io/posts/2026-06-…

568

4.5K

408.9K

Artificially Intelligent reposted

M@MissMi1973·2d

Anthropic’s fear campaign around Mythos has almost single-handedly slowed the normal release of GPT-5.6, while also making government approval of frontier model access the new normal for US AI labs. It’s not hard to foresee that this will inevitably lead to: 1. Frontier models will release slower. The days when the industry was shipping new models every month are over. 2. Frontier labs will be compelled to build “will the government permit release” into their training process as a binding constraint. 3. A caste-like pattern of access will take hold across the entire industry. This is precisely why fear-based marketing and geopolitical posturing in the tech sector has always been a dangerous game to play.

Andrew Curran@AndrewCurran_

The US Government has requested a slow staggered rollout of GPT-5.6, and OpenAI has agreed. During this phase the government will approve each user individually. This will probably be the norm for all frontier models from all labs from now on.

675

144.5K

Artificially Intelligent reposted

Vik Paruchuri@VikParuchuri·2d

In their OCR 4 launch this week, Mistral shared a significantly lower score for Chandra 2 than you get from our repo or by running our public code. They also omitted Infinity Parser, which reports 87.6%, from their olmocr comparison.

285

47.1K

Artificially Intelligent reposted

Lars@larsmoravy·3d

If you need more reasons to tell your friends why to buy a Tesla, JD Power has a few. Tesla was 'unofficially' ranked 3rd IQS (initial quality), 1st in EVX (EV experience) jdpower.com/business/press… jdpower.com/business/press…

110

288

2.3K

215.7K

Artificially Intelligent reposted

Mark Gurman@markgurman·3d

Apple’s plans are fluid given the component supply chain right now, but it aims to launch the M6 this year, the M7 by the middle of next year, the M7 Pro and M7 Max in late 2027 and the M7 Ultra in 2028.

Mark Gurman@markgurman

NEW: Apple has shaken up its Mac chip strategy. It plans to launch a base M6 chip and then jump ahead to the M7, M7 Pro, M7 Max and M7 Ultra, skipping higher-end M6 processors. bloomberg.com/news/articles/…

601

97.4K

Artificially Intelligent reposted

Hikari∣LocalLLM⚡@Hikari_07_jp·4d

I got DeepSeek-V4-Flash MTP speculative decoding actually working on 2× RTX PRO 6000: +38% single-stream throughput. It was declared “broken on SM120”. The kernels weren’t the problem. It was one mis-routed quantization format in the loader. ←on 45tok/s off 98tok/s→

138

10.3K

Artificially Intelligent reposted

Zhihu Frontier@ZhihuFrontier·4d

Why Would GLM-5.2 Move Away From GRPO? 🌟Insights from Zhihu contributor 九老师

TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.

The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place? If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.

GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group average as a baseline. That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.

But there is a tradeoff.⚖️ PPO uses a learned value function, or critic. This critic is expensive and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always trying to follow a moving target. That can introduce bias. GRPO avoids that by using an up-to-date sampled baseline. It is closer to low-bias, but it tends to have higher variance.

For early LLM RL tasks, that tradeoff made sense:
• Rollouts were short
• Final rewards were clear
• Memory savings mattered a lot
• Multiple samples per prompt were manageable
• Math/code tasks were relatively easy to verify

That is why GRPO worked so well for many short, verifiable reasoning tasks.

But long-horizon agentic tasks change the game. 🎮 A long agent task can look much more like a game environment:
• Many steps
• Tool calls
• Partial progress
• Delayed failure
• Noisy observations
• Intermediate rewards
• Wrong action penalties
• Context compression
• Different paths to the same final answer

This is where GRPO starts to struggle. The biggest issue is credit assignment. In GRPO, the final reward is applied broadly across the whole trajectory. If a task succeeds, many tokens get rewarded. If it fails, many tokens get punished. But in a long task, that is too coarse. Maybe the first half was bad, but the final recovery was good. Maybe one tool call at step 30 caused failure at step 100. Maybe two successful trajectories are not really comparable because one used 4K tokens and another used 200K tokens with heavy tool use and context compression. GRPO sees the final outcome. It does not naturally know which step actually mattered. That creates high variance.

In short tasks, group comparison works well. In long tasks, group sampling can collapse into two bad cases:
1. All samples fail. The whole expensive rollout gives almost no useful training signal.
2. Only one sample succeeds. That single success may be luck, but GRPO may treat it as a strong positive signal and over-reward the trajectory.

Both are dangerous for long agentic training.

This is where PPO’s critic becomes valuable again. A value model can learn expected value under noisy states. It can provide denser feedback before the full rollout ends. It is more expensive, but it helps with long-horizon credit assignment.

So the author’s view is: GRPO is not being rejected because it was wrong. It is being outgrown by the task format. For short, deterministic, verifiable tasks, GRPO remains strong. For long, noisy, tool-heavy agentic tasks, PPO-style value modeling may simply be the better fit.

The “compaction problem” mentioned around long contexts is likely more of a symptom. The deeper issue is that GRPO’s weaknesses become costly when trajectories are long and states keep changing.

Could GRPO still work? Yes, if paired with a strong Process Reward Model. The author points out that DeepSeek MathV2 uses this direction. Process-level signals can help fix GRPO’s sparse-reward weakness. But without that, returning to PPO makes sense.

🎯The bigger takeaway: GRPO did away with the value model. PPO brings it back. GRPO’s main advantage was efficiency. It removed the critic and saved resources. But for long-horizon agentic tasks, the critic’s ability to generalize and assign credit may be worth the cost again.

In the Agent era, RL for LLMs is becoming less like solving a short math problem and more like training an agent to play a long, noisy game. And for that world, value models may still be the soul of RL.

🔗Full Reading (CN): zhihu.com/question/20521…
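The sampled-baseline idea and the two failure cases described in the post can be checked numerically. This is a minimal illustration, not GLM's or DeepSeek's training code: the function name, the 0/1 rewards, and the group size of 8 are assumptions, and dividing by the group standard deviation follows the commonly published GRPO formulation rather than anything stated in the post.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled response relative to its own group (hypothetical helper)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()      # the group average stands in for a learned critic
    spread = rewards.std() + eps   # eps avoids division by zero when all rewards match
    return (rewards - baseline) / spread

# Case 1: every rollout in the group fails.
all_fail = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 0])

# Case 2: exactly one rollout in the group succeeds.
one_success = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])
```

In the first case every advantage is zero, so the group contributes no learning signal. In the second, the single success receives an advantage of about 2.6 while every other sample is pushed slightly negative, regardless of whether that success was skill or luck.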