Jeff Smith

1.4K posts

@JeffSmith888

HFT automaton

Chicago, IL · Joined January 2019

109 Following · 229 Followers

Jeff Smith@JeffSmith888·10 Sep

@pshufb @molqtemlukni @SuchirKavi I can’t grasp the win here. Intermediate calcs still need an operand namespace (if not arch reg names), and OOO retire impls can update alias tables instead of value copies.

Wojciech Muła@pshufb·10 Sep

@molqtemlukni @SuchirKavi The key point is not to escape from the OoO machinery when not needed. The registers are anyway allocated in PRF. In my architecture, the selected PRF values are copied out only on commit, "temporaries" are discarded.

Wojciech Muła@pshufb·10 Sep

The more I learn about CPUs, the more obvious it is to me that the idea that every instruction has to update the architectural state is stupid. There should be a separate commit operation that forces an update of selected/whole architectural state. Then a CPU would execute "basic blocks".

Jeff Smith@JeffSmith888·2 Jul

@leonard_coder @lemire What is "huge"? Simple AES-CTR gets you literal zettabits of values before reaching the "hey maybe change keys" limits for cryptographic security. I think the debate's mostly settled that MT's state space size is beyond any true practical use's fundamental needs.
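A rough sanity check of the "literal zettabits" figure, as a minimal sketch. It assumes the "maybe change keys" limit is the birthday bound of 2^64 blocks for a 128-bit block cipher; that threshold is an assumption for illustration, not something stated in the thread, and stricter policies rekey much earlier. The function names are hypothetical.

```python
# Back-of-envelope keystream budget for AES in counter mode used as a PRNG.
# ASSUMPTION: rekey at the birthday bound of 2**64 blocks (not from the thread).

AES_BLOCK_BITS = 128  # the AES block size is fixed at 128 bits


def ctr_output_bits(max_blocks: int) -> int:
    """Total keystream bits produced before the block budget is exhausted."""
    return max_blocks * AES_BLOCK_BITS


def to_zettabits(bits: int) -> float:
    """Convert a bit count to zettabits (1 Zbit = 10**21 bits)."""
    return bits / 10**21


if __name__ == "__main__":
    birthday_budget = 2**64  # assumed rekey threshold, in blocks
    bits = ctr_output_bits(birthday_budget)
    print(f"{to_zettabits(bits):.2f} Zbit of output under one key")
```

Under that assumed budget the output is 2^71 bits, i.e. a small number of zettabits, which is at least consistent with the order of magnitude claimed in the post.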

Arnaud Carré@leonard_coder·1 Jul

@lemire probably not better regarding speed, but is there anything better regarding a huge period?

Daniel Lemire@lemire·1 Jul

Sebastiano is right. Do not use Mersenne Twister. It is unnecessarily slow and it is not any better than much faster alternatives.

Jeff Smith@JeffSmith888·16 Jun

@wmf @lemire I've heard (and personally claimed) that n*log(n) sounded a lot more reasonable, but never sub-linear!😆 Also pretty sure that Metcalfe himself admitted it was always just a flashy marketing claim.

Wes Felter@wmf·16 Jun

@lemire (Metcalfe’s Law is fake BTW. It's more like the square root.)

Daniel Lemire@lemire·15 Jun

Metcalfe’s Law against Brooks’ Law

Guido van Rossum, Python’s creator, recently said: “We have a huge community, but relatively few people, relatively speaking, are contributing meaningfully.” This highlights a paradox.

Software thrives on the network effect, or Metcalfe’s Law, where a system’s value scales with the square of its users. Linux excels because its vast user base fuels adoption, documentation, and compatibility everywhere.

But larger teams don’t build better software, often the reverse. Brooks’ Law, from Fred Brooks’ The Mythical Man-Month, shows that adding people increases communication overhead, slowing progress. The Pareto Principle (80/20 rule) also applies: a small minority drives most meaningful contributions.

Great software often stems from a single visionary or a small, cohesive team, not a crowd. The network effect applies primarily to users, not necessarily to creators.
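To make the disagreement in this thread concrete, here is a small sketch comparing the three value-scaling claims that come up: the square of the user count (Metcalfe's Law as stated in the post), n*log(n), and the square root. The function names and the base-2 logarithm are illustrative choices, not anything the posters specified.

```python
import math


def metcalfe(n: float) -> float:
    """Value proportional to the square of the user count."""
    return n * n


def n_log_n(n: float) -> float:
    """Value proportional to n*log(n); the log base only changes a constant."""
    return n * math.log2(n)


def sqrt_law(n: float) -> float:
    """Value proportional to the square root of the user count (sub-linear)."""
    return math.sqrt(n)


def growth_ratio(law, n: float, factor: float = 10) -> float:
    """How much the modelled value grows when the user base grows by `factor`."""
    return law(n * factor) / law(n)


if __name__ == "__main__":
    for law in (metcalfe, n_log_n, sqrt_law):
        print(f"{law.__name__:>9}: x{growth_ratio(law, 1_000):.2f} value for 10x users")
```

Going from 1,000 to 10,000 users multiplies the modelled value by 100 under the square law, by a bit over 13 under n*log(n), and by only about 3.2 under the square-root variant, which is why the choice of law matters so much to the argument.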

Jeff Smith@JeffSmith888·10 Jun

@tim_cook Sorry Tim, but whatever leadership proposed, led, or approved this design needs to be removed. Has Apple forgotten that mobile GUIs fundamentally require usability in adverse lighting? Forget aesthetics and computational waste, Liquid Glass will just suck in real-life usage.

Tim Cook@tim_cook·9 Jun

Expressive. Delightful. But still instantly familiar. Introducing our new software design with Liquid Glass.

Jeff Smith@JeffSmith888·8 Jun

@geofflangdale Is there a more rigorous alternative to “SIMD Instructions Considered Harmful”? The more prominently shared R-V/Berkeley/Patterson arguments have always seemed dangerously specious to me, but I’d rather not write off varlen due to wide audience arguments alone.

Geoff Langdale@geofflangdale·8 Jun

Still valid in 2025, imo, at least until someone posts some comprehensive comparative benchmarks.

Geoff Langdale@geofflangdale

@pkhuong

Jeff Smith@JeffSmith888·23 May

@AzorFrank I’ve been very pleased with my 6700 XT powering my home workstation, but the 9060’s specs fall short of being the upgrade I hoped for. No UHBR20 was disappointing but not the deal-breaker that only 3 display outputs is.

Frank Azor@AzorFrank·22 May

How many displays do you have connected to your dGPU card right now?

Jeff Smith@JeffSmith888·24 Mar

@SebAaltonen @_memerao C (unfortunately IMO) chose declaration-follows-use (yay clockwise spiral rule 🙄), and I don't feel great about formatting that belies the actual grammar, so I stick with int *ptr; only for disliking it less, not for liking it more.

Sebastian Aaltonen@SebAaltonen·24 Mar

@_memerao int* myIntPtr; I want to clearly separate the type and the variable name. The type is int*, the variable name is myIntPtr.

Bēmbɔɪ Bædkar@_memerao·23 Mar

To be or not to be is to grammarians, What 𝙞𝙣𝙩* 𝙭 or 𝙞𝙣𝙩 *𝙭 is to programmers. A matter of style, a cause for debate, Yet the compiler cares not, it still compiles fate. Are you a *𝘭𝘦𝘧𝘵 or a *𝘳𝘪𝘨𝘩𝘵 ?

Jeff Smith@JeffSmith888·22 Mar

@icculus I’d be happy with just opt-in decoupling of layout and de/init ordering in C++, even if it could make even less pretty field lists. Cache line packing & init order deps can sometimes oppose each other. IOW, there’re at least 3 orthogonal needs with RAII style langs.

Ryan C. Gordon@icculus·21 Mar

It would be neat if there was a way to signal to the C compiler: "reorder the fields in this struct for optimal alignment/padding in a way the whole system will agree to, but let me list them in whatever order I want in the header."
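The padding cost being discussed can be seen with a minimal sketch using Python's `ctypes`, which lays out `Structure` fields with the platform's native C alignment rules. The field names and types are made up for illustration, and sorting by descending alignment is just one simple heuristic, not what any compiler does today (C and C++ compilers keep declaration order).

```python
import ctypes

# Hypothetical field list, written in "readable" header order.
DECLARED_FIELDS = [
    ("flag_a", ctypes.c_uint8),
    ("counter", ctypes.c_uint64),
    ("flag_b", ctypes.c_uint8),
    ("length", ctypes.c_uint32),
]


def reorder_by_alignment(fields):
    """Return the fields sorted from strictest to loosest alignment.

    sorted() is stable, so fields with equal alignment keep their
    declared relative order.
    """
    return sorted(fields, key=lambda f: ctypes.alignment(f[1]), reverse=True)


class Declared(ctypes.Structure):
    """Layout in the order the programmer listed the fields."""
    _fields_ = DECLARED_FIELDS


class Reordered(ctypes.Structure):
    """Same fields, reordered to reduce interior padding."""
    _fields_ = reorder_by_alignment(DECLARED_FIELDS)


if __name__ == "__main__":
    print("declared :", ctypes.sizeof(Declared), "bytes")
    print("reordered:", ctypes.sizeof(Reordered), "bytes")
```

On ABIs where a 64-bit integer is 8-byte aligned, the declared order spends most of its padding after each one-byte flag, while the reordered version packs the small fields together at the end. The exact sizes depend on the platform ABI, which is part of why "in a way the whole system will agree to" is the hard bit of the request.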

Jeff Smith@JeffSmith888·18 Mar

@geofflangdale @davidtgoldblatt Not only is 64b frequently overkill now, but AFAIK nobody even goes beyond 57b VA support yet, and LAM etc. extensions to scavenge the high bits are getting increasing attention.
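For a sense of scale, and of what "scavenging the high bits" means, here is a generic sketch. It assumes a 57-bit virtual address held in a 64-bit pointer and treats all remaining upper bits as free for a tag. That is a simplification for illustration only: real extensions reserve specific bit ranges and treat the top bit specially, so this does not model any particular one. All names are hypothetical.

```python
POINTER_BITS = 64
VA_BITS = 57                        # widest VA size mentioned in the post
SPARE_BITS = POINTER_BITS - VA_BITS  # upper bits not needed for addressing
ADDR_MASK = (1 << VA_BITS) - 1


def address_space_bytes(va_bits: int) -> int:
    """Number of bytes addressable with `va_bits` of virtual address."""
    return 1 << va_bits


def tag_pointer(addr: int, tag: int) -> int:
    """Pack a small tag into the unused upper bits of a pointer value."""
    if addr >> VA_BITS:
        raise ValueError("address does not fit in the assumed VA width")
    if tag >> SPARE_BITS:
        raise ValueError("tag does not fit in the spare bits")
    return (tag << VA_BITS) | addr


def untag_pointer(pointer: int) -> tuple[int, int]:
    """Split a tagged pointer back into (address, tag)."""
    return pointer & ADDR_MASK, pointer >> VA_BITS


if __name__ == "__main__":
    pib = address_space_bytes(VA_BITS) // 2**50
    print(f"{VA_BITS}-bit VA covers {pib} PiB, leaving {SPARE_BITS} spare bits")
    tagged = tag_pointer(0x7FFF_DEAD_BEEF, tag=0b101)
    print(untag_pointer(tagged))
```

A 57-bit space is already 128 PiB, so the spare bits are more useful as metadata than as extra reach, which is the point being made about 64b pointers often being overkill.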

Geoff Langdale@geofflangdale·18 Mar

@davidtgoldblatt I'm stunned at how many people think going to 128b pointers will be a thing, given that 64b pointers are frequently overkill for a good proportion of processes already.

David Goldblatt@davidtgoldblatt·18 Mar

Poll: what is the largest address-space size you think a "regular" ISA will have over the next 50 years? For concreteness a "regular" ISA means one selling > 1M cell phones / laptops / servers, either CPU or GPU.

Jeff Smith@JeffSmith888·18 Mar

@davidtgoldblatt The implications of 128+b VAs on TLB tag matching and PT walks aren't fun. Likewise, 128+b scalar core datapaths (of little gain besides handling those addrs) would be expensive enough that reintroducing segmentation instead feels almost possible, if ever actually needed. 😂

Jeff Smith@JeffSmith888·28 Feb

@GawroskiT Anybody else think it's nuts to kneecap flow-through cooling performance to protect what looks to be a completely optional ARGB cable?

Tomasz Gawroński@GawroskiT·28 Feb

Sapphire nitro+ 9070xt has protections in place for the 12VHPWR connector. Not only that, they even covered the radiator FINS so the cable won't get damaged! And on top of that, a magnetic back cover to hide the cable. Superb engineering.

Jeff Smith@JeffSmith888·17 Dec

@wassickt I'm essentially always over-optimistic on these things, but I was really hoping the design would be F2F (i.e., SRAM TSVs for CCD power/GMI IO only) and maybe even W2W assembly. Thanks again for the hard work! I hope this helps pressure AMD to be more generous at ISSCC in Feb. 😄

Tom Wassick@wassickt·17 Dec

@JeffSmith888 I suspect that the smaller CCD than the SRAM required the oxide fill at the edge, and it had to be thin to do that effectively in the processing. Of course that raises the question on why not just shrink the SRAM a bit..

Tom Wassick@wassickt·17 Dec

9800X3D XSec Highlights:
o Both CCD and SRAM are thinned (sub-10 um), so thick "dummy Si" is oxide-oxide bonded to the stack
o SRAM Si area is larger than the CCD -- there's a 50 um "oxide edge" for the CCD
o As with the 2nd gen, the BPV's are terminated on the Al of the CCD

Jeff Smith@JeffSmith888·17 Dec

@wassickt To clarify, is the CCD mounted BEOL down (towards the SRAM die) or up (flipped and using TSVs of its own)? And if it’s the former, any guesses about why the CCD wouldn’t have just been thinned by ~45um to skip the dummy 750um cap altogether?

Tom Wassick@wassickt·17 Dec

o With BEOL's included, the die stack is about 40-45 um thick
o Total stack is close to 800, so remaining 750 is the dummy Si over the top

Jeff Smith@JeffSmith888·10 Nov

@CDemerjian This sounds like declaring you’ll never use your boats to go back, but being explicitly positive about them not being burned. Maybe too much of a half measure?

Charlie Demerjian@CDemerjian·10 Nov

Just made an account on another social network with a color and a thing above your head. Same handle but I don't want to put the name in until I know there isn't retribution for posting a competitor to twitter. Join me if you want to interact.

Jeff Smith@JeffSmith888·10 Nov

@hkultala @SebAaltonen Arm (likely wisely) forbade anyone besides Apple from doing modal TSO toggles, but LDAR/STLR moves did get added. The real issue isn't implicit/explicit fencing but perf. OOO cores able to handle floods of translated barriers are just a higher minimum bar than Arm wants.

Heikki Kultala@hkultala·9 Nov

@SebAaltonen For efficient emulation of x86, a compatible memory consistency mode (TSO?) is needed. Apple has this as unofficial extension, ARM needs to make it official and Microsoft needs to add support for it in the binary compiler of the emulator. Before that, bad x86 emulation perf

Sebastian Aaltonen@SebAaltonen·9 Nov

Nvidia must depend on Intel CPUs for laptops and this is risky. Intel is falling apart. Nvidia's own ARM SoC will solve this problem. Just like Apple solved their problems a few years ago. Nvidia never had x64 license, but now Windows supports ARM and they have their own ARM CPU.

Jeff Smith@JeffSmith888·7 Nov

@SebAaltonen @Gnattuoc Is this the product that will physically murder the UserBenchmark guy?

Sebastian Aaltonen@SebAaltonen·6 Nov

@Gnattuoc 9950X3D would have none of these problems if it had 3D cache on both dies. Their new 3D cache is under the die, so it doesn't limit clock rate. They don't need to compromise anymore. They can now build the ultimate CPU that beats everything in every test.

Sebastian Aaltonen@SebAaltonen·6 Nov

Last week I said that M4 Max will finally be slightly faster than Intel and AMD consumer desktop chips in both MT and ST. Today 9800X3D with new 3D cache design under the chiplet shows such big gains that 9950X3D with 3D cache under both chiplets would still beat M4 Max.

Jeff Smith@JeffSmith888·6 Nov

@SebAaltonen @Bad_AI_ Bro we’re down to half a lane now since dual-issue VLIW2-like fp32 became a thing.

Sebastian Aaltonen@SebAaltonen·5 Nov

@Bad_AI_ Nvidia naming is horrible. They call each SIMD lane a CUDA core. SIMD lanes can't execute independent instructions. They are not cores. CUDA core is a marketing name. They just wanted a big number. Compare Nvidia SMs to Apple cores. SM is the real shader core in Nvidia GPUs.

Sebastian Aaltonen@SebAaltonen·5 Nov

You don't need 80 core GPU to enter mainstream gaming market. 40 core GPU in M4 Max is already enough to compete against PS5 Pro. The problem is the price. M4 Max laptop with 40 core GPU + 2TB SSD costs 5000€. Mainstream gamers can't afford that. Ultra is even less affordable.

Wccftech@wccftech

M4 Ultra designed for the Mac Pro has been hinted to feature up to an 80-Core GPU, as Apple is said to be in a better position to enter the mainstream gaming market wccftech.com/m4-ultra-for-t…

Jeff Smith@JeffSmith888·3 Nov

@SebAaltonen @Sebasti66855537 E.g., not needing to throw more transcendental EUs in a hypothetical segregated AI block for activation functions, and not having to worry about balancing vector vs matrix units on differently scaled up chips.

Jeff Smith@JeffSmith888·3 Nov

@SebAaltonen @Sebasti66855537 What are the rough size scales of AI passes in rendering work graphs? Although trips off and back to a GPU aren’t tolerable in realtime gfx, it’s not clear to me that ALU/SM-level serves as much a shader latency need as a chip designer convenience.

Sebastian Aaltonen@SebAaltonen·3 Nov

I am glad that Sony is pushing AMD. They wanted tensor cores and fast RT in PS5 Pro. AMD is unifying RDNA/CDNA as UDNA. This means we get full CDNA tensor cores also in consumer Radeons soon. Nvidia did the same already in Turing...

Osvaldo Pinali Doederlein@opinali

NEW BLOG! "The PS5 Pro, RDNA 4 & FSR 4.0". This one is two things at the same time: a well-edited overview of lots of recent news & explainers, and my speculative instincts gone wild. No, you can't have the useful and correct part without the crazy part. link.medium.com/beuSHErldOb

Jeff Smith@JeffSmith888·1 Nov

@Darth_Goldsmith @wassickt @aschilling @GamersNexus Looking at device characteristics alone (assuming process DTC/L3 SRAM compatibility), F2F WoW seems ideal for CCD-on-L3D from a lot of thermal, mechanical, and power delivery standpoints. But yeah that's a pretty big "if" that TSMC didn't fully pipeclean with Graphcore.

Jeff Smith@JeffSmith888·1 Nov

@Darth_Goldsmith @wassickt @aschilling @GamersNexus Yeah, I bugged some people at VLSI in May about how viable a F2F MI300 design would have been and got dutifully noncommittal answers about design/validation and assembly tooling readiness as well as the obvious additional base die power TSV needs.

Andreas Schilling 🇺🇦@aschilling·31 Oct

In some comments AMD did provide some more details to @GamersNexus, which I have summarized / put together.

Andreas Schilling 🇺🇦@aschilling

In his announcement @jackhuynh is confirming the SRAM chip is now sitting below the CCD for the Ryzen 7 9800X3D. By the looks of it, the SRAM chip also has the same size as the CCD. But it then also does need to interface signals and power for the CCD, right?
