Jeff Smith

1.4K posts

@JeffSmith888

HFT automaton

Chicago, IL · Joined January 2019
109 Following · 229 Followers
Jeff Smith@JeffSmith888·
@pshufb @molqtemlukni @SuchirKavi I can’t grasp the win here. Intermediate calcs still need an operand namespace (if not arch reg names), and OOO retire impls can update alias tables instead of value copies.
Wojciech Muła@pshufb·
@molqtemlukni @SuchirKavi The key point is not to escape from the OoO machinery when not needed. The registers are anyway allocated in PRF. In my architecture, the selected PRF values are copied out only on commit, "temporaries" are discarded.
Wojciech Muła@pshufb·
The more I learn about CPUs, the more obvious it is to me that the idea that every instruction has to update the architectural state is stupid. There should be a separate commit operation that forces an update of selected/whole architectural state. Then a CPU would execute "basic blocks".
Jeff Smith@JeffSmith888·
@leonard_coder @lemire What is "huge"? Simple AES-CTR gets you literal zettabits of values before reaching the "hey maybe change keys" limits for cryptographic security. I think the debate's mostly settled that MT's state space size is beyond any true practical use's fundamental needs.
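A quick arithmetic check of the "zettabits" figure, as a minimal C sketch. It assumes a rekey threshold of 2^64 blocks per key, a commonly quoted birthday-style bound for a 128-bit block cipher; the tweet does not say which limit it has in mind, and the function names here are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* Bits of AES-CTR keystream produced before an assumed rekey threshold.
 * log2_blocks is log2 of the number of 128-bit blocks allowed per key.
 * ASSUMPTION: callers pass 64 as a birthday-style bound. */
static double ctr_output_bits(int log2_blocks)
{
    assert(log2_blocks >= 0 && log2_blocks <= 128);
    double bits = 128.0;                /* one AES block */
    for (int i = 0; i < log2_blocks; i++)
        bits *= 2.0;                    /* exact: only the exponent changes */
    return bits;
}

/* 1 zettabit = 10^21 bits (SI prefix). */
static double bits_to_zettabits(double bits)
{
    return bits / 1e21;
}

/* Prints the per-key budget under the assumed 2^64-block threshold. */
static void print_ctr_budget(void)
{
    double bits = ctr_output_bits(64);
    printf("AES-CTR at 2^64 blocks: %.3g bits (%.2f zettabits)\n",
           bits, bits_to_zettabits(bits));
}
```

Under that assumption the budget is 128 × 2^64 = 2^71 bits, a little over two zettabits per key, which is the scale the tweet refers to.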
Arnaud Carré@leonard_coder·
@lemire probably not better regarding speed, but is there any better regarding huge period?
Daniel Lemire@lemire·
Sebastiano is right. Do not use Mersenne Twister. It is unnecessarily slow and it is not any better than much faster alternatives.
Jeff Smith@JeffSmith888·
@wmf @lemire I've heard (and personally claimed) that n*log(n) sounded a lot more reasonable, but never sub-linear!😆 Also pretty sure that Metcalfe himself admitted it was always just a flashy marketing claim.
Wes Felter@wmf·
@lemire (Metcalfe’s Law is fake BTW. It's more like the square root.)
Daniel Lemire@lemire·
Metcalfe’s Law against Brooks’ Law
Guido van Rossum, Python’s creator, recently said: “We have a huge community, but relatively few people, relatively speaking, are contributing meaningfully.” This highlights a paradox.
Software thrives on the network effect, or Metcalfe’s Law, where a system’s value scales with the square of its users. Linux excels because its vast user base fuels adoption, documentation, and compatibility everywhere.
But larger teams don’t build better software; often the reverse. Brooks’ Law, from Fred Brooks’ The Mythical Man-Month, shows that adding people increases communication overhead, slowing progress. The Pareto Principle (80/20 rule) also applies: a small minority drives most meaningful contributions. Great software often stems from a single visionary or a small, cohesive team, not a crowd. The network effect applies primarily to users, not necessarily to creators.
Jeff Smith@JeffSmith888·
@tim_cook Sorry Tim, but whatever leadership proposed, led, or approved of this design needs to be removed. Has Apple forgotten that mobile GUIs fundamentally require usability in adverse lighting? Forget aesthetics and computational waste, Liquid Glass will just suck in real-life usage.
Tim Cook@tim_cook·
Expressive. Delightful. But still instantly familiar. Introducing our new software design with Liquid Glass.
Jeff Smith@JeffSmith888·
@geofflangdale Is there a more rigorous alternative to “SIMD Instructions Considered Harmful”? The more prominently shared R-V/Berkeley/Patterson arguments have always seemed dangerously specious to me, but I’d rather not write off varlen due to wide audience arguments alone.
Jeff Smith@JeffSmith888·
@AzorFrank I’ve been very pleased with my 6700 XT powering my home workstation, but the 9060’s specs fall short of being the upgrade I hoped for. No UHBR20 was disappointing but not the deal-breaker that only 3 display outputs is.
Frank Azor@AzorFrank·
How many displays do you have connected to your dGPU card right now?
Jeff Smith@JeffSmith888·
@SebAaltonen @_memerao C (unfortunately IMO) chose declaration-follows-use (yay clockwise spiral rule 🙄), and I don't feel great about formatting that belies the actual grammar, so I stick with int *ptr; only for disliking it less, not for liking it more.
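The "declaration follows use" point can be illustrated with a small C sketch. The variable and function names are made up for the example; the grammar behaviour it shows is standard C.

```c
#include <assert.h>

static int twice(int x) { return 2 * x; }

/* Returns 1 if the declarations below have the types the comments claim. */
static int check_declarators(void)
{
    /* The '*' binds to the declarator, not to the type specifier:
     * ptr is a pointer to int, not_a_ptr is a plain int. */
    int *ptr, not_a_ptr;

    /* Declaration follows use: the expression fn_table[i](x) has type int,
     * so fn_table is an array of 3 pointers to function(int) returning int. */
    int (*fn_table[3])(int) = { twice, twice, twice };

    int value = 21;
    ptr = &value;
    not_a_ptr = *ptr;   /* "*ptr is an int", exactly as the declaration reads */

    assert(ptr != 0);
    return sizeof not_a_ptr == sizeof(int)
        && sizeof ptr == sizeof(int *)
        && fn_table[0](not_a_ptr) == 42;
}
```

Writing `int* ptr, not_a_ptr;` would declare exactly the same two objects, which is why the spacing in `int *ptr` matches the grammar more closely even if it reads less naturally as "type, then name".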
Sebastian Aaltonen@SebAaltonen·
@_memerao int* myIntPtr; I want to clearly separate the type and the variable name. The type is int*, the variable name is myIntPtr.
Bēmbɔɪ Bædkar@_memerao·
To be or not to be is to grammarians, What 𝙞𝙣𝙩* 𝙭 or 𝙞𝙣𝙩 *𝙭 is to programmers. A matter of style, a cause for debate, Yet the compiler cares not, it still compiles fate. Are you a *𝘭𝘦𝘧𝘵 or a *𝘳𝘪𝘨𝘩𝘵 ?
Jeff Smith@JeffSmith888·
@icculus I’d be happy with just opt-in decoupling of layout and de/init ordering in C++, even if it could make even less pretty field lists. Cache line packing & init order deps can sometimes oppose each other. IOW, there’re at least 3 orthogonal needs with RAII style langs.
Ryan C. Gordon@icculus·
It would be neat if there was a way to signal to the C compiler: "reorder the fields in this struct for optimal alignment/padding in a way the whole system will agree to, but let me list them in whatever order I want in the header."
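For context, C requires struct members to be laid out in declaration order, so today the reordering has to be done by hand. A minimal sketch of what is at stake, using hypothetical field names; the exact sizes depend on the ABI, so the code only measures them rather than assuming numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fields in "readable" order: small fields next to large ones force padding. */
struct listed_order {
    uint8_t  kind;
    uint64_t id;
    uint8_t  flags;
    uint32_t count;
};

/* Same fields, hand-sorted by descending alignment requirement. */
struct sorted_order {
    uint64_t id;
    uint32_t count;
    uint8_t  kind;
    uint8_t  flags;
};

/* Bytes saved by the hand-sorted layout on the current ABI. */
static size_t padding_saved(void)
{
    assert(sizeof(struct sorted_order) <= sizeof(struct listed_order));
    return sizeof(struct listed_order) - sizeof(struct sorted_order);
}

/* Prints both sizes and the offset of 'id' in each layout. */
static void print_layouts(void)
{
    printf("listed: size %zu, id at offset %zu\n",
           sizeof(struct listed_order), offsetof(struct listed_order, id));
    printf("sorted: size %zu, id at offset %zu\n",
           sizeof(struct sorted_order), offsetof(struct sorted_order, id));
    printf("saved:  %zu bytes\n", padding_saved());
}
```

Calling `print_layouts()` shows how much padding the listing order costs on a given target; the wish in the tweet is for the compiler to pick the second layout while the header keeps the first.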
Jeff Smith@JeffSmith888·
@geofflangdale @davidtgoldblatt Not only is 64b frequently overkill now, but AFAIK nobody even goes beyond 57b VA support yet, and LAM etc. extensions to scavenge the high bits are getting increasing attention.
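To make "scavenging the high bits" concrete, here is a software-only C sketch of pointer tagging above a 57-bit virtual address. The bit positions (a 6-bit tag in bits 62:57, bit 63 left alone) are an assumption modelled on a 57-bit layout, and the helper names are hypothetical; this is not the LAM interface itself. Hardware extensions of this kind let loads and stores ignore such bits, whereas this sketch has to strip them manually before use.

```c
#include <assert.h>
#include <stdint.h>

#define VA_BITS   57u                      /* ASSUMPTION: 57-bit virtual addresses */
#define TAG_BITS  6u                       /* ASSUMPTION: tag lives in bits 62:57 */
#define TAG_SHIFT VA_BITS
#define TAG_MASK  ((((uint64_t)1 << TAG_BITS) - 1) << TAG_SHIFT)

/* Stores a small tag in the unused high bits of a low-half address. */
static uint64_t tag_address(uint64_t addr, unsigned tag)
{
    assert(tag < (1u << TAG_BITS));
    assert((addr & TAG_MASK) == 0);        /* high bits must be free to begin with */
    return addr | ((uint64_t)tag << TAG_SHIFT);
}

/* Recovers the tag from a tagged address. */
static unsigned address_tag(uint64_t tagged)
{
    return (unsigned)((tagged & TAG_MASK) >> TAG_SHIFT);
}

/* Clears the tag so the value is a plain address again. */
static uint64_t strip_tag(uint64_t tagged)
{
    return tagged & ~TAG_MASK;
}
```

A 128-bit pointer would make this kind of trick unnecessary, but the tweet's point is the opposite: even 64 bits leave enough slack that the spare bits are being put to other uses.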
Geoff Langdale@geofflangdale·
@davidtgoldblatt I'm stunned at how many people think going to 128b pointers will be a thing, given that 64b pointers are frequently overkill for a good proportion of processes already.
David Goldblatt@davidtgoldblatt·
Poll: what is the largest address-space size you think a "regular" ISA will have over the next 50 years? For concreteness a "regular" ISA means one selling > 1M cell phones / laptops / servers, either CPU or GPU.
Jeff Smith@JeffSmith888·
@davidtgoldblatt The implications of 128+b VAs on TLB tag matching and PT walks aren't fun. Likewise, 128+b scalar core datapaths (of little gain besides handling those addrs) would be expensive enough that reintroducing segmentation instead feels almost possible, if ever actually needed. 😂
Jeff Smith@JeffSmith888·
@GawroskiT Anybody else think it's nuts to kneecap flow-through cooling performance to protect what looks to be a completely optional ARGB cable?
Tomasz Gawroński@GawroskiT·
Sapphire Nitro+ 9070 XT has protections in place for the 12VHPWR connector. Not only that, they even covered the radiator FINS so the cable won't get damaged! And on top of that, a magnetic back cover to hide the cable. Superb engineering.
Jeff Smith@JeffSmith888·
@wassickt I'm essentially always over-optimistic on these things, but I was really hoping the design would be F2F (i.e., SRAM TSVs for CCD power/GMI IO only) and maybe even W2W assembly. Thanks again for the hard work! I hope this helps pressure AMD to be more generous at ISSCC in Feb. 😄
Tom Wassick@wassickt·
@JeffSmith888 I suspect that the smaller CCD than the SRAM required the oxide fill at the edge, and it had to be thin to do that effectively in the processing. Of course that raises the question on why not just shrink the SRAM a bit..
Tom Wassick@wassickt·
9800X3D XSec Highlights:
o Both CCD and SRAM are thinned (sub 10 um), so a thick "dummy Si" is oxide-oxide bonded to the stack
o SRAM Si area is larger than the CCD -- there's a 50 um "oxide edge" for the CCD
o As with the 2nd gen, the BPVs are terminated on the Al of the CCD
Jeff Smith@JeffSmith888·
@wassickt To clarify, is the CCD mounted BEOL down (towards the SRAM die) or up (flipped and using TSVs of its own)? And if it’s the former, any guesses about why the CCD wouldn’t have just been thinned by ~45um to skip the dummy 750um cap altogether?
Tom Wassick@wassickt·
o With BEOLs included, the die stack is about 40-45 um thick
o Total stack is close to 800, so the remaining 750 is the dummy Si over the top
Jeff Smith@JeffSmith888·
@CDemerjian This sounds like declaring you’ll never use your boats to go back, but being explicitly positive about them not being burned. Maybe too much of a half measure?
Charlie Demerjian@CDemerjian·
Just made an account on another social network with a color and a thing above your head. Same handle but i don't want to put the name in until I know there isn't retribution for posting a competitor to twitter. Join me if you want to interact.
Jeff Smith@JeffSmith888·
@hkultala @SebAaltonen Arm (likely wisely) forbade anyone besides Apple from doing modal TSO toggles, but LDAR/STLR moves did get added. The real issue isn't implicit/explicit fencing but perf. OOO cores able to handle floods of translated barriers are just a higher minimum bar than Arm wants.
Heikki Kultala@hkultala·
@SebAaltonen For efficient emulation of x86, a compatible memory consistency mode (TSO?) is needed. Apple has this as unofficial extension, ARM needs to make it official and Microsoft needs to add support for it in the binary compiler of the emulator. Before that, bad x86 emulation perf
Sebastian Aaltonen@SebAaltonen·
Nvidia must depend on Intel CPUs for laptops and this is risky. Intel is falling apart. Nvidia's own ARM SoC will solve this problem. Just like Apple solved their problems a few years ago. Nvidia never had x64 license, but now Windows supports ARM and they have their own ARM CPU.
Sebastian Aaltonen@SebAaltonen·
@Gnattuoc 9950X3D would have none of these problems if it had 3D cache on both dies. Their new 3D cache is under the die, so it doesn't limit clock rate. They don't need to compromise anymore. They can now build the ultimate CPU that beats everything in every test.
Sebastian Aaltonen@SebAaltonen·
Last week I said that M4 Max will finally be slightly faster than Intel and AMD consumer desktop chips in both MT and ST. Today 9800X3D with new 3D cache design under the chiplet shows such big gains that 9950X3D with 3D cache under both chiplets would still beat M4 Max.
Jeff Smith@JeffSmith888·
@SebAaltonen @Bad_AI_ Bro we’re down to half a lane now since dual-issue VLIW2-like fp32 became a thing.
Sebastian Aaltonen@SebAaltonen·
@Bad_AI_ Nvidia naming is horrible. They call each SIMD lane a CUDA core. SIMD lanes can't execute independent instructions. They are not cores. CUDA core is a marketing name. They just wanted a big number. Compare Nvidia SMs to Apple cores. SM is the real shader core in Nvidia GPUs.
Sebastian Aaltonen@SebAaltonen·
You don't need 80 core GPU to enter mainstream gaming market. 40 core GPU in M4 Max is already enough to compete against PS5 Pro. The problem is the price. M4 Max laptop with 40 core GPU + 2TB SSD costs 5000€. Mainstream gamers can't afford that. Ultra is even less affordable.
Wccftech@wccftech

M4 Ultra designed for the Mac Pro has been hinted to feature up to an 80-Core GPU, as Apple said to be in a better position to enter the mainstream gaming market wccftech.com/m4-ultra-for-t…

Jeff Smith@JeffSmith888·
@SebAaltonen @Sebasti66855537 E.g., not needing to throw more transcendental EUs in a hypothetical segregated AI block for activation functions, and not having to worry about balancing vector vs matrix units on differently scaled up chips.
Jeff Smith@JeffSmith888·
@SebAaltonen @Sebasti66855537 What are the rough size scales of AI passes in rendering work graphs? Although trips off and back to a GPU aren’t tolerable in realtime gfx, it’s not clear to me that ALU/SM-level serves as much a shader latency need as a chip designer convenience.
Sebastian Aaltonen@SebAaltonen·
I am glad that Sony is pushing AMD. They wanted tensor cores and fast RT in PS5 Pro. AMD is unifying RDNA/CDNA as UDNA. This means we get full CDNA tensor cores also in consumer Radeons soon. Nvidia did the same already in Turing...
Osvaldo Pinali Doederlein@opinali

NEW BLOG! "The PS5 Pro, RDNA 4 & FSR 4.0". This one is two things at the same time: a well-edited overview of lots of recent news & explainers, and my speculative instincts gone wild. No, you can't have the useful and correct part without the crazy part. link.medium.com/beuSHErldOb

Jeff Smith@JeffSmith888·
@Darth_Goldsmith @wassickt @aschilling @GamersNexus Looking at device characteristics alone (assuming process DTC/L3 SRAM compatibility), F2F WoW seems ideal for CCD-on-L3D from a lot of thermal, mechanical, and power delivery standpoints. But yeah that's a pretty big "if" that TSMC didn't fully pipeclean with Graphcore.
Jeff Smith@JeffSmith888·
@Darth_Goldsmith @wassickt @aschilling @GamersNexus Yeah, I bugged some people at VLSI in May about how viable a F2F MI300 design would have been and got dutifully noncommittal answers about design/validation and assembly tooling readiness as well as the obvious additional base die power TSV needs.