OS Dev

306 posts

@OSdev_

Senior Engineer @Qualcomm - Performance Engineering | Windows kernel | C/C++ | ARM64 | CPU & Memory Microarchitectures | SoCs

void* Joined June 2024
651 Following · 2.5K Followers
Pinned Tweet
OS Dev
OS Dev@OSdev_·
A new blog post on Windows thread management and thread synchronization primitives, with examples. Explained: 1. Race conditions, with both a simple and a low-level explanation 2. Deadlocks 3. Interview-ready problems like the producer-consumer problem, etc. osdev.medium.com/windows-intern…
OS Dev
OS Dev@OSdev_·
LSE (Large System Extensions) was introduced in Armv8.1-A. It improves the performance of atomic operations on multi-core systems and reduces bus traffic. Instructions like LDADD (atomic add) and CAS (compare and swap) perform the entire atomic read-modify-write in a single instruction. Most beautiful thing: "LSE pushes the arithmetic/logic operation down into the cache subsystem (the interconnect or L3 coherency controller) where the data already resides" (on implementations that execute the atomic far from the core; whether it runs near or far is an implementation choice). It's compute near data: instead of the CPU fetching the data, modifying it and storing it back, the operation runs where the cache line lives. This avoids the cache ping-pong problem of the pre-v8.1 LDAXR/STLXR loops (the STLXR fails if another core modified the location in between, which triggers a retry - more on this later).
OS Dev@OSdev_

I started reading about Store-to-Load forwarding and now I'm exploring the entire memory subsystem microarchitecture. Currently reading about the Armv8.1-A Large System Extensions (LSE) and the AMBA CHI protocol (the communication protocol used between memory subsystem components, e.g. cache-to-cache transfers over the interconnect). Will write about this in detail soon :)
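To make the LSE point concrete, here is a minimal C11 sketch. The function names `counter_add` and `counter_store_max` are made up for illustration. The assumption is an AArch64 target: with something like `-march=armv8.1-a`, GCC and Clang typically emit a single LDADD for the fetch-add and a CAS instruction for the compare-exchange, while for plain Armv8.0 they fall back to an LDAXR/STLXR retry loop or a runtime-selected helper. The C source stays the same either way.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add `delta` to *counter and return the previous value.
 * One read-modify-write from the programmer's point of view; the compiler
 * decides whether that becomes a single LSE instruction or a retry loop. */
uint64_t counter_add(_Atomic uint64_t *counter, uint64_t delta)
{
    return atomic_fetch_add_explicit(counter, delta, memory_order_acq_rel);
}

/* Atomically raise *slot to `candidate` if `candidate` is larger.
 * Returns the maximum that is stored once this call is done. */
uint64_t counter_store_max(_Atomic uint64_t *slot, uint64_t candidate)
{
    uint64_t seen = atomic_load_explicit(slot, memory_order_relaxed);
    while (seen < candidate) {
        if (atomic_compare_exchange_weak_explicit(slot, &seen, candidate,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
            return candidate;   /* our value was installed */
        }
        /* On failure `seen` now holds the current value; the loop re-checks it. */
    }
    return seen;                /* someone already stored a value >= candidate */
}
```

Starting from a counter of 5, `counter_add(&c, 3)` returns 5 and leaves 8; `counter_store_max(&c, 20)` then installs 20, and a later `counter_store_max(&c, 10)` leaves it untouched.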

OS Dev
OS Dev@OSdev_·
@greenboxal It seems like it's because of cache coherence protocols. A single instruction tends to own the cache line until it completes execution.
Jonathan Lima - complexity/acc
I think it’s a bit deeper than that: there’s no guarantee you can write a program that shows a violation, and not seeing one wouldn’t actually mean all its reads are tear-free; it might just be that you never found the one counter-example. For example, it might be that, by the uniformity of your programs, with no other races, task switches, etc., it just happens to generate one schedule that works. If you wanna really stress it, make all the other cores write as well, increasing the invalidation queue depth.
OS Dev
OS Dev@OSdev_·
Single loads and stores are atomic. What I learnt: in modern CPU architectures, a single load or store is atomic provided the data is *naturally aligned* (its address is a multiple of its size) and no wider than the machine's native access width. In short, it means that a reader shouldn't see tearing. On my hardware:
OS Dev tweet media
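As a rough illustration of the two terms above, here is a small C sketch. All names are hypothetical. It only shows what "naturally aligned" means and what a torn value would look like to a reader; the half-written store is simulated in a single thread, so it is not a hardware stress test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OLD_PATTERN UINT64_C(0x0000000000000000)
#define NEW_PATTERN UINT64_C(0xFFFFFFFFFFFFFFFF)

/* Naturally aligned: the address is a multiple of the access size. */
bool is_naturally_aligned(const void *addr, size_t size)
{
    return size != 0 && ((uintptr_t)addr % size) == 0;
}

/* A writer only ever stores OLD_PATTERN or NEW_PATTERN.
 * A reader that observes anything else has seen a torn value. */
bool is_torn(uint64_t observed)
{
    return observed != OLD_PATTERN && observed != NEW_PATTERN;
}

/* Simulation only: what a reader could observe if a 64-bit store were
 * performed as two 32-bit halves and it looked in between them. */
uint64_t simulate_half_written_store(void)
{
    uint64_t value = OLD_PATTERN;
    uint64_t incoming = NEW_PATTERN;
    memcpy(&value, &incoming, sizeof(uint32_t)); /* first half has landed */
    return value;                                /* second half has not */
}
```

A real test would run a writer and a reader on different cores and apply `is_torn` to every value read; for naturally aligned 64-bit accesses on a 64-bit CPU the expectation is that it never fires.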
OS Dev
OS Dev@OSdev_·
@greenboxal This makes more sense. It's difficult to find such examples as well. I've tried plenty of times but with no success.
OS Dev
OS Dev@OSdev_·
@greenboxal I believe this could be it. But I'll check the PMU counters to understand more. The cycle count should probably be higher for the unaligned data, but let me check.
Jonathan Lima - complexity/acc
@OSdev_ In fact, on x86_64 I believe you can read 128 bits tear-free into an AVX register by the same logic. You can play some dangerous games with hazard pointers here, but they actually work.
Zip CPU
Zip CPU@zipcpu·
@OSdev_ Adapters and bridges cost latency and logic. Why waste the latency on a critical path, when you could have built the CPU's memory controller to handle arbitrary widths in the first place? It's not that much harder, and there's quite a benefit to doing so.
Zip CPU
Zip CPU@zipcpu·
Wow ... it's amazing how many unusable RISC-V SOCs there are out there. 1) If you are going to build a CPU, the CPU's memory bus interface should match the width of the memory. (Today's bug.) 2) If you want to use SDRAM, don't use Wishbone Classic.
OS Dev
OS Dev@OSdev_·
How does the clock tell the CPU to move forward? youtu.be/PVNAPWUxZ0g?si…
YouTube video
OS Dev@OSdev_

After many years, I finally got a chance to read about clocks and timers. I had studied LC circuits, impedance, resonance, oscillators (feedback + loop gain + inverters + amplifiers) and piezoelectricity in electrical engineering, but I never got to actually work with or understand CPU clocks and timers because I never went in depth into VLSI or electronics in general. It turns out these are the same concepts used in clocks and timers.
A quartz crystal behaves electrically like an LC circuit with a very high Q factor, which results in precise and stable frequencies. But it's not a conductor! Quartz is piezoelectric: an electric field makes it deform, and deforming it generates a voltage. When the AC voltage from the circuit hits the crystal, it physically flexes. If that AC frequency matches the crystal’s "natural" mechanical frequency, the crystal vibrates strongly (resonance) and its impedance drops sharply. This is how it selects a stable frequency!
The combination below is what makes a CPU clock.
1. Pierce oscillator
2. Phase-locked loop (PLL) circuits (the oscillator just acts as a frequency selector, giving a stable, low frequency in the MHz range; the PLL multiplies it up to high frequencies)
3. Clock trees (buffers that square up the sine-ish signal into digital square-ish waves and distribute it across the chip)
Think of timers as just counters driven by a stable frequency, which is how they provide accurate time to the OS. I am still learning, so there are lots of over-simplifications here. I'll share more in upcoming tweets.
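The "timer is just a counter" idea and the PLL multiplication can be written down as two small C helpers. This is only a sketch: the function names and the 24 MHz / x125 numbers in the usage note are made up for illustration and do not describe any particular chip.

```c
#include <stdint.h>

/* PLL output frequency in integer form: f_out = f_ref * multiplier / divider.
 * Precondition: divider != 0. */
uint64_t pll_output_hz(uint64_t ref_hz, uint32_t multiplier, uint32_t divider)
{
    return ref_hz * multiplier / divider;
}

/* A timer is a counter ticking at a known, stable frequency, so elapsed
 * time is ticks / frequency. Whole seconds and the remainder are handled
 * separately to keep the intermediate product small.
 * Precondition: timer_hz != 0 (and below roughly 1.8e10 to avoid overflow). */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t timer_hz)
{
    uint64_t whole_seconds = ticks / timer_hz;
    uint64_t leftover_ticks = ticks % timer_hz;
    return whole_seconds * UINT64_C(1000000000)
         + leftover_ticks * UINT64_C(1000000000) / timer_hz;
}
```

With a hypothetical 24 MHz crystal, `pll_output_hz(24000000, 125, 1)` gives 3,000,000,000 Hz, and a timer clocked directly from the crystal reports `ticks_to_ns(12, 24000000)` = 500 ns after 12 ticks.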

OS Dev
OS Dev@OSdev_·
@j4_v_3s It's definitely not helpful in all cases. I was referring to simple logic like helper(add); helper(multiply); where helper() takes the function pointer and does something with it. In simple terms, it can be read as: helper() accepts behaviour.
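A minimal C sketch of the helper(add) / helper(multiply) idea. The exact signature of `helper` is an assumption made for this example (the tweet does not give one); here it folds the operation over an array.

```c
/* Type of the behaviour that helper() accepts. */
typedef int (*binary_op)(int, int);

int add(int a, int b)      { return a + b; }
int multiply(int a, int b) { return a * b; }

/* helper() accepts behaviour: it combines all values with `op`,
 * starting from `initial`, without knowing what `op` actually does. */
int helper(binary_op op, const int *values, int count, int initial)
{
    int acc = initial;
    for (int i = 0; i < count; i++) {
        acc = op(acc, values[i]);
    }
    return acc;
}
```

For `int v[] = {1, 2, 3, 4}`, `helper(add, v, 4, 0)` returns 10 and `helper(multiply, v, 4, 1)` returns 24. The call site reads as "do this to the data", which is the readability argument.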
James
James@j4_v_3s·
@OSdev_ In what way are they beneficial for readability?
OS Dev
OS Dev@OSdev_·
Function pointers are really beneficial for readability and coding patterns. But the key issue in performance-critical programs is that the compiler usually cannot inline the call when the target is only determined at runtime (dynamic decisions). The call becomes an indirect branch, which is harder for the branch predictor, and a misprediction flushes the pipeline and causes stalls. Recently I finished reading about cache internals and am now deep into Store-to-Load forwarding. In the coming days, we will look more into branch prediction :)
Piyush Itankar@_streetdogg

[C] Function Pointers - Everything you need to know! Function pointers are pointers that point to code. When these are dereferenced, the fetched data is treated like instructions and the CPU executes them. "" When we dereference a function pointer, the PC register in the CPU is set to the address held by the pointer. "" The full note is available here: pyjamacafe.com/posts/function…
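The inlining concern can be seen by putting a direct call next to a runtime-selected one. This is a sketch with hypothetical names; what the compiler really does depends on the optimization level and on whether it can prove which function the pointer holds.

```c
#include <stddef.h>

typedef int (*unary_op)(int);

static int twice(int x)  { return 2 * x; }
static int square(int x) { return x * x; }

/* Target known at compile time: the compiler is free to inline twice()
 * into the loop, so there may be no call instruction left at all. */
int sum_twice_direct(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        total += twice(values[i]);
    }
    return total;
}

/* Target arrives as a runtime value: unless the compiler can prove what
 * `op` is, every iteration performs an indirect call through the pointer. */
int sum_indirect(unary_op op, const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        total += op(values[i]);
    }
    return total;
}

/* Stands in for a decision that is only known at runtime. */
unary_op pick_op(int use_square)
{
    return use_square ? square : twice;
}
```

Both versions compute the same result for the same operation; the difference shows up in the generated code, so comparing the disassembly of the two loops is a reasonable next step.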

OS Dev
OS Dev@OSdev_·
Continuation... Let's discuss how LDR and STR instructions actually get executed.
*p = 10 // store (older entry)
x = *p // load (younger entry)
For *p = 10, the CPU does not write directly to the cache; the store first goes into the Store Queue (a temporary holding area). The address and the data can arrive at different times; here, assume the address is written first and the data comes later.
For x = *p, the Address Generation Unit (AGU - will discuss this more in coming tweets) computes the address, and the Load Queue gets an entry with multiple fields (address, age for instruction order, data, ready/status bits). The Load Queue tracks this load.
When the load executes, there are actually multiple stages to it. I am not going deep, just an overview.
1. Dependency check - check whether the Store Queue holds an older entry (by program order) with the same address. Let's assume the address is present, so the data is either already there or yet to arrive.
- If the data is already there, the load takes it straight from the Store Queue, which is what Store-to-Load forwarding means.
- If the data is yet to come, the load can either stall/wait for it or execute anyway (whattttt????? Yes, that was my first reaction, because it all seems like some kind of magical world).
2. Speculation - there's an entry in the Store Queue but the data is not there yet. If the load executes anyway, it reads from the cache and gets stale data. When the store later executes and its entry receives the data, the Load Queue detects that the already-executed load was wrong and needs to be corrected. (Many real designs speculate mainly when the older store's address is still unknown, but the recovery idea is the same.)
3. Replay - the Load Queue invalidates the entry, flushes the younger dependent instructions and re-executes the load. This time it gets the data from the Store Queue or waits properly.
How exactly does it detect that a replay is needed? This is interesting, but of course it's very complex as well, so just an overview will suffice for our understanding. When the store entry gets its data after the load has executed (and got stale data from the cache), the load is re-checked against the Store Queue entry by matching addresses. If the Store Queue entry's status bits are ready but the load has already executed, that's when a replay happens.
Why do all this instead of just waiting for the store to finish or executing in order? Because out-of-order execution speeds up many things and keeps the pipeline busy! That's how we squeeze out performance :) This is just an over-simplification of a complex subject. I still need to read about memory ordering, multiple parallel executions and, importantly, AGUs.
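The dependency check and the replay condition described above can be modelled with a toy Store Queue in C. This is a software sketch of the idea only: the struct layout, the names and the linear search are assumptions for illustration, and real hardware does the matching in parallel with far more state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t addr;        /* address the store writes to */
    uint64_t  data;        /* value to write (valid only if data_ready) */
    unsigned  age;         /* smaller = older in program order */
    bool      data_ready;  /* has the store's data arrived yet? */
} store_entry;

typedef enum {
    LOAD_FROM_CACHE,             /* no older store to this address */
    LOAD_FORWARDED,              /* Store-to-Load forwarding */
    LOAD_MUST_WAIT_OR_SPECULATE  /* match found, data not there yet */
} load_outcome;

/* Dependency check: find the youngest store that is older than the load
 * and has the same address, then decide what the load can do. */
load_outcome check_load(const store_entry *queue, size_t count,
                        uintptr_t load_addr, unsigned load_age,
                        uint64_t *forwarded)
{
    const store_entry *match = NULL;
    for (size_t i = 0; i < count; i++) {
        const store_entry *s = &queue[i];
        if (s->addr == load_addr && s->age < load_age &&
            (match == NULL || s->age > match->age)) {
            match = s;
        }
    }
    if (match == NULL)      return LOAD_FROM_CACHE;
    if (!match->data_ready) return LOAD_MUST_WAIT_OR_SPECULATE;
    *forwarded = match->data;
    return LOAD_FORWARDED;
}

/* Replay condition: an older store to the same address has its data now,
 * but the load already executed and read from the cache. */
bool needs_replay(const store_entry *store, uintptr_t load_addr,
                  unsigned load_age, bool load_already_executed)
{
    return load_already_executed && store->data_ready &&
           store->addr == load_addr && store->age < load_age;
}
```

For the `*p = 10; x = *p` pair, a store entry with `data_ready` set yields `LOAD_FORWARDED` with the value 10. An entry whose data has not arrived yields `LOAD_MUST_WAIT_OR_SPECULATE`, and `needs_replay` turns true once that data shows up after the load has already run.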
OS Dev@OSdev_

From here, I went deep into cache internals and architecture. While reading about that I found a weird statement: "Misaligned loads are cheaper than misaligned stores." My mental model of LDR and STR was wrong. The statement holds true, because a misaligned LDR that crosses a cache line boundary just needs to read 2 cache lines (even better if it can find the data in the store buffer/queue - Store-to-Load forwarding). But an STR that crosses the boundary needs to **read** 2 cache lines first. Yes, why read? Because a store only changes some bytes of each line, so the rest of the existing data must be preserved. And it's not just a read: the core needs ownership of both lines (in multi-core, cache coherence comes into the picture) before it can modify the respective bytes. The store itself first sits in the store buffer, because that helps out-of-order execution; later, the store buffer is drained to the cache, and so on. I come from an electrical background, so my mental model was very simple: LDR means read from the cache (cache hit/miss) and STR means write to the cache, and nothing deeper than this.
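Whether an access needs one cache line or two is plain address arithmetic. A small C sketch, assuming 64-byte cache lines (common, but not universal) and a hypothetical function name:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumption: typical line size, check your CPU */

/* Number of cache lines touched by an access of `size` bytes at `addr`.
 * Precondition: size > 0. */
size_t cache_lines_touched(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_BYTES;
    return (size_t)(last_line - first_line + 1);
}
```

An 8-byte access at 0x1038 ends exactly at the end of its line and touches 1 line, while the same access at 0x103C spills into the next line and touches 2. Note that an access can be misaligned and still stay inside one line, e.g. 4 bytes at 0x1001.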

OS Dev
OS Dev@OSdev_·
The more I learn about CPU architecture, the more I start to believe that extraterrestrial technology exists. What do you mean Store-to-Load forwarding exists :D
OS Dev
OS Dev@OSdev_·
We learn CPU pipelines as Fetch → Decode → Execute
But in reality, it's more like this: Fetch → Decode → Rename → Dispatch → Execute → Writeback → Commit
1. Fetch? - it is itself split into multiple stages: fetching from the L1 I-cache, branch prediction (which deserves a book of its own)
2. What exactly is renaming? - it removes false dependencies. What do I mean by that? Let's say two instructions reuse the same register even though the second one does not need the first one's value (write-after-read or write-after-write). Renaming the second instruction's register to a different physical register removes the false dependency, which would otherwise cause stalls.
3. Dispatch & Execute? - instructions are sent to multiple units like ALUs, SIMD units, branch units and the load/store unit
4. Commit? - ensures the architectural state is updated in order (because modern CPUs can execute out of order, a very interesting topic, to keep the pipelines busy)
All of this cannot happen in one cycle, so every instruction takes multiple cycles, and how many depends on many things.
Inside Computing@insidecomput

People think CPUs run instructions one by one. Wrong. They use a pipeline: Fetch → Decode → Execute → Memory → Writeback And these happen at the same time. That’s how CPUs get fast. Not smarter. Just more parallel. #CPU #ComputerArchitecture
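The renaming step from the pipeline list above can be sketched as a tiny rename table in C. This is a toy model with made-up names: it never recycles physical registers and ignores everything else a real rename stage does.

```c
#define ARCH_REGS 8   /* toy machine: architectural registers r0..r7 */

typedef struct {
    int map[ARCH_REGS];  /* architectural register -> current physical register */
    int next_free;       /* next unused physical register (never recycled here) */
} rename_table;

typedef struct {
    int dst, src1, src2; /* dst = src1 op src2 */
} instr;

void rename_init(rename_table *t)
{
    for (int r = 0; r < ARCH_REGS; r++) {
        t->map[r] = r;   /* start with r_i living in p_i */
    }
    t->next_free = ARCH_REGS;
}

/* Sources read the current mapping (true dependencies are preserved);
 * the destination always gets a fresh physical register, so a later
 * write can never clobber a value an earlier instruction still needs. */
instr rename_one(rename_table *t, instr in)
{
    instr out;
    out.src1 = t->map[in.src1];
    out.src2 = t->map[in.src2];
    out.dst  = t->next_free++;
    t->map[in.dst] = out.dst;
    return out;
}
```

Take `r1 = r2 + r3` followed by `r2 = r4 + r5`. The second instruction writes r2 while the first still has to read it, which is a false (write-after-read) dependency. After renaming they become `p8 = p2 + p3` and `p9 = p4 + p5`, so they no longer share a register and can run in either order. A third instruction `r6 = r2 + r1` becomes `p10 = p9 + p8`, which keeps the true dependencies on both earlier results.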
