OS Dev

875 posts

@OSdev_

void* Joined June 2024

913 Following · 4.7K Followers

Pinned Tweet

OS Dev@OSdev_·20 Jun

Read “Windows Internals: Thread Management — Part 1” by OS Dev on Medium. This article discusses the ETHREAD and KTHREAD kernel objects and how the Windows scheduler schedules a thread. medium.com/windows-os-int…

OS Dev@OSdev_·3d

@Abhishekcur It's really easy to fall into this trap and feel like we are learning every day just by liking and bookmarking content, or just by knowing the names of the concepts.

Abhishek🌱@Abhishekcur·3d

you are not "behind." you're just sitting on 400 bookmarks, 12 unread books, and 30 tabs pretending to be progress. collecting resources is not the same as learning. here's the trap almost everyone falls into

OS Dev@OSdev_·3d

It's often said that calling malloc() or using new immediately allocates RAM, but in practice things are more nuanced.

When a program requests memory, the operating system may reserve a range of virtual address space rather than assigning physical memory right away. The pointer returned is valid, but many of the pages behind that address range might not yet be backed by physical RAM.

Physical pages are often allocated lazily. When the program first accesses a page, the CPU raises a page fault because the page isn't mapped to physical memory. The operating system then handles the fault, allocates a physical page (or maps a shared zero page, depending on the situation), updates the page tables, and resumes execution. This process is known as demand paging.

This approach reduces startup overhead and avoids spending RAM on memory that a program reserves but never actually touches. It also explains why an application can successfully allocate a large block of memory while using much less physical memory until it begins accessing that space.

So, while malloc() or new gives your program virtual memory, physical RAM is often only committed when the memory is actually used.
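To make that sequence concrete, here is a minimal sketch of demand paging as a toy model in Python. It is not how any real kernel or allocator is written: the class name `ToyAddressSpace`, its methods, and the 4 KiB page size are all assumptions made for illustration. It only shows the bookkeeping: reserving address space costs no frames, and the first touch of each page takes one fault.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyAddressSpace:
    """Toy model of demand paging (hypothetical, for illustration only)."""

    def __init__(self):
        self.reserved = set()   # virtual page numbers handed out by reserve()
        self.page_table = {}    # virtual page number -> physical frame number
        self.next_vpn = 0
        self.next_frame = 0
        self.page_faults = 0

    def reserve(self, nbytes):
        """Reserve virtual pages only; no physical frame is assigned yet."""
        npages = -(-nbytes // PAGE_SIZE)  # ceiling division
        base = self.next_vpn * PAGE_SIZE
        self.reserved.update(range(self.next_vpn, self.next_vpn + npages))
        self.next_vpn += npages
        return base

    def touch(self, addr):
        """Access an address; the first touch of a page takes a page fault."""
        vpn = addr // PAGE_SIZE
        if vpn not in self.reserved:
            raise MemoryError("access outside any reserved range")
        if vpn not in self.page_table:
            # Fault path: assign a frame and record it in the page table.
            self.page_faults += 1
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1
        return self.page_table[vpn]

    def resident_bytes(self):
        """Bytes currently backed by a physical frame."""
        return len(self.page_table) * PAGE_SIZE


space = ToyAddressSpace()
base = space.reserve(64 * 1024 * 1024)        # "allocate" 64 MiB
print(space.resident_bytes())                 # nothing is backed yet
for i in range(10):                           # touch only the first 10 pages
    space.touch(base + i * PAGE_SIZE)
print(space.page_faults, space.resident_bytes())
```

In this model the 64 MiB reservation stays at zero resident bytes until pages are touched, which mirrors the gap between virtual size and physical usage described above.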

OS Dev@OSdev_·3d

One of the biggest operating system myths is that system calls are just interrupts. On modern hardware, they usually aren't.

A system call is a controlled transition from user mode to kernel mode that lets an application request a privileged service, such as reading a file, allocating memory, creating a process, or sending data over the network. The CPU enters the kernel through a mechanism designed specifically for system calls. On x86-64 this is typically the SYSCALL/SYSRET instruction pair (older 32-bit systems used a software interrupt such as int 0x80 on Linux, which is where the myth comes from), while ARM64 uses the SVC (Supervisor Call) instruction.

A hardware interrupt is different. It is generated asynchronously by a hardware device, such as a timer, keyboard, SSD, or network card, to notify the CPU that something needs attention. Interrupts can occur at almost any time, independent of what the currently running program is doing.

Both system calls and interrupts enter the kernel, but they exist for different reasons. A system call is a synchronous request initiated by software, while an interrupt is an asynchronous event generated by hardware. Confusing the two makes it harder to understand how operating systems manage privilege, devices, and execution flow.

OS Dev@OSdev_·3d

arxiv.org/pdf/2412.18104

OS Dev@OSdev_·3d

Modern CPUs pack a ton of cores onto a single chip, which is great in theory, but in practice those cores can end up stepping on each other's toes. This is what people mean by cross-core interference. When multiple cores run different tasks, they all share certain hardware resources such as caches, memory bandwidth, and interconnects. One core might evict another core's cache lines, the cores might compete for memory access, or they might generate extra coherence traffic that slows everything down. All of this adds up to higher latency and lower overall throughput.

This paper talks about six years of real-world experience building a Linux scheduler that's aware of these interference effects. Instead of just placing tasks wherever there's space, the scheduler tries to be smarter about it: grouping compatible workloads together, isolating sensitive or latency-critical tasks, and generally making better use of shared resources. The goal is to reduce unnecessary contention and keep things running smoothly.

What's interesting is that all of these improvements come purely from software changes, with no new hardware required. By being more intentional about how tasks are scheduled across cores, the authors were able to get more predictable performance and better efficiency overall. It's a good reminder that even with powerful multicore CPUs, the operating system still plays a huge role in how well everything actually performs.

OS Dev@OSdev_·4d

usenix.org/legacy/publica…

OS Dev@OSdev_·4d

This reminded me of the classic USENIX paper "Overcoming Workstation Scheduling Problems in a Real-Time Audio Tool", which explains the same idea. The paper shows that real-time audio isn't limited by average CPU performance; it's limited by scheduler jitter and missed deadlines. Even if the CPU is mostly idle, one delayed wake-up can cause the playback buffer to underrun, producing an audible glitch. It also discusses the same tradeoff mentioned here: larger buffers reduce glitches but increase latency, while smaller buffers lower latency but leave much less room for scheduling delays. Nearly 30 years later, the core lesson still holds: for real-time audio, meeting every deadline matters more than being fast on average.

LaurieWired@lauriewired

Real Time Audio on general purpose operating systems is ridiculously hard to code. Unfortunately (and unlike the visual system!) humans are *really* good at noticing audio hitches. In video, you might have ~16ms to process a video frame, and if you miss the deadline, eh, just continue to display the existing frame. It’ll hitch…but a lot of people won’t notice. The eye is fairly forgiving, mostly deals in averages. If you miss a SINGLE audio sample (.00002 sec at 44.1khz!) it’s super obvious! The ear was basically made to detect discontinuities in waveforms; the real life equivalent would be like a twig snapping. The waveform collapse (single audio sample dropped) spreads energy across every frequency band at once, almost every hair cell in your cochlea fires! There’s not really a great way to fix this. You can sample and hold, (either just the sample or the whole buffer), but the splice to the next chunk will have a *very* audible seam. Smarter systems will crossfade, and then really intelligent protocols like modern bluetooth will attempt to pitch-bend the seam. But every single one of those “fixes” costs latency and CPU time…which you don’t have in real time audio!
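The buffer-size tradeoff can be put into rough numbers. The sketch below is a simplified model, not code from the paper or from any audio API: it assumes an underrun happens whenever a wake-up is delayed by more than the playback time of one buffer, and it ignores the time spent actually filling the buffer. The function names and the sample delays are made up for illustration.

```python
def buffer_duration_ms(frames, sample_rate_hz):
    """Playback time covered by one buffer, in milliseconds."""
    return 1000.0 * frames / sample_rate_hz


def count_underruns(wakeup_delays_ms, frames, sample_rate_hz):
    """Count wake-ups that arrive after the buffer has already drained.

    Simplifying assumption: each delay is compared against a full buffer.
    """
    budget_ms = buffer_duration_ms(frames, sample_rate_hz)
    return sum(1 for delay in wakeup_delays_ms if delay > budget_ms)


# Hypothetical wake-up delays: mostly tiny, with one 5 ms scheduling hiccup.
delays_ms = [0.2, 0.3, 5.0, 0.1, 0.4]
average_ms = sum(delays_ms) / len(delays_ms)

for frames in (64, 1024):
    budget = buffer_duration_ms(frames, 48_000)
    misses = count_underruns(delays_ms, frames, 48_000)
    print(f"{frames} frames: budget {budget:.2f} ms, "
          f"average delay {average_ms:.2f} ms, underruns {misses}")
```

With the 64-frame buffer the average delay sits below the budget, yet the single 5 ms delay still causes an underrun. The 1024-frame buffer absorbs it, at the cost of roughly 21 ms of added latency.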

OS Dev@OSdev_·4d

@lauriewired Real-time audio is a great example of why "fast on average" isn't the same as "deterministic." Average CPU utilization can be low, but a single scheduling delay or cache miss at the wrong moment is enough to create an audible glitch.

OS Dev@OSdev_·4d

@yarden_shafir I was waiting for this series. Thank you!

Yarden Shafir@yarden_shafir·4d

First post in the series: a quick explanation of the PreviousMode mitigation added in 23H2: windows-internals.com/random-windows…

Yarden Shafir@yarden_shafir

Gonna start a new blog series where I document small Windows features/changes/techniques that I couldn't find documented anywhere. I have a few ideas already but is there anything you'd like me to write about?

OS Dev@OSdev_·4d

Most people think every context switch completely flushes the TLB. That's no longer true on modern CPUs. The TLB is a small cache that stores recent virtual-to-physical address translations, allowing memory accesses to avoid expensive page table walks. If it were flushed on every process switch, each new process would experience a burst of TLB misses, hurting performance. To avoid this, modern x86 processors use PCIDs (Process Context Identifiers) and ARM64 processors use ASIDs (Address Space Identifiers), which tag TLB entries with the address space they belong to. This allows the CPU to keep translations from multiple processes in the TLB at the same time and simply use the entries that match the currently running process. Context switches still have overhead because the operating system must save and restore CPU state and the scheduler must switch execution, but rebuilding the entire TLB after every switch is no longer part of the common case.
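The effect of tagging can be sketched with a toy TLB model. This is an illustration only, not a description of real hardware: it assumes unlimited TLB capacity, ignores eviction and invalidation, and the class `ToyTLB` and helper `run` are hypothetical names. The untagged variant flushes on every switch; the tagged variant keys each entry by an address-space identifier, in the spirit of PCIDs and ASIDs.

```python
class ToyTLB:
    """Toy TLB; entries are keyed by (address-space id, virtual page number)."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}
        self.current_asid = 0
        self.misses = 0

    def switch_to(self, asid):
        if not self.tagged:
            # Without tags, stale translations could match the new process,
            # so everything has to be flushed on a switch.
            self.entries.clear()
        self.current_asid = asid

    def translate(self, vpn, page_table):
        key = (self.current_asid, vpn)
        if key not in self.entries:
            self.misses += 1                    # stands in for a page table walk
            self.entries[key] = page_table[vpn]
        return self.entries[key]


def run(tagged, rounds=3):
    """Alternate between two processes that each touch four pages."""
    tlb = ToyTLB(tagged)
    page_tables = {
        1: {vpn: 100 + vpn for vpn in range(4)},
        2: {vpn: 200 + vpn for vpn in range(4)},
    }
    for _ in range(rounds):
        for asid, table in page_tables.items():
            tlb.switch_to(asid)
            for vpn in table:
                tlb.translate(vpn, table)
    return tlb.misses


print("untagged misses:", run(tagged=False))  # 24: every time slice starts cold
print("tagged misses:  ", run(tagged=True))   # 8: only the first slice of each process
```

In the tagged case both processes use virtual page 0, but the entries never collide because the identifier is part of the key.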

OS Dev@OSdev_·4d

@zwclose Impressive!

ZwClose@zwclose·5d

A vulnerability in Realtek's card reader driver allows non-privileged users to program the DMA controller, enabling arbitrary physical memory reads and writes. Neither additional hardware nor a custom kernel driver is required to exploit the vuln. Details: zwclose.github.io/2026/07/08/rts…

OS Dev@OSdev_·4d

usenix.org/system/files/s…

OS Dev@OSdev_·4d

Many memory corruption exploits depend on one simple assumption: freed memory will eventually be reused. This paper asks a simple question: what if memory was never reused? The result is Fast Forward Allocation (FFmalloc), a practical allocator designed to stop Use-After-Free (UAF) attacks by making object reuse impossible.

A Use-After-Free bug happens when a program continues to access an object after it has been freed. Attackers exploit this by quickly allocating new data into the same memory location, replacing the old object with one they control. If the dangling pointer is later dereferenced, it operates on attacker-controlled data instead of the original object. This primitive has been widely used to compromise browsers, kernels, and other memory-unsafe software.

FFmalloc revives the idea of One-Time Allocation (OTA). Instead of recycling freed addresses, every allocation receives a new, unique virtual address. Since a freed object's address is never given to another allocation, attackers cannot reclaim that location, which eliminates the key primitive required for traditional UAF exploitation.

The challenge is that never reusing memory sounds expensive. The paper addresses this with two practical techniques: batch page management, which reduces costly system calls by managing memory in larger chunks, and a hybrid allocator that combines bump-pointer allocation with fixed-size bins to reduce fragmentation while keeping allocation fast.

The researchers implemented this design as FFmalloc and evaluated it on standard benchmarks and large real-world applications. Their prototype blocked all tested Use-After-Free exploits while introducing moderate performance overhead, showing that preventing address reuse can be practical rather than just a theoretical idea.

The biggest takeaway is that many UAF exploits don't require creating new vulnerabilities; they rely on predictable allocator behavior. By fundamentally changing how memory is reused, FFmalloc removes one of the attacker's most powerful exploitation primitives instead of trying to detect every individual bug.
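The difference between a reusing allocator and one-time allocation can be shown with two toy allocators. This is a sketch of the idea only, not FFmalloc's implementation: the class names and addresses are hypothetical, the "addresses" are plain integers, and page management and bins are left out entirely.

```python
class ReusingAllocator:
    """Toy LIFO free-list allocator: a freed address is handed out again."""

    def __init__(self):
        self.next_addr = 0x1000
        self.free_lists = {}  # size -> stack of freed addresses

    def malloc(self, size):
        bucket = self.free_lists.get(size)
        if bucket:
            return bucket.pop()        # most recently freed chunk comes back first
        addr = self.next_addr
        self.next_addr += size
        return addr

    def free(self, addr, size):
        self.free_lists.setdefault(size, []).append(addr)


class OneTimeAllocator:
    """Toy one-time allocator: addresses only move forward."""

    def __init__(self):
        self.next_addr = 0x1000

    def malloc(self, size):
        addr = self.next_addr
        self.next_addr += size
        return addr

    def free(self, addr, size):
        # The address is retired for good. Releasing the backing memory
        # is out of scope for this sketch.
        pass


for allocator in (ReusingAllocator(), OneTimeAllocator()):
    victim = allocator.malloc(64)      # object that later becomes dangling
    allocator.free(victim, 64)
    attacker = allocator.malloc(64)    # attacker-controlled allocation
    print(type(allocator).__name__, "reused address:", attacker == victim)
```

With the reusing allocator, the attacker's allocation lands on the victim's old address, so a dangling pointer would now see attacker data. With one-time allocation it never does.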

OS Dev@OSdev_·4d

usenix.org/system/files/c…

OS Dev@OSdev_·4d

Most CPU side-channel defenses focus on isolating the cache. This paper showed that wasn't enough. TLBleed demonstrated that the CPU's Translation Lookaside Buffer (TLB) can also become a high-resolution side channel, allowing attackers to recover cryptographic secrets even when state-of-the-art cache isolation defenses are enabled.

The TLB is a small hardware cache that stores recent virtual-to-physical address translations. Every memory access depends on it, making it a shared microarchitectural resource between processes running on the same CPU core.

The researchers first reverse engineered how Intel's TLB maps addresses internally. They then built a new attack that monitors when translation entries are used rather than relying on cache activity. This temporal analysis made the attack practical even though the TLB exposes much less information than CPU caches.

Their prototype, TLBleed, recovered a 256-bit EdDSA private key from a single execution with a 98% success rate after about 17 seconds of computation. It also reconstructed 92% of RSA keys from an implementation already hardened against FLUSH+RELOAD cache attacks.

The biggest lesson is that modern CPUs contain many shared microarchitectural structures besides caches. Protecting only the cache does not eliminate side channels. TLBs, branch predictors, execution buffers, and other shared hardware can all become information leakage channels, which is why defending against microarchitectural attacks requires a system-wide approach instead of fixing one component at a time.

OS Dev@OSdev_·4d

arxiv.org/pdf/2005.13435

OS Dev@OSdev_·4d

Modern CPUs execute instructions before they know if those instructions should actually run. That design choice makes them incredibly fast, but it also created an entirely new class of hardware attacks. This paper explains Transient Execution Attacks (TEAs), the family of attacks behind Spectre, Meltdown, Foreshadow, ZombieLoad, RIDL, Fallout, LVI, and many others.

A transient execution attack has three main stages:
1. Create a transient execution window using branch misprediction, exceptions, faults, or other CPU events.
2. Access data that should normally be protected while the CPU is still executing transient instructions.
3. Encode the leaked value into a microarchitectural state, most commonly the CPU cache, so it can be recovered later through a timing side channel.

The key idea is that the CPU eventually discards the incorrect instructions, but their microarchitectural side effects remain. Cache contents, branch predictor state, load/store buffers, TLBs, and other shared hardware structures can still reveal information that never became part of the architectural state.

The survey classifies transient execution attacks by:
• What creates the transient window (speculation, exceptions, faults, memory ordering, etc.)
• Which hardware structure is used as the covert channel
• Which security boundary is crossed, such as user ↔ kernel, process ↔ process, VM ↔ hypervisor, or SGX enclave ↔ host

It also compares attacks based on practical factors like attacker requirements, victim interaction, data leakage capability, and exploit feasibility, making it easier to understand why some attacks are more practical than others.

One of the paper's biggest takeaways is that these vulnerabilities are not just software bugs. They are consequences of performance optimizations built into modern out-of-order processors. Fixing them requires changes across hardware, microcode, compilers, operating systems, and application software, often with measurable performance costs.
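Stage 3, encoding a value into cache state and reading it back through timing, is the least intuitive step, so here is a purely simulated sketch of it. Nothing here touches real hardware or performs speculation: `ToyCache`, the `FAST` and `SLOW` latencies, and both helper functions are assumptions invented for illustration. The "transient" access is modeled as an ordinary call whose only lasting effect is which cache line ends up cached.

```python
FAST, SLOW = 1, 100  # pretend latencies in arbitrary units (assumed values)


class ToyCache:
    """Toy cache that only remembers which line indices are present."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def access(self, line):
        """Return a pretend latency, then mark the line as cached."""
        latency = FAST if line in self.lines else SLOW
        self.lines.add(line)
        return latency


def transient_encode(cache, secret_byte):
    # Stands in for discarded instructions: the architectural result is
    # thrown away, but the cache footprint indexed by the secret survives.
    cache.access(secret_byte)


def recover(cache):
    # Time one access per candidate value; the only fast one is the secret.
    timings = [cache.access(candidate) for candidate in range(256)]
    return timings.index(min(timings))


cache = ToyCache()
cache.flush()                  # start with none of the probe lines cached
transient_encode(cache, 42)    # the secret value leaves a footprint
print(recover(cache))          # prints 42
```

The attacker in this model never reads the secret directly; it only observes which of the 256 probe lines is fast.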

OS Dev@OSdev_·4d

arxiv.org/pdf/2606.24079

OS Dev@OSdev_·4d

What if most of a virtual machine's memory snapshot never needed to be restored in the first place? That's the observation behind Aquifer, a research system for MicroVM-based serverless computing.

When the authors analyzed MicroVM snapshots, they found that most pages are either all zeros or are rarely accessed during execution. Instead of treating every page the same, Aquifer stores only the pages that are likely to matter in faster memory.

The system combines two memory tiers:
• CXL provides low-latency, load/store access to a pod-local memory pool.
• RDMA provides access to a larger cluster-wide memory pool, but with higher latency and software overhead.

Aquifer removes zero-filled pages from the snapshot entirely, places the frequently accessed ("hot") pages in the CXL pool, and stores the remaining cold pages in the RDMA pool. Before a MicroVM resumes, hot pages are copied into memory, while cold pages are fetched asynchronously only if they're needed.

The paper also addresses another practical challenge. Since CXL 2.0 multi-headed devices do not provide hardware cache coherence across hosts, Aquifer introduces an ownership-based coherence protocol to safely share snapshots between machines.

On an emulated CXL+RDMA platform, Aquifer achieved a 2.2× geometric mean improvement in end-to-end serverless invocation time compared to Firecracker.

As memory pooling becomes more common, simply adding another memory tier isn't enough. Where data is placed and when it is moved between memory tiers can have just as much impact on performance as the memory hardware itself.
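The three-way split of snapshot pages can be sketched as a small classification pass. This is not Aquifer's algorithm: the function name, the `hot_threshold` parameter, the access-count input, and the tier labels are all hypothetical, chosen only to mirror the zero / hot / cold distinction described above.

```python
def classify_pages(pages, access_counts, hot_threshold=1):
    """Split snapshot pages into zero, hot, and cold groups.

    pages: list of bytes objects, one per page.
    access_counts: dict mapping page index -> observed accesses (assumed input).
    hot_threshold: minimum accesses to count as hot (hypothetical knob).
    """
    placement = {"dropped_zero": [], "hot_cxl": [], "cold_rdma": []}
    for index, data in enumerate(pages):
        if not any(data):
            # All-zero page: no need to store it; it can be zero-filled on restore.
            placement["dropped_zero"].append(index)
        elif access_counts.get(index, 0) >= hot_threshold:
            placement["hot_cxl"].append(index)
        else:
            placement["cold_rdma"].append(index)
    return placement


# Four tiny 16-byte "pages" for illustration.
snapshot = [
    bytes(16),                 # zero page
    b"\x01" + bytes(15),       # non-zero, accessed
    b"\x02" + bytes(15),       # non-zero, never accessed
    bytes(16),                 # zero page, even though it was accessed
]
print(classify_pages(snapshot, access_counts={1: 5, 3: 2}))
```

In this sketch a zero page is dropped even if it was accessed, since its contents can be recreated without reading anything from either pool.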

OS Dev@OSdev_·5d

nozominetworks.com/blog/printnigh…

OS Dev@OSdev_·5d

PrintNightmare is a great example of how a service that's enabled by default can become a critical attack surface. The vulnerability affected the Windows Print Spooler service and was tracked as CVE-2021-1675 and CVE-2021-34527. The confusion around these two CVEs, combined with the accidental release of a proof-of-concept before a complete patch was available, turned it into one of the most high-profile Windows vulnerabilities in recent years.

The attack abused the RpcAddPrinterDriverEx functionality to load a malicious printer driver DLL. With valid credentials and the Print Spooler service enabled, an attacker could execute arbitrary code with SYSTEM privileges. Since the Print Spooler runs by default on many Windows systems, including Domain Controllers, the potential impact was enormous.

What makes PrintNightmare worth studying isn't just the exploit. It highlights how RPC, driver loading, SMB shares, printer driver installation, and Windows privilege boundaries interact inside the operating system. It's an excellent case study in how legacy functionality, complex service design, and incomplete patches can combine into a major security incident.
