Maged Michael

57 posts

@MagedMMichael

@category_xyz, ex @facebook, @ibm. Concurrent algorithms. High-throughput. Low-latency.

Joined July 2013

103 Following · 247 Followers

Maged Michael reposted

Keone Hon@keoneHD·12 Mar

Awesome explainer from @MagedMMichael on Monad's caching mechanism, which handles a problem specific to blockchain: caching state key-value pairs for pre-finalized blocks

Category Labs@category_xyz

x.com/i/article/2031…

Maged Michael@MagedMMichael·12 Mar

Describing Monad’s key-value cache design and how it handles undecided blocks.

Category Labs@category_xyz

x.com/i/article/2031…

Maged Michael@MagedMMichael·7 Nov

@builnad @kkuehlz @_jhunsaker 4 threads & 256 fibers is the currently recommended setting. I haven't measured thread and fiber variations recently. Last I checked, 5 and 4 threads were basically equal, so we might as well go with 4 until 5 is decidedly better. Usually the difference between N and N+1 threads is only a few percentage points of throughput.

builnad@builnad·7 Nov

@MagedMMichael @kkuehlz @_jhunsaker Then, assuming we use the currently recommended specs, the most ideal option for handling 10,000 TPS would be 256 fibers and 4 threads, right? Considering the database bottleneck.

builnad@builnad·6 Nov

I have one question for @_jhunsaker. When actually configuring the monad execution binary, you can see AllowedCPUs and CPUAffinity. These determine how many CPUs are allocated for actual parallel execution. The default is cores 1-7, where cores 1 and 2 are used for io_uring and core 7 is used as an auxiliary core. This leaves 4 cores available. Looking at some options in the code, there are `nthreads` and `nfibers`. These determine how many physical threads to use on the allocated cores and reserve separate space for thread scheduling tasks. nthreads determines the actual number of cores for parallel execution. Its current default is "4," likely the optimal number considering various scenarios. I understand that using a typical entry-level server deployed across multiple clouds as a validator, aiming for decentralization, makes a 16-core server the right choice (considering BFT, OS, and RPC together). But it's said that 4 cores can handle 10,000 TPS. Would increasing the core allocation to 8 improve performance further (to something like 15,000~20,000 TPS)? So, if using a server with more cores at the same core clock speed, would parallel processing performance improve? I'm curious how much room for performance improvement Monad has here. @monad

Maged Michael@MagedMMichael·7 Nov

@builnad @kkuehlz @_jhunsaker Although we wouldn't want to go so far that running more txns in parallel with tx N makes it slower. It's more important to get tx N done than to allow tx N+500 to start.

Maged Michael@MagedMMichael·7 Nov

@builnad @kkuehlz @_jhunsaker Yes, but not always. It depends on the improvement. For example, an improvement that makes the same txn hide io latency doesn't require increasing fibers. It makes sense to increase fibers if cores are idle while tx N+256 can't run because tx N is not yet completed.

Maged Michael@MagedMMichael·6 Nov

@kkuehlz @_jhunsaker @builnad The more io latency is reduced or hidden the more throughput will improve with more threads. Potential improvements in the works could bump tx exec threads up to 5 or 6 for higher throughput.

kevin (idea guy)@kkuehlz·6 Nov

@_jhunsaker @builnad @MagedMMichael looked into bumping thread counts late last year. don't think he saw notable tput improvements. of course, evm is generic, and some workloads would certainly benefit, although most are io bound.

Maged Michael@MagedMMichael·19 Sep

@0x_eunice @CppCon Thank you, Eunice! The talks will be posted by the conference on YouTube in a few weeks. I’ll post a link here. <3

Eunice | Monad Foundation (mainnet arc)@0x_eunice·19 Sep

@MagedMMichael @CppCon Is there a recording? Would love to watch <3

Maged Michael@MagedMMichael·17 Sep

At @CppCon

Maged Michael@MagedMMichael·16 Sep

Proud to be part of the team @category_xyz

James@_jhunsaker

Monad execution client is now open source (link below). This is the result of thousands of hours of effort by the team at @category_xyz. Enjoy

Maged Michael@MagedMMichael·11 Sep

Yes, sort of. The hazard pointer technique is suitable for use in frequent operations to protect access to dynamic objects (the access can be read-only or read-write) against (less frequent) concurrent removal of such objects. The ratio of frequent to infrequent operations doesn't need to be too high. 10:1 is enough to make it worthwhile. Examples of frequent operations are hash map lookups and dynamic queue push/pops. The corresponding infrequent operations would be deletions of items from hash maps and deletions of queue segments.

shachaf@shachaf·11 Sep

@MagedMMichael This is paying a much larger cost on the writer side, though! Presumably it's only suitable for things that are written infrequently?

Maged Michael@MagedMMichael·11 Sep

Hazard pointer protection is super fast, but is it scalable too? Yes, it is. Hazard pointer protection is effectively perfectly scalable:
• Line 8: Writing to the hazard pointer almost never generates cache coherence traffic. Hazard pointers are padded and properly aligned to avoid false sharing among writes to different hazard pointers. Each hazard pointer is only written by one thread at a time (its owner) and is read very rarely (by a reclaimer). The ratio of writes (by owner) to reads (by a reclaimer) is in practice at worst on the order of 1,000:1 and more typically a lot higher (e.g., one write per lookup in a concurrent hash map vs one read per amortized reclamation of thousands of deleted items).
• Line 10: The lightweight asymmetric memory barrier doesn't generate any instructions.
• Line 12: Re-reading the pointer from the source is almost always an L1 cache hit and so it doesn't generate cache coherence traffic.
• Line 14: Perfectly scalable local computation that doesn't involve memory access or synchronization.

Maged Michael@MagedMMichael

How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:
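
The code image attached to the original post is not reproduced here. As a rough illustration only, the following is a minimal C++ sketch of a hazard pointer protection loop, not the author's code and not Folly's or the C++ standard library's API. `HazardSlot`, `protect`, `reset_protection`, `is_protected`, and `demo_protect` are hypothetical names, and the 64-byte alignment is an assumed cache-line size. The four commented steps correspond to the four operations discussed in the thread (publish, barrier, re-read, compare); the line numbers cited in the thread refer to the original image, not to this sketch. One deliberate difference: a full `seq_cst` fence is used as a portable stand-in for the lightweight asymmetric barrier, so this sketch is slower than what the post describes.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical hazard pointer slot (illustrative, not a real library type).
// Aligned to 64 bytes (assumed cache-line size) so that writes to
// different slots do not share a cache line.
struct alignas(64) HazardSlot {
    std::atomic<const void*> protected_ptr{nullptr};
};

// Returns a pointer loaded from `src` that stays safe to dereference until
// the slot is reset, provided reclaimers check slots before freeing.
template <typename T>
T* protect(HazardSlot& hp, const std::atomic<T*>& src) {
    T* p = src.load(std::memory_order_acquire);
    for (;;) {
        // Step 1: the owner thread publishes the pointer it intends to use.
        hp.protected_ptr.store(p, std::memory_order_release);
        // Step 2: order the publication before the re-read. A full fence is
        // an assumption here, standing in for an asymmetric light barrier.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        // Step 3: re-read the source pointer.
        T* q = src.load(std::memory_order_acquire);
        // Step 4: purely local comparison; retry only if the source changed.
        if (q == p) return p;
        p = q;
    }
}

// Owner side: stop protecting whatever the slot currently holds.
inline void reset_protection(HazardSlot& hp) {
    hp.protected_ptr.store(nullptr, std::memory_order_release);
}

// Reclaimer side: a removed object may be freed only if no slot holds it.
inline bool is_protected(const HazardSlot& hp, const void* obj) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return hp.protected_ptr.load(std::memory_order_acquire) == obj;
}

// Single-threaded usage example.
inline void demo_protect() {
    int value = 42;
    std::atomic<int*> src{&value};
    HazardSlot hp;
    int* p = protect(hp, src);
    assert(p == &value && is_protected(hp, p));
    reset_protection(hp);
    assert(!is_protected(hp, p));
}
```

In a real implementation the reclaimer would scan all slots once per batch of retired objects, which is what keeps reads of each slot rare relative to the owner's writes.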

Maged Michael@MagedMMichael·10 Aug

@davidtgoldblatt @paulmckrcu Operations on Folly ConcurrentHashMap use at most 3 hazard pointers, one for the bucket, 2 for hand-over-hand traversal. The iterator has 3 HPs, one to protect the current bucket, one to protect the current kv node, the 3rd is just for convenient traversal on '++it'.

David Goldblatt@davidtgoldblatt·8 Aug

@paulmckrcu @MagedMMichael Whereas it seems like we should be able to say, "OK, RCU opportunistically, then transition whatever our pointer working set is if we happen to start taking a long time" (just for the guarded item we're holding an iterator to).

Maged Michael@MagedMMichael·29 Jul

Maged Michael@MagedMMichael·10 Aug

@paulmckrcu @davidtgoldblatt Yes. This seems like a good use case for hybrid RCU/HP.

Paul E. McKenney@paulmckrcu·8 Aug

@MagedMMichael @davidtgoldblatt Suppose we have a search structure not amenable to current hazard-pointer wait-free tricks leading to nodes requiring long-term protection. Then RCU protects the search structure and hazard pointers protect the destination node. David, a hazard-pointers/RCU example for you!

Maged Michael@MagedMMichael·8 Aug

My point was the opposite. This part of the conversation started with @davidtgoldblatt wondering why nobody is doing hybrid RCU/HP. I understood that to mean: (1) enter RCU critical section, (2) handle some pointers, (3) decide that some pointers need protection beyond this critical section, (4) protect these pointers using HP, (5) exit RCU critical section. So I replied that my guess is that the performance cost of just remembering which pointers may need to be protected by HPs may be on par with the perf cost of protecting pointers with HPs from the start.

Then you, Paul, not me, brought up the similarity to read-write locks and ref counting. So I replied that while these are functionally similar, the relative performance costs are different. After some back and forth I referred to the example of (1) acquire shared mutex, (2) handle some shared pointers, (3) decide that some shared ptr refs need to be retained beyond the end of the critical section, (4) copy these shared ptrs, (5) release the shared mutex. My intent for this example was to be functionally similar to the hybrid RCU/HP case that IIUC David meant, but with different relative performance costs.

My guess is that using hybrid shared mutex / shared ptrs is faster than pure atomic shared ptrs, whereas I doubt that hybrid RCU/HP is faster than pure HP (because the relative cost of remembering which pointers may need HP protection is on par with outright HP protection in the first place, but that's not the case for loading from atomic shared ptrs, which is relatively expensive). So this is how we got into this rabbit hole. I hope I didn't make things more confusing :) x.com/davidtgoldblat…

Paul E. McKenney@paulmckrcu·8 Aug

@MagedMMichael @davidtgoldblatt OK, so this use case involves both a reader-writer lock and an explicit reference count. And so depending on what else is going on, the solution might be best served by both hazard pointers and RCU. Other use cases might be well served by one or the other. ;-)

Maged Michael@MagedMMichael·6 Aug

Copying from the same shared_ptr is thread-safe. Each copy involves an atomic increment of the use count in the control block associated with the protected object. However, we need the shared mutex to protect against concurrent modification of the shared_ptr being copied from (the modifier should acquire the lock exclusively).

Paul E. McKenney@paulmckrcu·6 Aug

@MagedMMichael @davidtgoldblatt Can two concurrent instances of read_and_protect() be passed the same mutex and shared pointer? If so, aren't we violating the shared-pointer contract? Either way, what exactly is the shared mutex protecting?

Maged Michael@MagedMMichael·6 Aug

I was thinking of something like this where it is probably faster to acquire a lock and copy a shared_ptr than to read an atomic shared_ptr without the lock.
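
The code attached to this post is not preserved in this capture. As a hedged illustration of the pattern under discussion (take a shared lock, copy a `shared_ptr`, release the lock), here is a minimal C++17 sketch. The name `read_and_protect` is taken from the reply in the thread; its signature, the writer helper `store_exclusive`, and `demo_read_and_protect` are assumptions made for illustration, not the author's code.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

// Reader: hold the mutex in shared mode just long enough to copy the
// shared_ptr. The copy atomically increments the use count, so the object
// stays alive after the lock is released. Many readers may do this
// concurrently on the same `src`.
template <typename T>
std::shared_ptr<T> read_and_protect(std::shared_mutex& mtx,
                                    const std::shared_ptr<T>& src) {
    std::shared_lock<std::shared_mutex> lock(mtx);
    return src;  // the copy is made before `lock` is destroyed
}

// Writer (hypothetical helper): modifying the shared_ptr object itself is
// not safe against concurrent copies, so it takes the lock exclusively.
template <typename T>
void store_exclusive(std::shared_mutex& mtx, std::shared_ptr<T>& dst,
                     std::shared_ptr<T> next) {
    std::unique_lock<std::shared_mutex> lock(mtx);
    dst = std::move(next);
}

// Single-threaded usage example.
inline void demo_read_and_protect() {
    std::shared_mutex mtx;
    std::shared_ptr<int> current = std::make_shared<int>(1);
    std::shared_ptr<int> held = read_and_protect(mtx, current);
    store_exclusive(mtx, current, std::make_shared<int>(2));
    assert(*held == 1);     // the reader's copy keeps the old object alive
    assert(*current == 2);  // the shared slot now points at the new object
}
```

The shared mutex here guards the `shared_ptr` variable, not the pointed-to object; the object's lifetime is handled by the reference count.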

Paul E. McKenney@paulmckrcu·6 Aug

@MagedMMichael @davidtgoldblatt This example uses both reader-writer locking and reference counting (in the guise of the shared pointer, right?), so it could go either way. Of course, having an actual sample of the code would help.

Maged Michael reposted

tunez@cryptunez·15 Jul

x.com/i/article/1945…

Maged Michael reposted

Keone Hon@keoneHD·4 Aug

This is a big milestone: half of the codebase, implementing MonadBFT, RaptorCast, blocksync, statesync, mempool, etc. is open source! This codebase is a treasure trove of performant systems engineering work. We hope this materially pushes the space forward. Step by step.

James@_jhunsaker

Monad consensus client is now open source (link below). This is the result of thousands of hours of effort by the team at @category_xyz. Enjoy

Maged Michael@MagedMMichael·4 Aug

The cat is making Monad faster and faster.
