Maged Michael

57 posts

@MagedMMichael

@category_xyz, ex @facebook, @ibm. Concurrent algorithms. High-throughput. Low-latency.

Joined July 2013
103 Following, 247 Followers
Maged Michael@MagedMMichael·
@builnad @kkuehlz @_jhunsaker 4 threads & 256 fibers is the current recommended setting. I haven't measured thread and fiber variations recently. Last I checked, 5 and 4 threads were basically equal, so we might as well go with 4 until 5 is decidedly better. Usually the difference between N and N+1 threads is only a few percentage points of throughput.
builnad@builnad·
@MagedMMichael @kkuehlz @_jhunsaker Then, assuming we use the currently recommended specs, the best option for handling 10,000 TPS would be 256 fibers and 4 threads, right? Considering the database bottleneck.
builnad@builnad·
I have one question for @_jhunsaker.

When configuring the monad execution binary, you can see AllowedCPUs and CPUAffinity. These determine how many CPUs are allocated for actual parallel execution. The default is cores 1-7, where cores 1 and 2 are used for io_uring and core 7 is used as an auxiliary core. This leaves 4 cores available.

Looking at some options in the code, there are `nthreads` and `nfibers`. These determine how many physical threads to use on the allocated cores and reserve separate space for thread scheduling tasks. `nthreads` determines the actual number of cores for parallel execution. Its current default is 4, likely the optimal number considering various scenarios.

I understand that for a typical entry-level server deployed across multiple clouds as a validator, with decentralization in mind, a 16-core server is the right choice (considering BFT, OS, and RPC together). But it's said that 4 cores can handle 10,000 TPS. Would increasing the core allocation to 8 improve performance further (to something like 15,000~20,000 TPS)? In other words, with a server that has more cores at the same clock speed, would parallel processing performance improve? I'm curious how much room for performance improvement Monad has here. @monad
Maged Michael@MagedMMichael·
@builnad @kkuehlz @_jhunsaker Although we wouldn't want to go so far that running more txns in parallel with tx N makes tx N slower. It's more important to get tx N done than to allow tx N+500 to start.
Maged Michael@MagedMMichael·
@builnad @kkuehlz @_jhunsaker Yes, but not always. It depends on the improvement. For example, an improvement that lets a txn hide its own io latency doesn't need more fibers. It makes sense to increase fibers if cores are idle while tx N+256 can't run because tx N is not yet completed.
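The "tx N+256 can't run because tx N is not yet completed" condition can be pictured with a toy model. This is only a sketch under stated assumptions, not Monad's scheduler: `ExecWindow`, `oldest_incomplete`, and `can_start` are hypothetical names, and the model assumes transactions are numbered in order and that at most `nfibers` of them are in flight, counted from the oldest incomplete one.

```cpp
#include <cassert>
#include <cstdint>

// Toy model (an assumption, not Monad's actual scheduler): at most `nfibers`
// transactions are in flight, counted from the oldest incomplete transaction.
struct ExecWindow {
    std::uint64_t oldest_incomplete;  // "tx N" in the post
    std::uint64_t nfibers;            // size of the in-flight window

    // True if transaction `tx` is allowed to start under this model.
    bool can_start(std::uint64_t tx) const {
        assert(nfibers > 0);
        return tx < oldest_incomplete + nfibers;
    }
};
```

With `nfibers = 256` and tx 10 still running, tx 265 may start but tx 266 must wait. Raising `nfibers` only helps in this model if cores are idle while that wait happens.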
Maged Michael@MagedMMichael·
@kkuehlz @_jhunsaker @builnad The more io latency is reduced or hidden the more throughput will improve with more threads. Potential improvements in the works could bump tx exec threads up to 5 or 6 for higher throughput.
kevin (idea guy)@kkuehlz·
@_jhunsaker @builnad @MagedMMichael looked into bumping thread counts late last year. don't think he saw notable tput improvements. of course, evm is generic, and some workloads would certainly benefit, although most are io bound.
Maged Michael@MagedMMichael·
@0x_eunice @CppCon Thank you, Eunice! The talks will be posted by the conference on YouTube in a few weeks. I’ll post a link here. <3
Maged Michael@MagedMMichael·
Yes, sort of. The hazard pointer technique is suitable for use in frequent operations to protect access to dynamic objects (the access can be read-only or read-write) against (less frequent) concurrent removal of such objects. The ratio of frequent to infrequent operations doesn't need to be too high. 10:1 is enough to make it worthwhile. Examples of frequent operations are hash map lookups and dynamic queue push/pops. The corresponding infrequent operations would be deletions of items from hash maps and deletions of queue segments.
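The infrequent side of that trade-off (removal and reclamation) can be sketched as follows. This is a minimal illustration and not Folly's or the C++26 hazard pointer implementation: `g_hazard`, `kSlots`, and `RetireList` are hypothetical names, and the fixed slot array stands in for whatever per-thread registry a real library uses.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct Node { int key; };

constexpr std::size_t kSlots = 4;       // hypothetical fixed number of hazard slots
std::atomic<Node*> g_hazard[kSlots]{};  // a non-null slot means "do not free this node"

// Removed nodes wait here until no hazard slot points at them.
struct RetireList {
    std::vector<Node*> retired;

    void retire(Node* n) {
        assert(n != nullptr);
        retired.push_back(n);
    }

    // Infrequent path: read every hazard slot once, then free whatever is
    // unprotected. Returns the number of nodes freed.
    std::size_t reclaim() {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::vector<Node*> in_use;
        for (auto& slot : g_hazard) {
            if (Node* p = slot.load(std::memory_order_acquire)) in_use.push_back(p);
        }
        std::vector<Node*> keep;
        std::size_t freed = 0;
        for (Node* n : retired) {
            if (std::find(in_use.begin(), in_use.end(), n) != in_use.end()) {
                keep.push_back(n);  // still protected by some reader
            } else {
                delete n;
                ++freed;
            }
        }
        retired.swap(keep);
        return freed;
    }
};
```

The scan cost is paid once per batch of retired nodes, which is why a modest ratio of lookups to deletions is enough to amortize it.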
shachaf@shachaf·
@MagedMMichael This is paying a much larger cost on the writer side, though! Presumably it's only suitable for things that are written infrequently?
Maged Michael@MagedMMichael·
Hazard pointer protection is super fast, but is it scalable too? Yes, it is. Hazard pointer protection is effectively perfectly scalable:
• Line 8: Writing to the hazard pointer almost never generates cache coherence traffic. Hazard pointers are padded and properly aligned to avoid false sharing among writes to different hazard pointers. Each hazard pointer is only written by one thread at a time (its owner) and is read very rarely (by a reclaimer). The ratio of writes (by the owner) to reads (by a reclaimer) is in practice at worst on the order of 1,000:1 and more typically a lot higher (e.g., one write per lookup in a concurrent hash map vs one read per amortized reclamation of thousands of deleted items).
• Line 10: The lightweight asymmetric memory barrier doesn't generate any instructions.
• Line 12: Re-reading the pointer from the source is almost always an L1 cache hit, so it doesn't generate cache coherence traffic.
• Line 14: Perfectly scalable local computation that doesn't involve memory access or synchronization.
Maged Michael tweet media
Maged Michael@MagedMMichael

How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:
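The code image from the posts is not reproduced here. As a separate illustration of the padding and alignment point in the thread, here is a minimal sketch of a hazard pointer slot that occupies its own cache line. `HazardSlot`, `kCacheLine`, and `same_cache_line` are hypothetical names, and the 64-byte line size is an assumption that holds on most x86-64 parts but not everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Each slot is aligned to a cache line, and sizeof is rounded up to the
// alignment, so two slots in an array never share a line.
struct alignas(kCacheLine) HazardSlot {
    std::atomic<const void*> ptr{nullptr};
};

static_assert(alignof(HazardSlot) == kCacheLine, "slot must be line-aligned");
static_assert(sizeof(HazardSlot) % kCacheLine == 0, "slot must fill whole lines");

// Helper for checking the layout: do two addresses fall in the same line?
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

With this layout, a write by one owner thread does not invalidate the line holding another thread's slot.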

Maged Michael@MagedMMichael·
@davidtgoldblatt @paulmckrcu Operations on Folly ConcurrentHashMap use at most 3 hazard pointers, one for the bucket, 2 for hand-over-hand traversal. The iterator has 3 HPs, one to protect the current bucket, one to protect the current kv node, the 3rd is just for convenient traversal on '++it'.
David Goldblatt@davidtgoldblatt·
@paulmckrcu @MagedMMichael Whereas it seems like we should be able to say, "OK, RCU opportunistically, then transition whatever our pointer working set is if we happen to start taking a long time" (just for the guarded item we're holding an iterator to).
Maged Michael@MagedMMichael·
How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:
Maged Michael tweet media
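The attached image is not available in this capture. What follows is a minimal sketch of the classic hazard pointer protect loop (publish, barrier, re-read, compare), not the author's exact listing. The function name `protect` and its parameters are assumptions. The portable `seq_cst` fence stands in for the lightweight asymmetric barrier mentioned in the thread, and it is more expensive than that barrier.

```cpp
#include <atomic>
#include <cassert>

// Publishes the pointer read from `src` in `hazard` and returns it once the
// published value is known to match the current value of `src`.
template <class T>
T* protect(std::atomic<T*>& src, std::atomic<T*>& hazard) {
    T* p = src.load(std::memory_order_relaxed);
    for (;;) {
        // Step 1: write the hazard pointer (owned by this thread only).
        hazard.store(p, std::memory_order_relaxed);
        // Step 2: order the write above before the re-read below.
        // Assumption: a full fence as a portable stand-in for an asymmetric barrier.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        // Step 3: re-read the pointer from the source.
        T* q = src.load(std::memory_order_acquire);
        // Step 4: local comparison; if unchanged, p is now safely protected.
        if (q == p) return p;
        p = q;  // the source changed, so publish the new value and retry
    }
}
```

After `protect` returns, the caller may dereference the result until it clears or overwrites `hazard`.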
Paul E. McKenney@paulmckrcu·
@MagedMMichael @davidtgoldblatt Suppose we have a search structure not amenable to current hazard-pointer wait-free tricks leading to nodes requiring long-term protection. Then RCU protects the search structure and hazard pointers protect the destination node. David, a hazard-pointers/RCU example for you!
Maged Michael@MagedMMichael·
My point was the opposite. This part of the conversation started with @davidtgoldblatt wondering why nobody is doing hybrid RCU/HP. I understood that to mean: (1) enter RCU critical section, (2) handle some pointers, (3) decide that some pointers need protection beyond this critical section, (4) protect these pointers using HP, (5) exit RCU critical section.

So I replied that my guess is that the performance cost of just remembering which pointers may need to be protected by HPs may be on par with the perf cost of protecting pointers with HPs from the start. Then you, Paul, not me, brought up the similarity to read-write locks and ref counting. So I replied that while these are functionally similar, the relative performance costs are different.

After some back and forth I referred to the example of (1) acquire shared mutex, (2) handle some shared pointers, (3) decide that some shared ptr refs need to be retained beyond the end of the critical section, (4) copy these shared ptrs, (5) release the shared mutex. My intent for this example was to be functionally similar to the hybrid RCU/HP case that IIUC David meant, but with different relative performance costs.

My guess is that hybrid shared mutex / shared ptrs is faster than pure atomic shared ptrs, whereas I doubt that hybrid RCU/HP is faster than pure HP (because the cost of remembering which pointers may need HP protection is on par with outright HP protection in the first place, which is not the case for loading from atomic shared ptrs, which is relatively expensive).

So this is how we got into this rabbit hole. I hope I didn't make things more confusing :) x.com/davidtgoldblat…
Paul E. McKenney@paulmckrcu·
@MagedMMichael @davidtgoldblatt OK, so this use case involves both a reader-writer lock and an explicit reference count. And so depending on what else is going on, the solution might be best served by both hazard pointers and RCU. Other use cases might be well served by one or the other. ;-)
Maged Michael@MagedMMichael·
Copying from the same shared_ptr is thread-safe. Each copy involves an atomic increment of the use count in the control block associated with the protected object. However, we need the shared mutex to protect against concurrent modification of the shared_ptr being copied from (the modifier should acquire the lock exclusively).
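The code image discussed in this exchange is not available, so the following is only a sketch of the pattern as described: readers copy the `shared_ptr` under a shared lock, and the writer replaces it under an exclusive lock. The name `read_and_protect` appears in the thread, but its signature here is an assumption, and `Guarded` and `replace` are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

template <class T>
class Guarded {
public:
    explicit Guarded(std::shared_ptr<T> p) : ptr_(std::move(p)) {}

    // Reader: many threads may copy concurrently under the shared lock.
    // The returned copy keeps the object alive after the lock is released.
    std::shared_ptr<T> read_and_protect() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return ptr_;
    }

    // Writer: modifying ptr_ itself requires the exclusive lock.
    void replace(std::shared_ptr<T> p) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        ptr_.swap(p);
        // The old value now lives in `p` and is released after the lock is dropped.
    }

private:
    mutable std::shared_mutex mutex_;
    std::shared_ptr<T> ptr_;
};
```

A reader that called `read_and_protect()` before a `replace()` keeps using the old object safely, while later readers see the new one.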
Paul E. McKenney@paulmckrcu·
@MagedMMichael @davidtgoldblatt Can two concurrent instances of read_and_protect() be passed the same mutex and shared pointer? If so, aren't we violating the shared-pointer contract? Either way, what exactly is the shared mutex protecting?
Maged Michael@MagedMMichael·
I was thinking of something like this where it is probably faster to acquire a lock and copy a shared_ptr than to read an atomic shared_ptr without the lock.
Maged Michael tweet media
Paul E. McKenney@paulmckrcu·
@MagedMMichael @davidtgoldblatt This example uses both reader-writer locking and reference counting (in the guise of the shared pointer, right?), so it could go either way. Of course, having an actual sample of the code would help.
Maged Michael reposted
Keone Hon@keoneHD·
This is a big milestone: half of the codebase, implementing MonadBFT, RaptorCast, blocksync, statesync, mempool, etc. is open source! This codebase is a treasure trove of performant systems engineering work. We hope this materially pushes the space forward. Step by step.
James@_jhunsaker

Monad consensus client is now open source (link below). This is the result of thousands of hours of effort by the team at @category_xyz. Enjoy

Maged Michael@MagedMMichael·
The cat is making Monad faster and faster.
Maged Michael tweet media