Redeem Grimm
@RedeemGrimmm

55 posts

Distributed Data Systems Engineer | Apache Cassandra | Apache Kafka

Massachusetts, USA · Joined October 2025
1 Following · 1 Followers
Redeem Grimm @RedeemGrimmm ·
The guarantee wasn’t “always correct now”; rather, if updates stop and communication resumes, replicas converge. This shifted correctness from strict ordering to causal reasoning and merge semantics, influencing Dynamo, Cassandra, and modern geo-distributed systems. (Continuation)
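The merge-semantics idea above can be sketched with a last-writer-wins register, one of the simplest convergence schemes (a toy illustration, not the actual Dynamo or Cassandra implementation):

```python
class LWWRegister:
    """Last-writer-wins register: each replica keeps a (timestamp, value)
    pair; merging keeps the pair with the larger timestamp, so replicas
    converge once they exchange state. Toy sketch, names are invented."""

    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so the
        # order in which replicas sync does not matter.
        self.write(other.value, other.timestamp)

# Two replicas diverge while partitioned...
a, b = LWWRegister(), LWWRegister()
a.write("v1", timestamp=1)
b.write("v2", timestamp=2)

# ...then converge once communication resumes.
a.merge(b)
b.merge(a)
assert a.value == b.value == "v2"
```

The trade-off is exactly the one in the tweet: no replica is guaranteed correct “now,” but all replicas agree after synchronization.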
Redeem Grimm @RedeemGrimmm ·
Bayou (1995) embraced this instead of fighting it: it allowed updates on disconnected replicas, then reconciled later using version histories, dependency checks, and application-defined conflict resolution. #DistributedSystem #Technology
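Bayou’s model pairs each write with a dependency check and an application-defined merge procedure that runs when the check fails at replay time. A toy sketch of that replay rule (all names and the room-booking example are illustrative):

```python
# Hypothetical sketch of Bayou-style reconciliation: a write carries a
# dependency check (its expected precondition) and a merge procedure
# (application-defined fallback if the precondition no longer holds).

def apply_write(db, write):
    """Replay one tentative write against a replica's current state."""
    if write["dep_check"](db):      # does the expected precondition still hold?
        write["update"](db)         # yes: apply the update as written
    else:
        write["mergeproc"](db)      # no: run app-defined conflict resolution

# Example: book room "A" only if it is still free; otherwise take "B".
db = {"rooms": {"A": None, "B": None}}
booking = {
    "dep_check": lambda db: db["rooms"]["A"] is None,
    "update":    lambda db: db["rooms"].__setitem__("A", "alice"),
    "mergeproc": lambda db: db["rooms"].__setitem__("B", "alice"),
}

db["rooms"]["A"] = "bob"    # a concurrent write won during reconciliation
apply_write(db, booking)    # dependency check fails, mergeproc runs
assert db["rooms"] == {"A": "bob", "B": "alice"}
```

Because every replica replays the same writes with the same checks, all replicas resolve the conflict the same way and converge.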
Redeem Grimm @RedeemGrimmm ·
Part 18 - Eventual Consistency (Bayou, 1990s) Eventual consistency emerged when systems accepted a hard truth: replicas will diverge under partitions and disconnections.
Redeem Grimm @RedeemGrimmm ·
Progress (liveness) is not guaranteed by theory (FLP), so real systems rely on timeouts + stable leadership (Multi-Paxos). Paxos isn’t about speed; it’s about making the only correct decision inevitable despite failures. #DistributedSystems #Consensus
Redeem Grimm @RedeemGrimmm ·
Phase 2 (Accept) can only move forward with a value that preserves safety. Because any two majorities overlap, once a value is chosen it can never be replaced: safety is invariant under retries, reordering, and leader changes.
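The majority-overlap claim is easy to check exhaustively for a small cluster; a quick sketch:

```python
from itertools import combinations

# Why majority quorums keep Paxos safe: any two majorities drawn from
# the same node set share at least one node, so a chosen value is always
# visible to the next proposer. Exhaustive check for a 5-node cluster.

nodes = {1, 2, 3, 4, 5}
majority = len(nodes) // 2 + 1   # 3 of 5

for q1 in combinations(nodes, majority):
    for q2 in combinations(nodes, majority):
        assert set(q1) & set(q2), "two majorities must intersect"
print("every pair of majorities intersects")
```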
Redeem Grimm @RedeemGrimmm ·
Part 17 - Paxos (Lamport, 1998) solves consensus under crash faults by enforcing quorum intersection. Phase 1 (Prepare/Promise) ensures a proposer learns the highest accepted value from a majority;
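A minimal single-decree acceptor can be sketched from the two-phase description above (illustrative only; a real implementation adds durable storage and networking):

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor (toy sketch)."""

    def __init__(self):
        self.promised = -1           # highest ballot promised so far
        self.accepted = (-1, None)   # (ballot, value) last accepted

    def prepare(self, ballot):
        # Phase 1b: promise to ignore lower ballots and report the
        # highest proposal already accepted, if any.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # Phase 2b: accept unless a higher ballot was promised meanwhile.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "rejected"

# A new proposer must adopt any previously accepted value it sees in a
# majority of promises; that rule is what makes a chosen value stick.
accs = [Acceptor() for _ in range(3)]
accs[0].accept(1, "X")                    # "X" was already accepted somewhere
promises = [a.prepare(2) for a in accs]   # Phase 1 with a higher ballot
prior = max((p[1] for p in promises if p[0] == "promise"), key=lambda t: t[0])
value = prior[1] if prior[1] is not None else "Y"
assert value == "X"                       # the proposer re-proposes "X", not "Y"
```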
Redeem Grimm @RedeemGrimmm ·
This result explains why real systems rely on timeouts, failure detectors, and partial synchrony: they intentionally weaken assumptions to make progress possible. Consensus works in practice because systems accept reality, not perfection. #DistributedSystems #Consensus #FLP
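A timeout-based failure detector is simple to sketch; this toy version (names invented) suspects any peer whose heartbeats stop arriving. Under partial synchrony the timeout eventually holds, so suspicions become accurate; under full asynchrony a slow node stays indistinguishable from a dead one, which is exactly the FLP uncertainty:

```python
import time

class TimeoutFailureDetector:
    """Heartbeat-and-timeout failure detector (toy sketch)."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_heartbeat = {}   # peer -> time of most recent heartbeat

    def heartbeat(self, peer, now=None):
        self.last_heartbeat[peer] = time.monotonic() if now is None else now

    def suspected(self, peer, now=None):
        # Suspect a peer if it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return now - self.last_heartbeat.get(peer, float("-inf")) > self.timeout

fd = TimeoutFailureDetector(timeout=2.0)
fd.heartbeat("node-b", now=100.0)
assert not fd.suspected("node-b", now=101.0)  # heartbeat is fresh
assert fd.suspected("node-b", now=103.5)      # silent too long: suspect it
```

A false suspicion here costs only liveness (a needless leader change), never safety, which is why this weakening is acceptable in practice.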
Redeem Grimm @RedeemGrimmm ·
Part 16 - Continuation The reason isn’t bad algorithms; it’s uncertainty. If messages can be delayed arbitrarily, the system can never distinguish “slow” from “failed,” so it can’t safely decide without risking disagreement.
Redeem Grimm @RedeemGrimmm ·
In a fully asynchronous system (no bounds on message delays, no synchronized clocks), perfect consensus is impossible if even one process can fail.
Redeem Grimm @RedeemGrimmm ·
Part 16 - The FLP Impossibility Result (Single Tweet) In 1985, Fischer, Lynch, and Paterson proved a result that reshaped distributed systems: 👇
Redeem Grimm @RedeemGrimmm ·
In distributed systems, bugs aren’t hidden in code alone; they’re hidden in time, ordering, and interaction.
Redeem Grimm @RedeemGrimmm ·
Part 15 - Continuation Distributed snapshots made debugging possible by capturing a consistent global state, one that could have actually occurred, without stopping execution. This idea underpins modern observability, replay, checkpointing, and failure analysis.
Redeem Grimm @RedeemGrimmm ·
While one node is logging an error, others are processing messages, retrying requests, or failing silently, with messages still in flight. - Continuation
Redeem Grimm @RedeemGrimmm ·
Part 15 - Why debugging distributed systems is hard Debugging distributed systems is hard because you can’t pause the world. There is no global clock, no single state, and no moment where all nodes agree on “now.” #distributedsystems
Redeem Grimm @RedeemGrimmm ·
Part 14 - Continuation This idea became foundational for checkpointing, recovery, debugging, and observability in real distributed systems.
Redeem Grimm @RedeemGrimmm ·
Distributed snapshots (Chandy–Lamport) solve this by capturing a consistent global state without stopping the system: they record local states and in-transit messages so the snapshot could have actually happened. #DistributedSystems #Reliability #Observability
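The marker rule behind Chandy–Lamport can be sketched for two processes with FIFO channels (a simplified illustration, not a production implementation): on its first marker a process records its local state and starts recording each incoming channel until a marker arrives on it; everything recorded in between is that channel’s in-transit state.

```python
class Process:
    """Toy Chandy-Lamport participant over simulated FIFO channels."""

    def __init__(self, name, state):
        self.name, self.state = name, state
        self.snapshot = None          # recorded local state
        self.channel_state = {}       # channel -> messages in flight at snapshot
        self.open_channels = set()    # channels still being recorded

    def start_snapshot(self, channels_in, first_marker_on=None):
        self.snapshot = self.state
        self.channel_state = {ch: [] for ch in channels_in}
        self.open_channels = set(channels_in)
        # The channel the first marker arrived on is empty by FIFO order.
        self.open_channels.discard(first_marker_on)

    def receive(self, channel, msg, channels_in):
        if msg == "MARKER":
            if self.snapshot is None:
                self.start_snapshot(channels_in, first_marker_on=channel)
            else:
                self.open_channels.discard(channel)  # channel recording done
            return
        if channel in self.open_channels:
            self.channel_state[channel].append(msg)  # in transit at snapshot time
        self.state += msg                            # deliver normally either way

p = Process("p", state=10)
q = Process("q", state=5)

# q sends 3 to p; then p initiates a snapshot and markers propagate.
p.start_snapshot(channels_in=["q->p"])             # p records 10, emits marker
q.receive("p->q", "MARKER", channels_in=["p->q"])  # q records 5
p.receive("q->p", 3, channels_in=["q->p"])         # 3 was in flight: recorded
p.receive("q->p", "MARKER", channels_in=["q->p"])  # channel q->p complete

assert p.snapshot == 10 and q.snapshot == 5
assert p.channel_state["q->p"] == [3]
```

The recorded cut (p = 10, q = 5, message 3 on the wire) is a state the system could genuinely have been in, even though no process ever paused.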
Redeem Grimm @RedeemGrimmm ·
Part 14 - In a distributed system, there is no single moment in time when you can pause everything and ask, “What is the system state right now?” Messages are always in flight, clocks aren’t synchronized, and nodes observe the world differently.
Redeem Grimm @RedeemGrimmm ·
Their result proved a hard limit: tolerating Byzantine faults requires more replicas, more coordination, and higher cost (e.g., 3f+1 nodes to tolerate f faults).
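The 3f+1 bound can be checked with quorum arithmetic: quorums of 2f+1 out of 3f+1 replicas always intersect in at least f+1 nodes, so every pair of quorums shares at least one honest replica. A quick sketch:

```python
def byzantine_replicas(f):
    """Minimum cluster size for tolerating f Byzantine faults: n = 3f + 1.
    Quorums of 2f + 1 then intersect in at least f + 1 replicas, so the
    overlap always contains at least one non-faulty node."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    overlap = 2 * quorum - n     # minimum intersection of any two quorums
    assert overlap >= f + 1      # at least one honest replica in common
    return n, quorum

# Crash faults need only a majority of 2f + 1 nodes; Byzantine faults
# need 3f + 1, which is the "higher cost" in the tweet above.
print(byzantine_replicas(1))   # (4, 3): 4 replicas tolerate 1 Byzantine fault
```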