Redeem Grimm
@RedeemGrimmm

55 posts

Distributed Data Systems Engineer | Apache Cassandra | Apache Kafka

Massachusetts, USA · Joined October 2025
1 Following · 1 Followers
Redeem Grimm @RedeemGrimmm ·
The guarantee wasn’t “always correct now”; rather, if updates stop and communication resumes, replicas converge. This shifted correctness from strict ordering to causal reasoning and merge semantics, influencing Dynamo, Cassandra, and modern geo-distributed systems. (Continuation)
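The merge-semantics idea above can be sketched with a last-writer-wins register, one of the simplest convergence schemes (a toy illustration, not the actual Dynamo or Cassandra implementation):

```python
class LWWRegister:
    """Last-writer-wins register: each replica keeps a (timestamp, value)
    pair; merging keeps the pair with the larger timestamp, so replicas
    converge once they exchange state. Toy sketch, names are invented."""

    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so the
        # order in which replicas sync does not matter.
        self.write(other.value, other.timestamp)

# Two replicas diverge while partitioned...
a, b = LWWRegister(), LWWRegister()
a.write("v1", timestamp=1)
b.write("v2", timestamp=2)

# ...then converge once communication resumes.
a.merge(b)
b.merge(a)
assert a.value == b.value == "v2"
```

The trade-off is exactly the one in the tweet: no replica is guaranteed correct “now,” but all replicas agree after synchronization.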
Redeem Grimm @RedeemGrimmm ·
Bayou (1995) embraced this instead of fighting it: it allowed updates on disconnected replicas, then reconciled later using version histories, dependency checks, and application-defined conflict resolution. #DistributedSystem #Technology
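Bayou’s model pairs each write with a dependency check and an application-defined merge procedure that runs when the check fails at replay time. A toy sketch of that replay rule (all names and the room-booking example are illustrative):

```python
# Hypothetical sketch of Bayou-style reconciliation: a write carries a
# dependency check (its expected precondition) and a merge procedure
# (application-defined fallback if the precondition no longer holds).

def apply_write(db, write):
    """Replay one tentative write against a replica's current state."""
    if write["dep_check"](db):      # does the expected precondition still hold?
        write["update"](db)         # yes: apply the update as written
    else:
        write["mergeproc"](db)      # no: run app-defined conflict resolution

# Example: book room "A" only if it is still free; otherwise take "B".
db = {"rooms": {"A": None, "B": None}}
booking = {
    "dep_check": lambda db: db["rooms"]["A"] is None,
    "update":    lambda db: db["rooms"].__setitem__("A", "alice"),
    "mergeproc": lambda db: db["rooms"].__setitem__("B", "alice"),
}

db["rooms"]["A"] = "bob"    # a concurrent write won during reconciliation
apply_write(db, booking)    # dependency check fails, mergeproc runs
assert db["rooms"] == {"A": "bob", "B": "alice"}
```

Because every replica replays the same writes with the same checks, all replicas resolve the conflict the same way and converge.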
Redeem Grimm @RedeemGrimmm ·
Part 18 - Eventual Consistency (Bayou, 1990s) Eventual consistency emerged when systems accepted a hard truth: replicas will diverge under partitions and disconnections.
Redeem Grimm @RedeemGrimmm ·
Progress (liveness) is not guaranteed by theory (FLP), so real systems rely on timeouts + stable leadership (Multi-Paxos). Paxos isn’t about speed; it’s about making the only correct decision inevitable despite failures. #DistributedSystems #Consensus
Redeem Grimm @RedeemGrimmm ·
Phase 2 (Accept) can only move forward with a value that preserves safety. Because any two majorities overlap, once a value is chosen it can never be replaced: safety is invariant under retries, reordering, and leader changes.
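The majority-overlap claim is easy to check exhaustively for a small cluster; a quick sketch:

```python
from itertools import combinations

# Why majority quorums keep Paxos safe: any two majorities drawn from
# the same node set share at least one node, so a chosen value is always
# visible to the next proposer. Exhaustive check for a 5-node cluster.

nodes = {1, 2, 3, 4, 5}
majority = len(nodes) // 2 + 1   # 3 of 5

for q1 in combinations(nodes, majority):
    for q2 in combinations(nodes, majority):
        assert set(q1) & set(q2), "two majorities must intersect"
print("every pair of majorities intersects")
```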
Redeem Grimm @RedeemGrimmm ·
Part 17 - Paxos (Lamport, 1998) solves consensus under crash faults by enforcing quorum intersection. Phase 1 (Prepare/Promise) ensures a proposer learns the highest accepted value from a majority;
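A minimal single-decree acceptor can be sketched from the two-phase description above (illustrative only; a real implementation adds durable storage and networking):

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor (toy sketch)."""

    def __init__(self):
        self.promised = -1           # highest ballot promised so far
        self.accepted = (-1, None)   # (ballot, value) last accepted

    def prepare(self, ballot):
        # Phase 1b: promise to ignore lower ballots and report the
        # highest proposal already accepted, if any.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # Phase 2b: accept unless a higher ballot was promised meanwhile.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "rejected"

# A new proposer must adopt any previously accepted value it sees in a
# majority of promises; that rule is what makes a chosen value stick.
accs = [Acceptor() for _ in range(3)]
accs[0].accept(1, "X")                    # "X" was already accepted somewhere
promises = [a.prepare(2) for a in accs]   # Phase 1 with a higher ballot
prior = max((p[1] for p in promises if p[0] == "promise"), key=lambda t: t[0])
value = prior[1] if prior[1] is not None else "Y"
assert value == "X"                       # the proposer re-proposes "X", not "Y"
```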
Redeem Grimm @RedeemGrimmm ·
This result explains why real systems rely on timeouts, failure detectors, and partial synchrony: they intentionally weaken assumptions to make progress possible. Consensus works in practice because systems accept reality, not perfection. #DistributedSystems #Consensus #FLP
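A timeout-based failure detector is simple to sketch; this toy version (names invented) suspects any peer whose heartbeats stop arriving. Under partial synchrony the timeout eventually holds, so suspicions become accurate; under full asynchrony a slow node stays indistinguishable from a dead one, which is exactly the FLP uncertainty:

```python
import time

class TimeoutFailureDetector:
    """Heartbeat-and-timeout failure detector (toy sketch)."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_heartbeat = {}   # peer -> time of most recent heartbeat

    def heartbeat(self, peer, now=None):
        self.last_heartbeat[peer] = time.monotonic() if now is None else now

    def suspected(self, peer, now=None):
        # Suspect a peer if it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return now - self.last_heartbeat.get(peer, float("-inf")) > self.timeout

fd = TimeoutFailureDetector(timeout=2.0)
fd.heartbeat("node-b", now=100.0)
assert not fd.suspected("node-b", now=101.0)  # heartbeat is fresh
assert fd.suspected("node-b", now=103.5)      # silent too long: suspect it
```

A false suspicion here costs only liveness (a needless leader change), never safety, which is why this weakening is acceptable in practice.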
Redeem Grimm @RedeemGrimmm ·
Part 16 - Continuation The reason isn’t bad algorithms; it’s uncertainty. If messages can be delayed arbitrarily, the system can never distinguish “slow” from “failed,” so it can’t safely decide without risking disagreement.
Redeem Grimm @RedeemGrimmm ·
In a fully asynchronous system (no bounds on message delays, no synchronized clocks), perfect consensus is impossible if even one process can fail.
Redeem Grimm @RedeemGrimmm ·
Part 16 - The FLP Impossibility Result (Single Tweet) In 1985, Fischer, Lynch, and Paterson proved a result that reshaped distributed systems: 👇
Redeem Grimm @RedeemGrimmm ·
In distributed systems, bugs aren’t hidden in code alone; they’re hidden in time, ordering, and interaction.
Redeem Grimm @RedeemGrimmm ·
Part 15 - Continuation Distributed snapshots made debugging possible by capturing a consistent global state, one that could have actually occurred, without stopping execution. This idea underpins modern observability, replay, checkpointing, and failure analysis.
Redeem Grimm @RedeemGrimmm ·
While one node is logging an error, others are processing messages, retrying requests, or failing silently, with messages still in flight. - Continuation
Redeem Grimm @RedeemGrimmm ·
Part 15 - Why debugging distributed systems is hard Debugging distributed systems is hard because you can’t pause the world. There is no global clock, no single state, and no moment where all nodes agree on “now.” #distributedsystems
Redeem Grimm @RedeemGrimmm ·
Part 14 - Continuation This idea became foundational for checkpointing, recovery, debugging, and observability in real distributed systems.
Redeem Grimm @RedeemGrimmm ·
Distributed snapshots (Chandy–Lamport) solve this by capturing a consistent global state without stopping the system: they record local states and in-transit messages so the snapshot could have actually happened. #DistributedSystems #Reliability #Observability
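The marker rule behind Chandy–Lamport can be sketched for two processes with FIFO channels (a simplified illustration, not a production implementation): on its first marker a process records its local state and starts recording each incoming channel until a marker arrives on it; everything recorded in between is that channel’s in-transit state.

```python
class Process:
    """Toy Chandy-Lamport participant over simulated FIFO channels."""

    def __init__(self, name, state):
        self.name, self.state = name, state
        self.snapshot = None          # recorded local state
        self.channel_state = {}       # channel -> messages in flight at snapshot
        self.open_channels = set()    # channels still being recorded

    def start_snapshot(self, channels_in, first_marker_on=None):
        self.snapshot = self.state
        self.channel_state = {ch: [] for ch in channels_in}
        self.open_channels = set(channels_in)
        # The channel the first marker arrived on is empty by FIFO order.
        self.open_channels.discard(first_marker_on)

    def receive(self, channel, msg, channels_in):
        if msg == "MARKER":
            if self.snapshot is None:
                self.start_snapshot(channels_in, first_marker_on=channel)
            else:
                self.open_channels.discard(channel)  # channel recording done
            return
        if channel in self.open_channels:
            self.channel_state[channel].append(msg)  # in transit at snapshot time
        self.state += msg                            # deliver normally either way

p = Process("p", state=10)
q = Process("q", state=5)

# q sends 3 to p; then p initiates a snapshot and markers propagate.
p.start_snapshot(channels_in=["q->p"])             # p records 10, emits marker
q.receive("p->q", "MARKER", channels_in=["p->q"])  # q records 5
p.receive("q->p", 3, channels_in=["q->p"])         # 3 was in flight: recorded
p.receive("q->p", "MARKER", channels_in=["q->p"])  # channel q->p complete

assert p.snapshot == 10 and q.snapshot == 5
assert p.channel_state["q->p"] == [3]
```

The recorded cut (p = 10, q = 5, message 3 on the wire) is a state the system could genuinely have been in, even though no process ever paused.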
Redeem Grimm @RedeemGrimmm ·
Part 14 - In a distributed system, there is no single moment in time when you can pause everything and ask, “What is the system state right now?” Messages are always in flight, clocks aren’t synchronized, and nodes observe the world differently.
Redeem Grimm @RedeemGrimmm ·
Their result proved a hard limit: tolerating Byzantine faults requires more replicas, more coordination, and higher cost (e.g., 3f+1 nodes to tolerate f faults).
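The 3f+1 bound can be checked with quorum arithmetic: quorums of 2f+1 out of 3f+1 replicas always intersect in at least f+1 nodes, so every pair of quorums shares at least one honest replica. A quick sketch:

```python
def byzantine_replicas(f):
    """Minimum cluster size for tolerating f Byzantine faults: n = 3f + 1.
    Quorums of 2f + 1 then intersect in at least f + 1 replicas, so the
    overlap always contains at least one non-faulty node."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    overlap = 2 * quorum - n     # minimum intersection of any two quorums
    assert overlap >= f + 1      # at least one honest replica in common
    return n, quorum

# Crash faults need only a majority of 2f + 1 nodes; Byzantine faults
# need 3f + 1, which is the "higher cost" in the tweet above.
print(byzantine_replicas(1))   # (4, 3): 4 replicas tolerate 1 Byzantine fault
```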