I ended my time at @Meta as a director.
But I started as an engineer on FB Chat.
Everything about it was broken — we had to rewrite it.
And while the effort to fix it is one of the projects that led to @reactjs, the most important fix was far simpler...
Here’s the full story:
—
I worked on Facebook Chat for several years, both on the front end and the infrastructure.
Before the major effort to redo the UI, FB Chat was super broken and we had no idea why.
We got tons of bug reports about Chat being broken every day, but we noticed an odd pattern in the data: the volume of reports didn’t match the volume of usage. It was time-shifted from the peaks we’d see in the US.
We didn’t know what was wrong, but we knew the code was a mess.
We set about rewriting both the front-end and the back-end in an effort to fix it.
The front-end rewrite pulled in a whole team of amazing engineers and became one of the big threads that led to @reactjs.
In the public eye, we portrayed this project as the one that ultimately fixed Chat.
And the way I’ve usually told it, fixing Facebook Chat and the birth of React are the same story.
But no framework was going to fix the worst problem with Chat.
—
During the time we were working on the Chat rewrite, we were also replacing the original Erlang backend with one written in C++.
This was probably a good move, but the problem wasn’t with Erlang either.
Our initial spec for the new backend didn’t say much about observability, but it was an important feature, and the rewrite forced us to rebuild it.
Little did we know this would lead us to the root cause of our problems…
When we finally gained insight into our deliverability data, we were able to cut it by region.
We noticed Chat was really popular in India. This was before WhatsApp, at a time when SMS wasn’t reliable.
Eventually we pinpointed a region in India where one specific DNS provider was giving out the wrong IP addresses for our Chat servers.
So when people went to use Chat, they would sometimes get a notification that they had a message, and then it would disappear.
Or they’d send a message and it would get lost. All because they were connecting to the wrong IP address.
That was it!
None of the sexy new tech we were working on was going to solve that problem.
Ever.
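For illustration only, here is a rough sketch of the kind of check that would flag this failure mode: compare the addresses a resolver hands out against the networks your servers actually live in. This is not the tooling used at the time; the function names and the example networks are hypothetical.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def unexpected_addresses(resolved, expected_networks):
    """Return the resolved IPs that fall outside every expected network.

    expected_networks is a hypothetical allowlist of CIDR strings,
    e.g. the ranges your chat servers are known to be deployed in.
    """
    networks = [ipaddress.ip_network(n) for n in expected_networks]
    return {
        ip
        for ip in resolved
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    }
```

Run from clients or probes in different regions, a non-empty result from `unexpected_addresses(resolve(host), allowlist)` would point at the resolver rather than at the application code.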
—
Instead, the solution was to build observability that allowed us to track end-to-end message delivery.
In the end, we could start with a broad cut of our data by country or web browser, and then zoom all the way in to look at what happened to a specific message for a specific user.
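A minimal sketch of that "broad cut, then zoom in" idea, assuming every hop in the pipeline emits an event tagged with the message ID. The event fields, stage names, and function names here are hypothetical, not the actual system described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryEvent:
    message_id: str
    user_id: str
    country: str   # hypothetical dimension: sender's country
    browser: str   # hypothetical dimension: sender's browser
    stage: str     # hypothetical stages: "sent", "delivered"


def delivery_rate_by(events, dimension):
    """Broad cut: fraction of sent messages that were delivered, per dimension value."""
    sent = defaultdict(set)
    delivered = defaultdict(set)
    for e in events:
        key = getattr(e, dimension)
        if e.stage == "sent":
            sent[key].add(e.message_id)
        elif e.stage == "delivered":
            delivered[key].add(e.message_id)
    return {key: len(delivered[key] & ids) / len(ids) for key, ids in sent.items()}


def trace_message(events, message_id):
    """Zoom in: the ordered list of stages recorded for one specific message."""
    return [e.stage for e in events if e.message_id == message_id]
```

A region with a low rate in `delivery_rate_by(events, "country")` tells you where to look; `trace_message` on one of its failed messages tells you how far that message got.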
Once we pinpointed that the problem was with a DNS server, the matter was resolved with a quick phone call. I don’t know what they did, but I imagine it was something like turning it off and turning it on again.
We sometimes talk about observability as if it’s enough to buy a product like Datadog and just look at the pretty graphs.
Sure, that’s a start.
But true observability is a feature that needs to be built: painstakingly, iteratively, and by definition starting with a shot in the dark.
—
These days, it has become fashionable to pooh-pooh the idea of being data-driven.
People point out that measurement can distort the phenomenon that is being observed.
They want to make processes “data-informed.”
But this seems like a silly backlash against the only rigorous standard in all of software engineering:
That we hold ourselves to an objective standard.
We measure how long things take, how many errors we encounter, how often a process successfully runs to completion.
So here’s what this experience taught me about observability:
When an issue happens in production, time-box the investigation.
Sure, take a few hours to try and figure it out by looking in the logs and inspecting the code.
But if you’re coming to the end of the day and you still don’t have a fix, then push a PR that adds logging.
The first one may be just a guess, but it will begin a process that leads to the truth.
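As a sketch of what that first "add logging" PR might contain, assuming a suspect code path that sends a message to a resolved server address: the `deliver` function, its parameters, and the log field names are all hypothetical. Only the standard-library `logging` calls are real.

```python
import logging

logger = logging.getLogger("chat.delivery")


def deliver(message_id, user_id, resolved_ip, send):
    """Wrap a suspect step with enough context to correlate failures later.

    `send` is a hypothetical callable that performs the actual network send.
    """
    logger.info(
        "delivery_attempt message_id=%s user_id=%s resolved_ip=%s",
        message_id, user_id, resolved_ip,
    )
    try:
        send(resolved_ip, message_id)
    except Exception:
        # logger.exception records the stack trace alongside the context fields.
        logger.exception(
            "delivery_failed message_id=%s user_id=%s resolved_ip=%s",
            message_id, user_id, resolved_ip,
        )
        return False
    logger.info("delivery_ok message_id=%s", message_id)
    return True
```

The guess here is that the destination address matters, so it gets logged on every attempt. If the guess is wrong, the next PR logs something else.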
And that is what we should all ultimately be striving for.
—
For more engineering tips and stories, follow me @dmwlff
