Tuesday, October 19, 2021

Strange Loop 2021 slides and transcript: “Functional distributed systems beyond request/response”

A short talk at Strange Loop 2021 in St. Louis, MO (October 1, 2021).

Video

Slides and transcript

Hi everyone, thank you all so much for coming to this talk. I'm so happy we've survived this year. Hope you're all enjoying Strange Loop so far — it's so rare for us to be able to share space and mindspace, so thank you all for making this happen, against the odds. I'm Melinda, and I'm here to talk about distributed systems beyond request/response. (This is where I tweet about my partner's Lego.)

I work at eggy, where we’re working on making table-stakes distributed systems accessible to everyone. We’re also working on trying to scale carbon capture to gigatons of carbon dioxide per year by 2030. Please talk to me afterward in person or on Slack if you’re interested! Full disclosure, I’m currently nursing a brain injury from an unexpected fistfight, so please forgive me if I’m less than articulate these couple days. And before this year I worked at VSCO and some other places.

If we’re writing software that touches the internet today, there’s no getting around it: we’re building a distributed system.

Some parts of our software are running on mobile devices, or in a web browser, maybe; some parts are running on the server — probably many parts are running on many servers.

(And this is true even if we don’t have “servers”. If we have multiple computing devices and want them to share underlying data in any way, like, say, airdropping photos back and forth to have all our photos available on a device, that’s also a distributed system.)

And so all of these parts are calling each other over a network, which may or may not be working, or might be running out of batteries or have an exploded hard drive, at any point in time.

But our software is expected to stay up and running perfectly, even though any of its parts might be gone.

That’s difficult. Internet-based software is inherently distributed, and distributed systems are difficult to design, build, understand, and operate.

But it’s totally manageable if we understand the constraints of this context, and if we build the tools to make this tractable; if we build the ways of reasoning to make this context feel natural.

When we build distributed systems, we’re designing for three main concerns:

First, reliability: how to make sure the system continues to work correctly and performantly, when we have hardware or software faults, or human error.

Maintainability: how to make sure that all the different people who are going to work on our system over time can understand and work on it productively.

And scalability: how to make sure we have reasonable ways to deal with our systems growing.

But one overarching concern is how we keep complexity under control.

When we start building something, it can be simple and delightful and clean; but as our projects get bigger, they often become complex, and difficult to understand.

And this complexity slows down everyone who needs to work on the system.

We end up with tangled dependencies, hacks or special-casing to work around problems, an explosion in how much state we’re keeping — I’m sure you’ve seen all of these things.

One way we try to manage complexity, that’s been in vogue in our industry the last few years, is by breaking our software down into independently-deployable services.

This has many advantages: services can be worked on in parallel, each service can focus on a single domain, we have flexibility in how we solve larger problems, and so on.

But this is only true if we decompose our software in the right way, and use patterns that fit our context.

If we aren’t careful, we could end up with software that’s still tightly coupled, but now has an unreliable network between each piece of business logic, and is hard to deploy — and impossible to debug.

Okay, so, why is this?

One key property of our software architecture is the basic interaction style we choose — how our microservices interact with each other.

(Because we’ve broken our software down into pieces now, our system only works when multiple services work together.)

The default way that we typically do interaction nowadays is request/response, using synchronous network calls.

This model tries to make requests to a remote network service look like calling functions in a programming language, within the same process.

This seems convenient at first, but seeing remote calls as function calls is fundamentally flawed.

A network request is really different from a local function call.

A local function call is predictable and either succeeds or fails, depending on parameters that are totally under our control.

A network request is unpredictable: it could return successfully, albeit much more slowly than a function call. It could return successfully, but way more slowly — the network could be congested or the remote service could be overloaded, so it might take 20 seconds or 5 minutes to do the exact same thing. The request or response could get lost due to a network problem, or the remote service could be unavailable, or we could hit a timeout without receiving a result — and we would have no idea what happened.

If we don’t receive a response from our remote service, we have no way of knowing whether the request got through or not. Maybe the requests are actually getting through, and only the responses are getting lost — so it’s probably not safe for us to retry it, unless we’ve built in a mechanism for deduplication.

This is a rough environment to be in, really different from running on a single computer.

And because of this, our distributed software that interacts using request/response can be incredibly hard to reason about.

Imagine we have on the order of 100 or 1000 services communicating with each other in a point-to-point way: imagine how many different point-to-point connections that is.

And this is what a large distributed service architecture can look like. This is (essential streaming service and sponsor) Netflix’s architecture circa 2015, the series of lines on the right, from a re:Invent talk showing a network of services and their synchronous, run-time dependencies on other services. (Some people refer to this representation as a “death star” diagram.)

We can see that as the number of services grows over time, the number of synchronous interactions grows with them somewhere between linearly and quadratically.

And if we zoom into one service, this was the call graph for a single Netflix service, from the same talk. (Sorry to keep talking about this one example from 2015, it just had lots of nice diagrams.) This was the list of list of movies service, and these were all the services it had to talk to in order to return a response.

We can see that this one service depends on many other ones, and any of these n dependencies being down can cause availability issues for our one service.

And apart from availability, we have latency which grows with the depth of the dependency graph; we probably need to coordinate read times between when each of these is querying its datastore — and we are likely going to find it extremely difficult to reason about the state of the system at any point in time.

There’s a lot going on —

So much so that the only reasonable approach is often not to reason about this dependency graph at all, and to just work to make every dependency significantly more available than necessary,

A service can’t be more available than the intersection of all of its critical dependencies, so for example, if we want our service to offer 99.99 percent availability, then we can just make our critical dependencies much more than 99.99 percent available.

This approach is discussed in depth in this paper from engineers at Google.

Or we could wait and see how this system performs in practice, and diagnose issues as they arise.

Which in practice is fun, like a murder mystery, because it entails digging through tons of complexity, sometimes while a pager is going off next to me. But it’s suboptimal.

We can also do things to protect our request/response services, like: make automatic request retries when we don’t get a response back, with backpressure; add caches in between clients and servers; have multiple redundant copies of services that can take requests when one copy is broken, with readiness and liveness checks to route to the right ones; dynamic service discovery, load balancing, circuit breaking, and other traffic control to make this all possible; service meshes like Linkerd and Istio to make this more transparent to the user.

These are standard patterns for software that runs on the Internet. But these are standard patterns because trusting request/response to work, in a hostile environment like the cloud, is inherently risky.

So much of the mind share in the microservices landscape — the libraries and tooling, the documentation, blog posts and books and diagrams, assumes this model: of services calling each other synchronously.

And I think it’s a little unfortunate that we so often conflate the core values and goals of microservices with a specific model — request/response — for their implementation.

(This is also a hostile environment.)

But, what if, instead of patching these solutions for resilience on top of request/response, we could change the problem to a simpler one, by not binding our services together with synchronous ties?

I see this as analogous to the divide between imperative and functional programming.

The request/response model of service communication matches our sequential, imperative programming model of making in-place state changes on a single computer. And we’ve seen that it isn’t perfectly suited to a distributed environment.

Functional programming, in contrast, describes behavior in terms of the immutable input and output values of pure functions, not in terms of mutating objects in place.

If we model the way our services communicate more functionally, what would our systems look like?

We’ve seen in the last few years how thinking about things in a more functional way helps in other parts of the stack.

On the web, we’ve had reactive frameworks like React and Redux and Vue (and Elm before that) that almost everyone has shifted to after seeing how they can simplify things.

On mobile, the newest generation of UI frameworks on both iOS and Android is functional and reactive, and we’ve had libraries for reactivity in business logic for a while.

On the infrastructure side, we’ve seen how declarative APIs like Kubernetes, infrastructure-as-code tools like Terraform, and deployment via git make things much easier to reason about.

But we’re still early in adopting this in backend application development.

What would it look like if we extended the functional programming analogy to a microservices architecture?

This is the idea behind event-driven architecture.

Like functional programming, event-driven architecture lets us know, consistently, the state of a system, as long as we know its input events.

There are lots of different definitions of the details of what can be called an event-driven system, but the common piece is that the thing that triggers code execution in an event-driven system doesn’t wait for any response: it just sends it and then forgets about it.

So in this diagram, on the left, there’s a request and response: when a request is received, the receiving server has to return some type of a response to the client. Both the client and the server have to be alive, healthy, and focusing on each other in order for this to work.

On the other hand, on the right, there’s an event-driven service: the service that consumes the event still does something with it, yes, but it doesn’t have to send any response to the thing that originally created the event that triggered it. Only this service has to be alive for this code to be run; it doesn’t depend on other services being there. It’s not coupled to the thing that produced the event; it doesn’t have to know about the producer at all.

In an event-driven service, we center the transformation (functional), not the interaction (stateful).

And the way this usually looks in practice is that we use a message broker or distributed log to store these events. These days this is usually Kafka or Kinesis or something like that.

Events are split into different streams called topics, which get partitioned across a horizontally-scalable cluster. Each of our services can consume from one or more topics, produce to one or more topics, or do both or neither.

And when we compose many of these event-driven services together and have them respond to inputs from the outside world, this makes up a reactive system.

But so moving to an event-driven, more functional and data-centric model, is a really different way of having programs interact. And if we’re going to make this change, we’d better have some good reasons.

The biggest reason, I think, is the decoupling and easy composability — the modularity — that this gives us.

By getting rid of these point-to-point synchronous flows, our system can become much simpler.

Instead of having the work to add things scale quadratically, because every new thing has to be hooked up to every existing thing…

…it becomes a simpler task to plug new services in, or change services that are already there.

We can break these long chains of blocking commands, where one piece being down at any given time means all of these are unavailable.

And we do this by using a distributed log — a partitioned, append-only log — that all producers and consumers use as a central pipeline, that provides strong ordering semantics and durability guarantees.

And zooming out to a whole organization, these design principles let us build composable systems at a large scale.

In a large org, different teams can independently design and build services that consume from topics and produce to new topics.

Since a topic can have any number of independent consumers, no coordination is required to set up a new consumer. And consumers can be deployed at will, or built with different technologies, be it different languages and libraries or different datastores.

Each team focuses on making their particular part of the system do its job well.

As a real-life example, this is how this worked at VSCO, where I worked before this year. (VSCO is a photo- and video-sharing consumer platform where users have their own profile, and can follow other users and get their content in their feed, you know — a social network.)

What happens if a user follows another user in the app? The first thing that happens is they hit a synchronous follow service, which then tries to get that information into the distributed log (Kafka) as soon as possible.

And once it’s there, it can be used for many things: it gets consumed by the feed fanout worker so the follower can get the followee’s posts injected into their feed; it gets consumed by the push notification pipeline so the followee can see they’ve gotten a new follower; and so on.

Each of these downstream services has one or several defined input message types, and a defined output message type or types.

And the key is that we can keep on plugging more consumers. We could add monitoring, so we can see if the rate of follows goes way down (and business metrics might be affected by a change in our app design or something like that) — or if it goes way up, in case something weird is going on.

We could add an abuse prevention mechanism, to see if something weird is going on; we could add analytics in case we need to make a report to Wall Street; and so on.

All these systems can be developed independently, and connected and composed in a way that is and feels robust.

And just as functional languages let us divide a large problem into subproblems and subsubproblems then combine these problem solutions together to solve the original problem and other ones, using event-driven services lets us glue granular functions together in flexible ways.

We can localize knowledge about the details of each computation, in analogy to higher-order functions. And we can perform a computation exactly and only when needed, because an event is created only when something changes, which is analogous to lazy evaluation.

(In contrast, in a request/response system, we might need to poll a service dependency (and have it perform a query) a thousand times to receive a single change.)

A side benefit of this is that we get asynchronous flows for free.

Often, there are computations that, even in the request/response system, we would want to happen outside of the user path — or even more infrequently, like in an hourly or daily batch job.

It used to be that we’d need to do a bunch of orchestration to set up batch jobs with say Airflow or something, but if we handle the real-time case with events, then our offline jobs are just like any other consumer, and work by default.

And this functional composition makes things easier to debug and trace too.

We have, by definition, a central, immutable, saved journal of every interaction and its inputs and outputs, so we can go back and see exactly where something failed.

To get this with request/response, we’d have to invest lots of time and money in a full distributed tracing system — and even then we might not see where latencies are hidden.

Another reason event-driven systems are helpful is that they can provide scale by default, and make some of the patterns we use for resilience in request/response service architectures unnecessary.

For example, the distributed log acts as a buffer if the recipient is unavailable or overloaded, and automatically redelivers messages to a process that’s crashed, and prevents messages from being lost; so this makes retries with backpressure and healthchecks less necessary.

The log prevents the sender from needing to know the IP address and port number of the recipient, which is particularly useful in a cloud environment where servers go up and down all the time; so we don’t need complex service discovery and load balancing as critically.

Fifth, if we’re writing an application today, we’re likely to want to use several different data storage systems, to serve different use cases.

There’s no single “one-size-fits-all” database that’s able to fit all the different access patterns efficiently.

This is sometimes called “polyglot persistence”.

For example, we might need users to perform a keyword search on a dataset, and so we need a full-text search index service like Elasticsearch.

We might need a specialized database like a scientific or graph database, to do specialized data structure traversals or filters.

We might have a data warehouse for business analytics, where we need a really different storage layout, like a column-oriented storage engine.

When new data gets written, how do we make sure it ends up in all the right places?

We could write to all the databases from synchronous microservices — dual writes, or triple or quadruple writes — but we’ve seen how that’s fragile. And it’s impossible to make dual writes consistent without distributed transactions, which you can read about in detail in this paper from Martin Kleppmann and collaborators.

A more provably correct way of doing this is with event-driven systems. There are two main patterns we can choose from to do this in an event-driven way.

First, we could do this with database change capture (a database-first approach), or we could do this with event sourcing (an event-first approach).

The database change capture pattern is the one I’ve used more often in practice, and it’s an old concept.

The idea is that we let applications subscribe to a stream of everything that’s written to a database — inserts, updates, and deletes.

Then, we can use that feed to do the things listed earlier: update search indexes, invalidate caches, create snapshots, and so on.

There are open-source libraries now to help get the feed of every database change into a broker like Kafka where consumers can access it — for example, there’s Debezium, which uses Kafka Connect. And databases are increasingly starting to support change streams as a first-class interface.

At VSCO, because we started doing this back in 2016 before there were common tools for this, we wrote our own Golang service called Autobahn to get the database change feed into Kafka, by tailing the MySQL binary log and MongoDB oplog directly. And we use something similar for our relational databases at eggy.

Going back to the follows use case for a social network, the way that the event gets into the distributed log in the first place is that the follows service writes to an online transactional database, say something like MySQL or MongoDB.

Then, our database change capture service writes every change in that database to the distributed log, where any number of consumers can do whatever they want with it.

Sidebar: this ends up being a nice way to migrate off a monolith — we can let the monolith keep doing the initial database write, suck those writes into our event store with database change capture — then pull out all the reads, and all the extra stuff the monolith does for every write, into composable workers reading off the event store.

This helps us fix one of the most common pitfalls in microservice architecture, where we have a bunch of independently deployable services, but they all share a single database.

This kind of broad, shared contract makes it hard to figure out what effect any changes might have.

We would rather have our data not be so coupled, but it’s hard to unbundle it.

Database change capture can make it much easier.

We also can save even richer data with event-driven systems, if we want to.

Writing data directly to the log can produce better-quality data by default than if we just update a database.

For example, if someone adds an item to their shopping cart and then removes it again (we’re buying furniture), those add and delete actions have information value.

If we delete that information from the database when a customer removes an item from the cart, we’ve just thrown away information that might have been valuable for analytics and recommendation systems.

This approach, where we write events first and directly to the log, is usually given the name event sourcing, which comes originally from the domain-driven design community.

Like database change capture, event sourcing involves saving all changes to the application state as a log of change events.

But instead of letting one part of the application use the database in a mutable way and update and delete records as it likes, in event sourcing, our application logic is explicitly built on the basis of immutable events.

And the event store is append-only, and usually updates or deletes are discouraged or prohibited; and then to get a copy of our current working tree we’d aggregate over all the diffs over time.

In a fully event-sourced system the current state is derived.

Both this and the mutable version are valid approaches, but they come with slightly different tradeoffs in practice. Regardless of which one we use, the important thing is to save facts, as they are observed, in an event log.

In practice, I haven’t personally used this pattern as much as the database-first change capture approach, because the combination of fast transaction-protected state updates and totally ordered downstream effects has been hard to beat — but the event sourcing approach is just as valid if we have different design constraints, and possibly more theoretically beautiful.

And one more thing falls out of all the other reasons, which is that this makes integrating machine-learned logic very straightforward.

Nowadays, we often want to plug machine-learnable programs into different parts of our software, where we can get better results than by hand-coding a program — and this will be increasingly true in the future.

Say that we want to decide on the fly whether our request is from a legitimate actor, or send a user through a different onboarding flow depending on how they behave, or tune our database index as new data of different shapes comes in.

Inserting ML like this will really just not be tractable if we do this in a point-to-point, request/response way.

(Many of us have probably been asked to insert a machine learning inference step into an existing system — and had to build a ton of hacky infra around it to get it to work. Which is frustrating, and not scalable — our software just becomes totally incomprehensible.)

But if we’ve already switched to using an event-driven system, we have pluggable data integration is already there for us, and adding ML is just adding another consumer or producer.

Lastly, and this is more speculative, but where I hope we can take this one day, is — can we thread this all the way through to the clients?

Right now, clients have to poll for changes to find out when something they’ve loaded has changed. So the fastest we can possibly get an update is our polling interval, which is probably too long to work for any low-latency use cases.

But if we can thread event-driven architecture end-to-end, which we almost have the right tooling for, we can push event-driven notifications all the way through the stack to the browser or the mobile app or the forklift or Saildrone or whatever.

If we do that, we won’t have stale data any more, and we won’t need to poll every second.

All that said, there are some things to watch out for when writing your event-driven systems.

A key decision is how we want to structure our event data, since the events are now our input and output interface, the way that our request and response types were our interface in a request/response system.

Our systems will change over time, as features are added or changed, and we’ll need to handle backward and forward compatibility in order to make sure we can update to newer versions without breaking our applications.

Just like with request/response, we’ll most likely want to use a schema language with evolvability rules and compiler checking of constraints, like protocol buffers, Avro, or Thrift.

At eggy, we use protocol buffers plus a custom codegen plugin, that lets us define input and/or output types for each of our event-driven services. These types then generate bindings that handle the standard worker functionality, check constraints, and let us write tests — kind of like gRPC, but for asynchronous services instead of synchronous ones. (Which if there’s any interest in we can put on Github.)

I would really just recommend using some typed data serialization system here, where we can ensure safe schema evolution in a developer-friendly way, and not trying to manage changes on top of JSON or XML or something.

Another piece is concurrency management: if we have distributed data management, how do we make sure we read all the implications of our writes?

The event-driven service approach guarantees that every event will eventually be processed by every consumer, even if there are crashes, but there’s no guarantee about the maximum time until an event is processed.

This means our data stores will eventually converge on a consistent state, but they might be inconsistent when they’re red at one point in time.

Request/response doesn’t exactly solve this problem either, but it’s more obvious when we’re using an event-based approach, and we have to design explicitly for eventual consistency.

In our event-based systems, we depend heavily on the distributed log to provide strong semantics around total ordering.

Distributed logs do this really well, but we have to be careful about how we structure and partition our logs, in order to scale without losing ordering.

In particular, there’s a tradeoff between using a single partition for a topic, which limits the total scalability of the system but guarantees total ordering within the topic; and breaking the topic into independent partitions by entity, which guarantees ordering within an entity’s events but not ordering between entities.

If we know that our data model is more complex than can be handled with a single partitioning key, we can handle interactions between partitions by breaking them down into multiple event-processing stages.

Relatedly, we have to consider the notion of the identity of our entities very carefully. All of our messages need to be uniquely identified in order to detect duplicates, implement optimistic concurrency control, manage deletes, and so on.

In general, the event-driven approach does require more up-front design, with a payoff of lower cognitive load and simpler changes in the future. (The way of the future.)

Just like in request/response systems, privacy and data deletability are critical, given GDPR, CCPA, and other privacy regulations, and also just doing right by our users.

When we delete a single piece of data, we want to make sure to delete all pieces of data derived from it — which might now be in many datastores.

Event-driven processing can make it easier to perform this deletion, but might also make it more likely that we have more data to delete.

And with all of these decisions, we need to make sure we don’t conflate the underlying fundamentals with the tools we choose.

It’s not about Kubernetes or Kafka. Tools and frameworks change quickly, but the fundamentals stay mostly the same.

The biggest gotcha though — though it’s becoming less and less true — is that event-driven interactions are not the norm yet.

And so there are fewer libraries and tools and blog posts, and example repos about this, than about request/response architectures.

But many things we use and appreciate today weren’t the default a few years ago either, and we’ve seen now how they’ve made our lives easier.

I didn’t have time to show a demo of this here, but I put one up on Github. There’s code and a short video (almost lightning-length). It’s a full-stack (web and backend) breakfast delivery service with two different architectures: first, request-response microservices, then event-driven microservices.

Check it out if you don’t already have an event-driven system and want to try to run one, or just see one running; or if you’re interested in breakfast.

And if you’re interested in lunch, I’ll let you go now. Thanks so much for your time, and have a great Strange Loop!