Published on July 24, 2024
Effective distributed system design improves the operational efficiency and cost advantage of your infrastructure while increasing the performance of individual workloads.
In particular, architectures that rely on microservices communication benefit from top-down planning and from choosing the most appropriate design pattern for each use case.
This post summarizes the critical elements and design patterns used in distributed systems, and explains the benefits (and challenges) each introduces into your infrastructure.
We'll then discuss workflows, the design strategy that we implement in the systems that host our Contentful platform, and how it solves many of the common microservices problems we encounter (especially around data consistency).
Distributed system design is the process of researching and designing the best possible implementation of a distributed system for a given project.
Distributed systems spread their processes and workloads across multiple physical nodes. They achieve this by splitting an application into smaller services (or single-purpose microservices), each with a focused business function or task that can be reused across different applications.
The primary advantages of a microservices-based distributed system over a monolithic architecture are scalability and flexibility: You can scale components individually and isolate heavy workloads so that they don't affect the performance of other services.
Additionally, microservices-based systems enable continuous availability and better operational efficiency: If a node fails, the system can route traffic to another that is running the same service so that the system as a whole can keep running. Distributed system designs can also eliminate bottlenecks by increasing the number of nodes for a specific service.
These benefits help ensure that your application is efficient, always available, cost-effective relative to the current user demand, and performant and responsive to your end users.
Before we get into the major challenge with microservices-based distributed systems, here’s a quick refresher on the main distributed system architectures and patterns, and some of the obstacles we've encountered with them:
Microservices architecture splits applications into small services that operate independently, each performing a specific task and communicating via APIs. This allows for scalability and flexibility, at the potential cost of complexity and microservices communication issues affecting data consistency.
Service-oriented architecture, a predecessor to microservices architecture, splits applications into services, each performing a business function (less granular than the specific tasks performed by microservices). This allows services to be reused across applications and is less complex to manage, but it lacks the fine-grained scalability that microservices offer.
Event-driven architecture includes the pub-sub pattern and event sourcing, and structures an application so that rather than services and microservices communicating directly, events and state changes trigger the next task in a chain of events that lead to the intended outcome. This is highly responsive and keeps components decoupled, but introduces further complexity in maintaining consistency of data between nodes.
The Saga pattern is designed to maintain data consistency for complex operations that span multiple nodes. It focuses on transactionality: Each step has a compensation step, which will run if a step fails, to roll it back. Saga can be coordinated by either orchestration or choreography, depending on whether you want a centralized or decentralized approach.
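To make the compensation idea concrete, here is a minimal orchestration-style sketch in Python. The names `SagaStep` and `run_saga` are made up for illustration, and the sketch assumes that a step which raises has left no partial state behind, so only the steps that completed need compensating.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # does the work; raises on failure
    compensate: Callable[[], None]  # undoes the work done by `action`


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Undo only the steps that actually finished, newest first.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

In a choreographed Saga there would be no central `run_saga` loop; each service would instead react to the previous service's success or failure event.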
In addition to these architectures, there is the workflows pattern (which is based on Saga) that we have successfully implemented at Contentful, which solves many of the above problems. But first, let’s look at the issue that led us to implementing workflows.
You'll crash and burn if you try to build the perfect distributed system from scratch (I know from experience — Gall’s Law holds true in practice!), but this doesn’t mean you should be wary of microservices-based approaches. Instead, you should recognize the challenges microservices can pose if not implemented properly, and the potential cascading effects on your application. Most commonly, issues around microservices communication, and maintaining data consistency, can quickly upset your plans and require a trip back to the drawing board.
I won't rehash what my colleagues have already stated: Implementing microservices from scratch is a heavy task for teams of any size, because of the complexity, partial/cascading failures, issues with consistency and concurrency, and the need to properly handle timeouts and retries — but the payoff is well worth it as demand scales with your business.
One proven way to avoid early-stage microservices headaches is to simply start with a monolith, prove your concept, write your code, and then later, only when demand necessitates, break it down into microservices gradually. This makes it possible to focus on isolating a specific part of the application, thoroughly test it, and only then move on to the next, rather than trying to spin multiple plates.
This was the scenario at Contentful. Our monolith was driving our functionality, but we needed to scale and make our codebase more extendable, so we began our shift to a distributed system design. As an example of what kind of tasks we needed our system to perform, let's look at what the system needs to do when creating a Contentful space (a workspace that contains all content and media for a Contentful project):
First, create the record for keeping track of the new space's unique key and name.
Then, find and assign a location in our set of databases to store the content that is added to the space.
Next, create records that define permissions, your content allowance, etc.
Throughout all of this, route the space's unique key correctly, so that when you want to find the space and retrieve permissions, metadata, contents, etc. the system can find all of these.
All of this has to happen after you've pressed the button to create a space, and before we tell you to go ahead and use it (and we can't keep users waiting too long!).
Our first foray was an event-driven architecture with a message broker, but we soon ran into problems: After moving from our synchronous monolith, our service-based approach ran into the inescapable fact that no component has 100% guaranteed uptime (and components don't all fail at the same time, either!). Even with the briefest of single-component outages, this is a problem at scale with millions of users — huge amounts of data would enter an inconsistent state and need to be reconciled, in our case with a cron job that was rapidly reaching the point where it would not complete before it was scheduled to run again.
Event-driven seemed like the right direction as we wanted low latency, but it doesn't consider the fact that different processes need to give different guarantees: Some processes (like assigning a location to store data) must complete before a space can be used, while others (like an email notification) can be delayed without negatively impacting the user. If a process fails, subsequent processes have no way of knowing about the failure, leading to disruption. We did investigate using event-sourcing with Kafka, as well as AWS EventBridge, to get around this, but we had concerns about implementation costs, ownership, and the noisy neighbor problem.
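One way to picture the difference in guarantees is to tag each task with whether the user is blocked on it. The sketch below is hypothetical (the `Task` type, the `must_complete` flag, and the in-memory queue are assumptions, not part of any system described here): must-complete tasks run inline and fail the request if they fail, while everything else is parked for later.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Task:
    name: str
    run: Callable[[], None]
    must_complete: bool  # True: the user cannot proceed until this succeeds


def handle_request(tasks: List[Task], background: Deque[Task]) -> List[str]:
    """Run must-complete tasks inline; park the rest on a background queue."""
    finished: List[str] = []
    for task in tasks:
        if task.must_complete:
            task.run()  # a failure here propagates and fails the request
            finished.append(task.name)
        else:
            background.append(task)  # picked up later, e.g. by a worker
    return finished
```

With this split, a delayed email notification never holds up the request, whereas a failed storage assignment is visible immediately.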
There's one key thing I keep in mind when designing a system: What does the user expect? This is what determines what I will prioritize and optimize for performance. Tasks that must guarantee completion before the user flow can progress must come first, followed by those that should happen in a time frame the user finds reasonable.
As an analogy, you'd find it quite reasonable to have to wait while your electric car charges, but if, once your battery is full, you were forced to queue for even a few more minutes to pay an attendant, you'd become frustrated (why doesn't this place just take automatic payments like everywhere else?). You'd be doubly frustrated if the queue wasn't moving because the attendant forgot to call the next person up to pay, and everyone had just been standing around for no reason.
Similarly, users are happy to wait a couple of seconds while we spin up their Contentful space, but if they opened up their new space and couldn't create anything in it until other background tasks had completed, they'd be (justifiably) annoyed. And if the process had fully completed and the space was ready to use, but the loading screen never progressed, they might think something was broken and waste their time contacting our support channel.
Workflows is a distributed system design pattern that focuses on the user journey. The system runs tasks serially as steps in the workflow, either individually or in groups that execute their members in parallel. This means that tasks that rely on low latency between them, or that can be run concurrently to increase speed, can work in a group, but you can maintain reliability and consistency by ensuring that things occur in the correct sequence. This approach also maintains the ability to scale services and groups individually.
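The serial-steps-with-parallel-groups shape can be sketched in a few lines of Python with `asyncio`. This is only an illustration under assumptions: `run_workflow` and `make_task` are hypothetical names, and a real workflow engine would also persist state between steps.

```python
import asyncio
from typing import Awaitable, Callable, List, Union

Task = Callable[[], Awaitable[str]]
# A workflow step is either one task or a group of tasks run concurrently.
Step = Union[Task, List[Task]]


async def run_workflow(steps: List[Step]) -> List[str]:
    """Run steps one after another; run the members of a group in parallel."""
    results: List[str] = []
    for step in steps:
        if isinstance(step, list):
            # Group: start all members together and wait for all of them.
            results.extend(await asyncio.gather(*(task() for task in step)))
        else:
            results.append(await step())
    return results


def make_task(name: str, delay: float = 0.0) -> Task:
    """Hypothetical task factory: sleeps for `delay`, then returns its name."""
    async def task() -> str:
        await asyncio.sleep(delay)
        return name
    return task
```

A call such as `asyncio.run(run_workflow([make_task("a"), [make_task("b"), make_task("c")], make_task("d")]))` runs `a` first, then `b` and `c` together, and starts `d` only once both have finished.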
Combining this with transactionality and rollbacks, we solve the data consistency issue. This works similarly to how a database runs multiple queries in a transaction: If one fails, they all roll back. When a workflow step fails, it causes the orchestrator to send out another call telling all components to roll back in reverse order.
This design pattern keeps the user in focus, optimizes for reliability, and also gives engineers an improved debugging experience compared to event-driven architectures: As the steps are executed in a clear order, there are distinct failure points. These benefits do come at the cost of latency, however.
Probably the best way to demonstrate workflows is to outline how we do it at Contentful. To replace our event-driven systems with workflows, we use state machines based on AWS Step Functions. Following the Saga pattern, whenever we migrate a component to our workflows architecture, we break it into steps and make them transactional (by adding compensating code), and run some of those steps serially but group others together and run them in parallel within the group, depending on what guarantees are required.
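For readers who haven't seen a Step Functions state machine, the sketch below builds an Amazon States Language definition as a Python dictionary. It is not our production definition: the state names, function names, placeholder ARNs, and the choice of which steps to group are all invented for illustration. It shows serial `Task` states, one `Parallel` group, and `Catch` rules that route failures into a chain of compensating states.

```python
import json

# Placeholder prefix: real Lambda ARNs (or other task resources) would go here.
ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"


def task(function: str, next_state: str, on_error: str) -> dict:
    """A Task state that moves to `on_error` if the call fails for any reason."""
    return {
        "Type": "Task",
        "Resource": ARN + function,
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": on_error}],
        "Next": next_state,
    }


def undo(function: str, next_state: str) -> dict:
    """A compensating Task state; undo states chain back toward `Failed`."""
    return {"Type": "Task", "Resource": ARN + function, "Next": next_state}


def branch(state_name: str, function: str) -> dict:
    """A single-state branch for use inside a Parallel state."""
    return {
        "StartAt": state_name,
        "States": {state_name: {"Type": "Task", "Resource": ARN + function, "End": True}},
    }


state_machine = {
    "StartAt": "CreateRecords",
    "States": {
        # Serial steps: each one names the undo chain to enter if it fails.
        "CreateRecords": task("create-records", "AssignStorage", "Failed"),
        "AssignStorage": task("assign-storage", "Finalize", "UndoCreateRecords"),
        # Parallel group: branches run concurrently; any failure fails the group.
        "Finalize": {
            "Type": "Parallel",
            "Branches": [
                branch("AssignLicense", "assign-license"),
                branch("RouteTraffic", "route-traffic"),
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "UndoFinalize"}],
            "Next": "Done",
        },
        # Compensation chain, newest step first. UndoFinalize is assumed to be
        # idempotent, since only some branches may have completed.
        "UndoFinalize": undo("undo-finalize", "UndoAssignStorage"),
        "UndoAssignStorage": undo("undo-assign-storage", "UndoCreateRecords"),
        "UndoCreateRecords": undo("undo-create-records", "Failed"),
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail", "Error": "WorkflowFailed"},
    },
}

print(json.dumps(state_machine, indent=2))
```

Generating the JSON from code rather than writing it by hand keeps the forward path and the compensation chain next to each other, which makes it easier to spot a step that has no undo.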
Once a user makes a request for a new Contentful space through our web app, our system executes the following workflow:
The orchestrator kicks off the workflow.
Service A creates space records (and a unique key to link everything together). This is run on a service on our clusters that maintains the central records.
Service B gives the space a shard placement to store data. This is handled behind the scenes by emitting messages to our RabbitMQ broker, with the publisher currently being transitioned to AWS SNS.
Service C checks the user entitlements and assigns a license. This happens partially on the cluster and partially serverless via its own custom AWS Step Functions-based workflow.
Service D then routes traffic so that the space is accessible to the user, which again happens at cluster level.
If any transaction in this workflow fails, the system rolls back every preceding step using its compensating transaction. The user then receives an HTTP 4XX error, and they can try again. If everything succeeds, the workflow returns an HTTP 200 and the app can then safely deliver the user to the newly created space.
The orchestrator in our Saga workflow pattern defines and manages the order of steps. When the steps in service A have completed, the orchestrator calls service B, then C, and so on. This guaranteed order ensures that whenever the system executes an action, it can compensate (or roll back) that action if there is a failure. Note that some of the services in the workflow above perform a single task, while others consist of multiple services operating asynchronously, depending on whether they need to guarantee completion, so figuring out how to fully compensate for a step is part of deciding how you should group things. A common requirement here is that if a workflow emits a message, the message should be emitted at the end of the chain, as rolling back messages is harder than rolling back local state. Steps that cannot be compensated for should always be positioned at the end of the workflow.
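The ordering rule for steps that cannot be compensated is easy to check mechanically. Here is a small, hypothetical validator (the `StepSpec` type and `misplaced_steps` function are assumptions for illustration) that flags any non-compensable step that is followed by a compensable one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepSpec:
    name: str
    compensable: bool  # False for e.g. a message that cannot be recalled


def misplaced_steps(steps: List[StepSpec]) -> List[str]:
    """Return non-compensable steps that are followed by a compensable step."""
    misplaced: List[str] = []
    for index, step in enumerate(steps):
        later_compensable = any(s.compensable for s in steps[index + 1:])
        if not step.compensable and later_compensable:
            misplaced.append(step.name)
    return misplaced
```

A check like this could run in CI against a workflow definition, so that an irreversible step can't quietly be placed ahead of steps that might still fail and trigger a rollback.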
Of course, I can't share too much technical detail about our specific implementation (trade secrets!), but you can find a practical example of implementing a state machine workflow using AWS Step Functions here.
We do still use event-driven designs, especially where latency matters. You must pick the best tool for the job (there's no silver bullet that works for every use case), and not restrict yourself to using one system design across a whole application. Workflows and event-driven designs are not exclusive to each other, and can co-exist with different steps and groupings that best fit your technical requirements and user expectations.
(As a side note, I also experimented with Elixir for building transactional workflows. I can recommend it as an alternative technology for these scenarios, as its many built-in primitives allow for quick iterations if you haven't reached planet-scale yet.)
Contentful itself forms part of many businesses' distributed system design: It allows you to focus on building your unique functionality, and leave the task of hosting and managing any kind of content to our composable content platform.
Scaling at Contentful is a bit different to other platforms, as whenever we sign up a new user, we're really delivering content to all of their users. Our infrastructure needs to be able to handle increased demand at the drop of a hat.
A workflow approach to our distributed system design has gone a long way toward making sure that we can provide a seamless user experience, while ensuring that content is always available and ready to deliver to our customers' apps, websites, and other channels.