Published on March 7, 2024
This is the second post in our two-part story about environment cloning improvements at Contentful. In the previous post, we covered the challenges of the existing environment cloning process and introduced the idea of applying a copy-on-write approach to improve it. We highly recommend reading the first part to gain more context on the problem and the solution.
The first post ends on a dramatic note — our initial vision of copy-on-write was challenged by a significant performance issue. Let’s start the second part with our journey to solving that problem.
Initially, we envisioned mapping tables purely as a lightweight mechanism to share content records across environments. They only had two columns — one to identify the environment and another one to identify the content record belonging to that environment.
The idea for this structure comes from standard database normalization practice for implementing a many-to-many relationship. But we proved once again that normalization reduces data redundancy and improves integrity, while not necessarily improving performance.
It’s worth revealing more details about how we decompose API entities (i.e., the entries, assets, or content types accessible via the APIs) into relational tables:
The entity table holds the metadata of an entity, such as timestamps, versions, etc. These are exposed in the API payloads in the sys object properties.
The entity fields table holds the data customers edit via CMA or the web app. These are exposed in the API payloads in the fields object.
So, the combined structure of entity tables with copy-on-write would look like this:
Initial copy-on-write table structure.
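Sketched as (hypothetical) DDL, the initial structure looked roughly like this; the table names match the read query shown a bit further down, but the columns and types are assumptions:

```sql
-- A hypothetical sketch of the initial copy-on-write structure.
-- Column names and types are assumptions, not the real schema.

-- Entity table: sys metadata (timestamps, versions, ...).
CREATE TABLE entries (
    entity_id  bigint PRIMARY KEY,
    created_at timestamptz NOT NULL,
    updated_at timestamptz NOT NULL,
    version    integer NOT NULL
);

-- Entity fields table: the content customers edit (exposed as "fields").
CREATE TABLE entry_fields (
    entity_id bigint NOT NULL REFERENCES entries (entity_id),
    fields    jsonb  NOT NULL
);

-- Mapping table: just two columns sharing one record across environments.
CREATE TABLE environments_entries (
    environment_id text   NOT NULL,
    entity_id      bigint NOT NULL REFERENCES entries (entity_id),
    PRIMARY KEY (environment_id, entity_id)
);

-- Reading an environment's content now requires two joins instead of one:
SELECT *
FROM environments_entries em
JOIN entries e       ON e.entity_id  = em.entity_id
JOIN entry_fields ef ON ef.entity_id = em.entity_id
WHERE em.environment_id = 'abc';
```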
Reducing the size of tables and indexes could have helped performance by reducing the amount of data each query had to access. If you remember from our previous blog post — about 80% of the content in our databases was not unique and we aimed to remove it after the copy-on-write implementation.
Removing duplicates would’ve allowed us to get the necessary performance back. But to remove duplicates, we needed to read data through the mapping tables, which didn’t perform well while the duplicates were still there. A perfect chicken-and-egg problem.
We also kept in mind that solving the performance problem by removing duplicates wouldn’t scale. Even if we somehow removed the duplicates and started using the current version of the mapping tables in production, we would just hand the performance issue over to our future selves: tables and indexes would still grow back to their current size over time.
Since an additional join introduced the performance problem, we started thinking about how we could reduce the number of joins back to one.
We could simply join the mapping table with the entry fields table to fetch the content, but we still needed the sys properties from the entry table to assemble a complete API response.
So, after experimenting, we realized that denormalization was the key — let’s move sys properties to the mapping table!
Denormalized copy-on-write table structure.
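A minimal sketch of the denormalized mapping table, again with assumed column names:

```sql
-- Hypothetical sketch: the sys metadata now lives on the mapping row itself.
CREATE TABLE environments_entries (
    environment_id text        NOT NULL,
    entity_id      bigint      NOT NULL,
    created_at     timestamptz NOT NULL,
    updated_at     timestamptz NOT NULL,
    version        integer     NOT NULL,
    PRIMARY KEY (environment_id, entity_id)
);
```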
The entity table was still there, but now it served merely as a connection point supporting a many-to-many relationship between the environment mapping table and entity fields table. This allowed us to skip joining the entity table when reading data and join the mapping table with the fields table directly:
SELECT * FROM environments_entries em JOIN entry_fields ef ON em.entity_id = ef.entity_id WHERE em.environment_id = 'abc';
And, most importantly, the query performance was back to normal.
We were happy to move forward with the solution that allowed us to roll out the feature without affecting the API performance. Contentful has hundreds of databases hosting terabytes of content. Restructuring core tables at this scale isn't something that can be done in several days, especially when we don’t want to impact the system’s availability. We still needed to come up with a detailed plan to deliver such a massive change to production.
The most straightforward solution was to deliver it silently, step by step. But it involved a lot of data migration steps (the first of which are sketched after the list):
Add sys columns to the mapping tables.
Start writing sys properties to the mapping tables, as well as keep writing them to the old entity tables.
Backfill sys properties in the mapping tables for existing entities.
Change the read queries to start using mapping tables instead of entity tables.
Stop writing sys properties to the entity tables.
Drop sys columns from the entity tables.
Implement and launch a complex procedure to remove duplicates.
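For illustration, steps 1 and 3 might have looked roughly like the following; the column names are assumptions, and this is not the migration we ultimately ran:

```sql
-- Step 1: add sys columns to the mapping table (names are illustrative).
ALTER TABLE environments_entries
    ADD COLUMN created_at timestamptz,
    ADD COLUMN updated_at timestamptz,
    ADD COLUMN version    integer;

-- Step 3: backfill them from the entity table for existing rows.
UPDATE environments_entries em
SET created_at = e.created_at,
    updated_at = e.updated_at,
    version    = e.version
FROM entries e
WHERE e.entity_id = em.entity_id;
```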
This all seemed doable until we dove deeper into the last step — removing duplicates.
In the previous post, we covered the approach of identifying duplicates by diffing entities with the SQL EXCEPT clause. We were happy with its performance, but we realized that there was something else we needed to take into consideration: concurrency.
Let’s say we want to identify and remove duplicates in one database transaction. So, we run our diff query and learn that entry-1 is the same in the two environments. We keep that information in memory and decide which entry stays and which one gets deleted. Then we run another query to update the mapping row of the duplicate, and voila, the duplicate has no mapping references:
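A hypothetical sketch of those two steps; the entity IDs are made up:

```sql
BEGIN;

-- Step 1: the diff query tells us entry-1 is identical in both environments.
-- Say the copy we keep (entryThatStays) has entity_id 101 and the duplicate is 202.
-- That decision lives only in application memory.

-- Step 2: repoint the duplicate's mapping row to the surviving entity,
-- leaving entity 202 without any mapping references.
UPDATE environments_entries
SET entity_id = 101
WHERE entity_id = 202;

COMMIT;
-- A concurrent transaction can modify either entry between these two steps.
```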
But what if someone changes any of those entries in between the two queries? If someone changes entryThatStays in its original environment, the changes from that environment would leak into other environments. And if someone changes any entry from the duplicates array, that change would be lost.
We also cannot lock individual rows on the mapping table because we only know which rows we are going to update after we fetch the duplicate IDs.
It was at this point we realized there was no way to safely remove duplicates without enforcing some sort of read-only time for our customers.
Since a read-only window was unavoidable, we started thinking about how to make the most of it.
The deduplication procedure had to identify and remove ~80% of entities from the database, and while identification was fast, it would still take quite some time to simply run the DELETE statements. Alternatively, instead of removing duplicates, we could copy unique entities into new tables and swap the old and new tables on the fly. Since only about 20% of our content was unique, we concluded that it would be easier and faster to copy that unique content than to delete the other 80% as duplicates.
And if we need to copy that somewhere — why not create a completely new schema from scratch?
Creating a new schema had several advantages:
Brand-new non-fragmented indexes and tables which should speed things up.
Once we have moved all unique content to the new schema, we can simply run TRUNCATE on the old one, which is much faster than running DELETE on individual rows.
No need to swap tables. We can simply redirect reads to the new schema once the migration is complete.
All writes go to both schemas, so we keep them in sync.
We could keep writing to both schemas, and if something goes wrong, we could always roll back reads by routing them to the old schema. This would also give us time to safely fine-tune the performance of the new schema.
Isolation of the new schema reduces the blast radius if things go wrong.
And, of course, disadvantages:
Keeping data consistent in both schemas would increase the latency of writes.
We would have to write data twice for every table, and that must happen within a transaction to keep data consistent. Operations within a transaction are serial; it’s not possible to run them in parallel. So, there’s an inevitable increase in the time it takes to make a consistent change in both schemas compared to a single schema.
We would need to implement sophisticated query routing between schemas.
If it sounds complex and looks like a lot of work, that’s because it is. But this dual-schema approach was the safest way forward, giving us peace of mind that customer data wouldn’t be lost along the way. So, we went for it!
At this point, we had a clear vision of the migration process. Let’s walk through the high-level overview:
First, we introduce the new schema and migrate all spaces there. While doing so, we keep writing to the tables in both old and new schemas so that we can roll back in case of problems:
To gain confidence about the new setup, identify problems more easily, and limit the impact of any problem, we gradually switch reads to the new schema:
And finally, once everything is migrated, stop writing to the old schema and clean it up:
Simple, right? So we rolled up our sleeves and brought the plan to life.
Before we dive into the implementation details, it is important to explain the motivation behind our decisions.
First, we had to migrate terabytes of data. Many things could go wrong along the way, so we wanted to reduce the risks as much as possible. We optimized for safety and reliability instead of migration speed and resource utilization.
Second, we wanted to design a process that allowed us to easily abort the migration at any given point in time. It had to allow us granular control over query routing so that we could quickly revert everything back to the old schema at any point. And all of that had to happen without customers noticing.
That is why we decided to keep the dual schema setup for as long as possible despite the cost overhead. This also required us to spend a significant amount of time preparing the application side and testing the new setup.
We were going to have two different schemas at the same time, so the application layer had to be aware of that and execute the correct queries on the correct schema.
The most important part was decision-making — how do we know which schema(s) to use for a given environment? We needed to control reads and writes individually on an environment level. Since we also needed transaction-level consistency to avoid holding the state in memory, we had to think about various race conditions. We decided to add columns to the environments and spaces tables to track their individual states.
We also kept in mind that with every additional flag or column, it gets harder to reason about the system’s state. As such, we kept the number of new columns to a minimum.
We added four new columns overall:
Environments.migration_finished_at (timestamp)
Spaces.migration_started_at (timestamp)
Spaces.migration_finished_at (timestamp)
Spaces.readonly (boolean)
These columns supported the read-only mechanism during migration, as well as routing writes to the necessary schema for migrated environments and spaces. To control reads, we introduced a feature flag that determined which shard read from which schema for migrated environments. The application code handling API requests would read the flag value to select the right schema.
Now that we had all the necessary flags, it was only a matter of adapting the application side to use them. Since the decision-making happens at runtime, the application had to find out which schema to use for every individual request by reading the feature flag and running SQL queries to read the column values. This created performance overhead, which we planned to compensate for simply by upscaling the database instances.
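At runtime, that decision boiled down to checking the feature flag plus something like the following query; the join condition and ID format are assumptions:

```sql
-- Fetch the migration state for the space and environment of the current request.
SELECT
    s.readonly,
    s.migration_started_at  IS NOT NULL AS space_migration_started,
    s.migration_finished_at IS NOT NULL AS space_migrated,
    e.migration_finished_at IS NOT NULL AS environment_migrated
FROM environments e
JOIN spaces s ON s.id = e.space_id
WHERE e.id = 'env-abc';
```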
Changing the rest of our data access layer was a tedious task, but the amount of time we put into planning gave us a clear understanding of every little detail, so we just had to follow the plan. Here’s a glimpse of the final decision-making chart for query routing:
Redacted flow chart of dual-schema query routing.
As demonstrated in the previous blog post, the old copy process had to copy source environment rows into temporary tables, update values there, and write them back into the original tables.
On the other hand, the new process was pleasantly streamlined and consisted of simple insert statements:
New environment copy only copies environment mapping rows.
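Conceptually, the copy is now a single insert over the mapping table; a hypothetical sketch, with the sys columns abbreviated:

```sql
-- Copy every mapping row (including the denormalized sys metadata)
-- from the source environment to the freshly created target environment.
INSERT INTO environments_entries (environment_id, entity_id, created_at, updated_at, version)
SELECT 'target-env', entity_id, created_at, updated_at, version
FROM environments_entries
WHERE environment_id = 'source-env';
```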
For 95% of environments, the new process completed the copy in less than 30 seconds. Needless to say, the amount of database resources it consumed was an order of magnitude lower than with the old approach.
On top of the brand-new procedure for the new schema, we also introduced a wrapper around the old one that took into account all the intricacies of a space’s and environment’s state to figure out which schema(s) to copy the environment on.
The pseudocode below demonstrates the new decision-making process:
As you can see, even when the source environment was migrated, we had to copy it on both schemas to keep them in sync. And yes, this made the copy process even slower. However, the new process took just several seconds, so the overhead was barely noticeable to the end user. In addition, the amount of “just in case” safety it gave us was worth it.
Once we ensured that the new schema performed well for read queries and the data was consistent, we got rid of dual copy and fully switched to the new copy process for migrated environments. We also prepared the procedure for spaces and environments that exist only in the new schema:
The biggest fear of an engineer during such migrations is losing data. With the new schema, we had to change the way we query content. It was a major change in a very sensitive area, hence, we covered everything with various tests. But tests operate on synthetic data, while our customers can build data structures of different layouts, volumes, and complexity levels. There was no way we could replicate this in tests to say, “Yes, it definitely works in all possible cases.”
So, how about we run the tests using the customers’ data? For a small fraction of production read requests, we ran the queries on both schemas and compared results on the fly. If there were differences between the two datasets — we triggered an alert and manually reviewed every such case. We also gave the process a creative name — consistency check.
Did we get any alerts? We got plenty! Were there actual differences in the datasets? Well, it depends.
Once we got our first alerts, we rushed to check the actual data in the database, and … it was identical in both schemas. But we were sure that wasn’t the case at the moment of the consistency check. Which could only mean one thing: a race condition.
The consistency check performed queries on the two schemas within one transaction, so where was the difference coming from? Well, from a parallel transaction changing the data. You see, we had always used the default Postgres transaction isolation level, READ COMMITTED, which allows queries within one transaction to see changes made by other committed transactions. To achieve the desired behavior, we needed the REPEATABLE READ isolation level, which works on a consistent snapshot taken at the start of the transaction and ignores changes made outside of it.
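In Postgres terms, that change is as simple as opening the check’s transaction differently; the schema and table layouts below are placeholders:

```sql
-- Both reads now see the same snapshot, so a write committed in between
-- can no longer produce a false difference between the schemas.
BEGIN ISOLATION LEVEL REPEATABLE READ;

SELECT entity_id FROM legacy.entries WHERE environment_id = 'abc' ORDER BY entity_id;

SELECT em.entity_id
FROM cow.environments_entries em
WHERE em.environment_id = 'abc'
ORDER BY em.entity_id;

COMMIT;
```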
Once we changed the isolation level for the consistency checks, we stopped receiving false-positive alerts. We started receiving real inconsistencies instead 😱.
However, those were not about wrong data per se. The difference was in the order of items returned from the two schemas. We always apply a specific ORDER BY to all queries to ensure a deterministic ordering, but what happens when the values we order by are the same in several rows? Well, then the order of such rows is defined by the way Postgres reads data, which could, for example, be determined by the physical table or index layout. This layout is different for every table and schema, leading to a different order in each.
Is having a different order between schemas the problem? After looking at it from different angles, we realized it could act as a real inconsistency from a customer perspective. Imagine having 55 entries and requesting them from CMA with a page size of 50. There will be two pages, and let's say items #50 and #51 have the same values in the fields you sort on. You retrieve the first page, and we read it from the old schema. You get item #50 as the last item on the first page. Now we switch reads to the new schema, and you’re fetching the second page. The query lands on the new schema, and we can’t guarantee that you won’t get item #50 again as the first one on the second page.
We found a simple solution, though: always append the public-facing ID, which is consistent across schemas, at the very end of the ORDER BY clause. Luckily, the row ordering problem exhausted our list of inconsistencies.
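For example (the sorted field and the public_id column are illustrative):

```sql
-- The public-facing ID at the end of ORDER BY breaks ties deterministically,
-- so both schemas return rows with equal sort values in the same order.
SELECT em.public_id, ef.fields
FROM environments_entries em
JOIN entry_fields ef ON ef.entity_id = em.entity_id
WHERE em.environment_id = 'abc'
ORDER BY ef.fields->>'title', em.public_id;
```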
Since the complexity of the change spanned far beyond a regular DML migration, we decided to implement a separate migration tool.
Before starting the migration, it made the space read-only and … waited. Waited for the in-flight CMA requests to be completed so that outside changes to the space wouldn’t be lost. We had to wait for up to 30 seconds before starting the actual migration. This doesn’t seem like a big deal, but if a database contains thousands of spaces, these 30 seconds easily pile up into days.
After realizing this, we decided to optimize the waiting time and made spaces read-only in batches, so that the wait is shared across the whole batch (wait_time_per_space_seconds = 30 / spaces_batch_size). With a batch of 10 spaces, for example, the effective wait drops to three seconds per space.
The tool fetched the environment tree for a given space. The tree contained all environments in a space, organized by a parent-child relationship, so that we knew which environment was the source and which was the target during environment copy. Then it copied the root environment over to the new schema as is. For the rest of the elements in a tree, it used the on-the-fly diffing between parent and child environments to copy only the unique content to the new schema and skip duplicates.
Every environment would be migrated within one transaction, so when anything went wrong, the state was reverted by the database, and we didn’t need to take care of it. Before committing the transaction, it did lightweight checks like counting the number of records migrated to ensure that we hadn't missed anything. It also took care of setting flags for a space or environment after completing the migration so that our query routing logic could pick those up.
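A very rough sketch of one environment’s migration step, with placeholder schema names and the copy details elided:

```sql
BEGIN;

-- ...copy the environment's unique content into the new schema,
--    using the parent/child diffing described above...

-- Lightweight sanity check before committing: per-environment row counts must match.
SELECT
    (SELECT count(*) FROM cow.environments_entries WHERE environment_id = 'env-abc') AS migrated,
    (SELECT count(*) FROM legacy.entries           WHERE environment_id = 'env-abc') AS expected;

COMMIT;  -- any failure before this point rolls the whole environment back
```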
You might ask: what happens when a database is overloaded during the migration? Glad you asked! We reused the auto-pausing migration technique that was previously showcased in another blog post. So whenever the CPU, RAM, or IO load of a database crossed certain thresholds, the migration automatically paused and waited for that metric to return to normal.
Contentful hosts content of varying scale for thousands of enterprise customers, so migrating all of them at once wasn’t an option. We planned a gradual rollout so that we could assess the results as we went. The simplest way to roll it out was to go shard by shard until all of them were migrated.
But remember, to migrate a space, we had to make it read-only, which meant customers couldn’t change anything for the duration of the migration. Obviously, we couldn’t start the migration without letting the customers know. For smaller customers, we scheduled the maintenance window outside their working hours. For bigger ones, the migration duration was longer than our SLA allowed, so we had to wait for their go-ahead to run the migrations for individual spaces at a specific time. This was something we expected, but we underestimated how much time that manual coordination could take.
Technically speaking, the migration was a great success. Our cautious approach paid off: we didn’t lose any data and had no incidents related to this gargantuan project. And, despite our expectations, we didn’t need to upscale most of our databases; they had enough resources to support the migration and the subsequent period of dual writes.
Finally, let’s talk numbers.
Our predictions came true: after migrating all the data, the new schema was on average 80% smaller than the previous one. Less data allows us to spend less on data storage. But it’s not only that: the reduction in data size has a great impact on the efficiency of many database operations. The indexes and tables are smaller, so index and table scans read less data, consume less RAM and disk IO, and run faster overall, which translates into lower latency for API requests.
Here are some specific numbers for some of our databases:
Database | Before | After | Reduction |
---|---|---|---|
#1 | 268.51 GB | 8.51 GB | 96.83 % |
#2 | 995.33 GB | 14.31 GB | 98.56 % |
#3 | 390.83 GB | 16.59 GB | 95.75 % |
The average environment copy is now 50 times faster! It might sound like we’re picking a metric that shows the best result and avoiding mentioning the rest, so we’re ready to reveal more.
Ninety-nine percent of environment copies (P99) completed within the following durations:
Infrastructure | Before | After |
---|---|---|
Single-tenant | 3000 seconds | 130 seconds |
Multi-tenant | 5100 seconds | 55 seconds |
The raw numbers might be hard to comprehend, but let’s try to think about it from the product perspective as something that brings benefits to our customers:
Predictability: Teams now get consistent environment copy time and can plan their operations accordingly.
Increase in efficiency: A 50x increase in speed represents a monumental improvement in the copying process. This allows for near-instantaneous creation of environments, transforming the workflows for content management.
Productivity gains: Such a dramatic speed increase enables teams to work at an unprecedented pace. This can lead to a much quicker turnaround on projects, freeing up resources and time for additional tasks or enhancements.
Revolutionized development and testing cycles: With copying being that fast, development and testing cycles could be drastically shortened. This allows for more iterative and agile development practices, enabling rapid prototyping and testing.
Enhanced scalability and flexibility: The ability to copy environments in less than five minutes provides immense scalability and flexibility. Companies can scale their operations more effectively, adapting quickly to changing requirements without being bottlenecked by environment copy times.
We received plenty of feedback for the new copy process, and we’d like to share some quotes here:
I just tested the ability to create new environments based on the master in a matter of minutes (1-2), rather than the 15-20 minutes we’re used to seeing. The new environment I created this morning was ready in around 30 seconds, this is a game changer for our development workflows and internal tooling across the board.
– Company from the payments industry
Fifteen seconds to create a new Contentful environment following the infra upgrade. Absolutely stellar stuff.
– Company from the business analytics industry
The space looks like it's working properly, and I can confirm the improvement: environments are lightning-fast to create! Thanks a lot to you and your teams for this evolution, as well as the smooth migration! That’ll come in very handy for our evolutions and ability to leverage environments for feature branches, as well as overall stability.
– Company from the gaming industry
Intrigued to learn more? Eager to hear the full story? Don’t forget to circle back and read part one of this two-part series.