It's early 2024 and S3 is as ubiquitous in computing as ever. The oft-touted adjectives "serverless" and "cloud-native" usually mean "we're using S3 for something." Amazon continues to push the boundaries of scale, further solidifying S3 as the foundation for countless applications, platforms, and companies.
Recently, Amazon released its S3 Express One Zone storage class, advertised as the "fastest cloud object storage for performance-critical applications." In this post, I'll highlight three key features of Express One and share some educated guesses about what they imply for S3's internal architecture.
S3 Express One Zone Highlights
Single Availability Zone
Up until now, S3 has always been scoped to an AWS region. With S3 Express One, your data is instead scoped to a single Availability Zone (AZ). By opting in, you sacrifice a bit of durability (you're no longer tolerant to AZ failures) but gain the performance benefits of co-locating storage and compute.
Crazy Default Throughput
Classically, S3 allows up to 5.5K GET Requests-Per-Second (RPS), and 3.5K RPS for updates, for a given prefix (the path leading up to an object's key). Customers can bump up their throughput by manually spreading objects across more prefixes. On S3 Express One, however, you get hundreds of thousands of RPS out of the box.
Support For New Media (SSDs?)
Single-digit millisecond access times for persistent storage sound excellent. The S3 team has never explicitly mentioned SSDs, but I believe it's highly likely that this kind of performance is backed by new storage media. Behind Amazon's "Storage Classes," there are strong signs that they are investing in new hardware types alongside new and novel software techniques.
The Vision
S3 is becoming the infinitely durable, infinitely scalable hard drive for the Internet. Let’s dive into each of these areas in a little more detail.
Regional Scope to AZ Scope
S3 is a regional service where the availability of one region (us-east-1) can never affect the availability of another region (eu-west-2).
This proved challenging for a few reasons:
- Best-effort replication of indexing data created unpredictable amounts of eventual consistency, e.g. uploading an object in us-east-1 but waiting hours or days before it's visible in us-west-1
- The availability of all regions was coupled together with the general availability of S3
- Unpredictable latency moving data bits along a big network
AWS re:Invent 2023 – Deep dive on Amazon S3 (STG314)
The first implication of AZ-scope is that latency improves for both writing and reading.
Whenever you upload an object, S3 will break it into many pieces, or shards. The important part of this procedure is not the sharding itself, but how the shards are distributed. The distribution of the shards—and the healthy upkeep of this distribution—is what gives S3 its 11 9s of durability.
Even though the algorithm for the shard distribution is proprietary, we can guess at what properties must be maintained at all times:
- Given some amount of faulty hardware in an AZ (disks, rack, etc.), your data must still be available.
- Given the loss or partial loss of an AZ, your data must still be available.
Before your PUT request returns a 200, your object has already been broken up and distributed in a way that meets these requirements. S3 also employs large fleets of background workers to ensure these properties stay true at all times.
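To make that concrete, here is a minimal sketch of what fault-domain-aware shard placement could look like. The shard counts (a hypothetical "any 9 of 14 shards can reconstruct the object" split), the racks-as-fault-domains framing, and every name below are my own illustrative assumptions, not S3's actual algorithm:

```python
import random
from collections import Counter

# Hypothetical erasure-coding split: the object becomes N shards,
# any K of which are enough to reconstruct it.
N_SHARDS = 14
K_REQUIRED = 9

def place_shards(racks, shards_per_rack_limit=2):
    """Pick a rack for each shard so that losing any single rack
    still leaves at least K_REQUIRED shards readable."""
    placement, counts = [], Counter()
    for _ in range(N_SHARDS):
        candidates = [r for r in racks if counts[r] < shards_per_rack_limit]
        rack = random.choice(candidates)
        counts[rack] += 1
        placement.append(rack)
    # Durability invariant: no single rack failure drops us below K_REQUIRED.
    assert all(N_SHARDS - c >= K_REQUIRED for c in counts.values())
    return placement

print(place_shards([f"rack-{i}" for i in range(10)]))
```

The background workers mentioned above would then continuously re-run something like this check and re-shard data whenever hardware fails or drifts out of compliance.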
With S3 Express One, one write-latency improvement is that your object only needs to be sharded and distributed within a single AZ instead of across a full region. This matters for latency-sensitive applications because even the ~60 miles between AZs makes a difference.
For read latency, access times improve if you're able to co-locate compute with storage in the same AZ. Latency is further improved by new media types, which we will discuss soon.
An interesting side-effect of Express One's AZ scope is that it may influence how companies choose to deploy compute. To stay available in the face of AZ failure, you may have set up your K8S clusters to deploy regionally (across many AZs) by default. However, if you want the performance benefits of S3 Express One, you'll have to be explicit about which AZ your compute is deployed in and, subsequently, more deliberate about handling AZ failures.
100s of 1000s of Requests Per Second!
With classic storage classes, S3 gradually scales with you as your application accesses your objects more frequently. For a given prefix (/my-bucket/analytics/2023/jan_1.dump), the limit is 5.5K RPS for GETs and a similar order of magnitude for updates. Customers are empowered to scale by adjusting prefixes (/my-bucket/{some-distribution-of-values}/analytics/2023/jan_1.dump) to gain more throughput, e.g. allocate 10 prefixes and get an aggregate 55K RPS.
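As a rough illustration of the prefix trick, here is a sketch that spreads objects across a fixed set of hash-derived prefixes so that each prefix earns its own request-rate allowance. The bucket name, prefix count, and key layout are placeholder assumptions:

```python
import hashlib
import boto3

s3 = boto3.client("s3")
NUM_PREFIXES = 10  # e.g. 10 prefixes * 5.5K RPS gives roughly 55K aggregate GET RPS

def shard_key(key: str) -> str:
    """Prepend a stable hash-derived prefix so load spreads across prefixes."""
    shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PREFIXES
    return f"{shard:02d}/{key}"

# Hypothetical usage: writes (and later reads) go through the sharded key.
s3.put_object(
    Bucket="my-bucket",
    Key=shard_key("analytics/2023/jan_1.dump"),
    Body=b"...",
)
```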
The one drawback of this is that it does not support bursts of new traffic well. S3 doesn't start by giving you 5.5K RPS; it needs time and a smooth upward slope of traffic to scale your prefix up. Given that S3 is marketing itself as the premier storage for latency-sensitive and AI/ML applications, can't they just give us more throughput from the get-go?
It turns out that Amazon S3 can easily give you this throughput without breaking a sweat. For many multi-tenant software services, there is a common partitioning scheme that software engineers follow:
- Your application is getting off the ground. You make the reasonable choice to partition by some tenant_id or customer_id.
- Your application is successful and now some tenants are orders of magnitude larger than others. Hot spots start popping up. Instead of re-architecting everything, you opt for some kind of selective partitioning, e.g. Kim Kardashian’s X account gets special treatment in the backend.
S3 has actually reached a third level of scale where #2 is no longer an issue for them. With Amazon’s ultra-scale, the aggregate of all of their workloads (no matter how sporadic and bursty they may be) will be smooth—and subsequently the partitioning scheme may be very simplistic.
AWS re:Invent 2022 – Deep dive on Amazon S3 (STG203)
S3 may be one of the few software services in this world that can do this.
My guess is that for classic buckets, S3's partitioning scheme might have looked something like {customer_id}_{bucket}_{object_prefix}_{hash_object_key_and_other_metadata}. If you imagine this as the partitioning scheme for a consistent hash ring, then it would be convenient for the architecture to slowly scale up for a given customer, bucket, or object prefix.
With S3 Express One, you are forced to use the new "Directory Bucket." Under the hood, I believe this is actually just a way for the S3 architecture to utilize a different, probably more uniform, partitioning scheme, e.g. {one_hash_of_many_fields}, specifically for Express One objects.
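To visualize the difference, here is a toy sketch contrasting the two guessed-at partition key formats. Everything here (the field names, the number of partitions, and the placement function) is an illustrative assumption rather than anything S3 has published:

```python
import hashlib

def classic_partition_key(customer_id, bucket, prefix, object_key):
    # Guess at the classic scheme: identifying fields kept in the clear, so all
    # of one customer's traffic to one prefix shares the same leading bytes.
    meta = hashlib.sha256(object_key.encode()).hexdigest()[:8]
    return f"{customer_id}_{bucket}_{prefix}_{meta}"

def express_partition_key(customer_id, bucket, prefix, object_key):
    # Guess at the directory-bucket scheme: one hash over every field, so keys
    # land uniformly no matter how skewed the workload is.
    return hashlib.sha256(f"{customer_id}|{bucket}|{prefix}|{object_key}".encode()).hexdigest()

def partition_for(partition_key, num_partitions=8):
    # Stand-in for the real placement logic: keys that share leading bytes end
    # up in the same partition, mimicking a lexicographically split keyspace.
    return int.from_bytes(partition_key[:4].encode(), "big") % num_partitions

for obj in ("jan_1.dump", "jan_2.dump", "jan_3.dump"):
    print(partition_for(classic_partition_key("cust-42", "my-bucket", "analytics/2023", obj)),
          partition_for(express_partition_key("cust-42", "my-bucket", "analytics/2023", obj)))
```

With the classic-style key, one hot customer keeps landing in the same partition; with the fully hashed key, the same objects scatter across partitions.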
Given Amazon’s trillion-object scale, they’ve grown beyond hotspots and have no problem distributing heavy amounts of your application’s load to their fleets. This is a scenario where insane scale actually makes things easier.
And if that wasn’t enough throughput for you, S3 has even more ways to increase parallelism such as multi-value DNS entries and multi-part upload requests.
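On the client side, the easiest way to tap that parallelism is to let the SDK split large objects into concurrent multipart uploads. A minimal boto3 sketch, with the bucket name, file name, and tuning values as placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Split anything over 8 MiB into 8 MiB parts uploaded by 16 threads in parallel.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=16,
)

s3.upload_file("jan_1.dump", "my-bucket", "analytics/2023/jan_1.dump", Config=config)
```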
Single-digit Millisecond Access with ShardStore
The release of S3 Express One represents Amazon’s commitment to support high-performance, latency-sensitive applications. Up until now, S3 has scaled to amazing levels with commodity HDDs. They have focused on durability and availability, but raw performance has never quite made it to the top of their requirements list—until now.
Assuming that the mechanics of HDDs are not sufficient for high-performance applications, S3 has shown signs of evolving to support new hardware, or in marketing terms, new "Storage Classes." Managing persistence on brand-new media types would necessitate significant changes in the software and data structures used. Said another way, it would not be realistic for S3 to keep its existing software and simply swap out the hardware.
For the last few years, Amazon S3 has been developing and gradually deploying "ShardStore", a new Rust-based service dedicated to managing the storage of shards (the pieces of your S3 objects), which I believe may be purpose-built for SSDs. The linked paper focuses on verification techniques for ShardStore, but it includes a short section on its design, which we'll review.
Introducing a New Style of LSM Trees
If S3 wants to consistently deliver the first bytes of data to you within single-digit milliseconds, they will need hardware that is exceptional at random access. SSDs have their own trade-offs (e.g. expensive), but the one area where they shine is with random access.
One challenge of LSM Tree databases is that there can be a high amount of write and read amplification. For a given write, layers of SSTables may be triggered for compaction. Similarly for a given read, layers of segments may need to be seeked to and scanned.
The paper “Internal Parallelism of Flash Memory-Based Solid-State Drives” produces some important test results regarding SSDs:
- Concurrent read and write operations can cause “interference” that can negatively impact performance.
- In good conditions, when random reads are issued concurrently, the aggregate throughput can match sequential throughput for some workloads.
Another important property of SSDs worth mentioning is that their health and longevity are correlated with how often you write to them, due to the physics of the "Program/Erase" cycle.
Bringing all these points together, we come to the conclusion that classic LSM Tree data structures, with their heavy amounts of I/O amplification, may work effectively for HDDs, but would not be able to bring out the full potential of SSDs.
Value-Less LSM Trees
The only official publication I could find about ShardStore was in the paper “Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3.” An important note is that the design of ShardStore is heavily influenced by WiscKey (2016).
The main novelty here is that ShardStore uses an LSM Tree where the keys and values are separated, as opposed to conventional LSM Trees where both keys and values co-exist in one big logical tree.
In a sense, these new LSM Trees are “value-less”—instead of holding the KV value pairs, the tree maintains pointers to disk offsets where the actual values (data bits of your S3 object) reside. The primary advantage here is that the size of the tree is significantly reduced and subsequently write amplification may be significantly lower. Of course this new setup introduces its own challenges and complexities, but there must be a compelling reason why the S3 team has chosen this architecture!
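Here is a stripped-down sketch of that idea, loosely in the spirit of WiscKey: a small index holds only pointers, while the values themselves are appended to a separate log. The class and field names are mine, and a real implementation would add memtables, SSTables, batching, and much more:

```python
class ValueLessLSM:
    """Toy key-value store: an index of pointers plus an append-only value log."""

    def __init__(self):
        self.index = {}               # key -> (offset, length); stand-in for the sorted LSM tree
        self.value_log = bytearray()  # stand-in for on-disk extents

    def put(self, key: str, value: bytes):
        offset = len(self.value_log)
        self.value_log += value                  # the value goes to the log...
        self.index[key] = (offset, len(value))   # ...only a small pointer goes in the tree

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        return bytes(self.value_log[offset:offset + length])

store = ValueLessLSM()
store.put("shard-0001", b"first 64 KiB of some object...")
print(store.get("shard-0001"))
```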
Complexities and Their Potential Solutions
For the remainder of this article, I’ll go into more detail regarding this technique’s complexities and potential solutions. I’ve summarized these to help my own understanding, but interested readers may benefit more from going directly to the papers.
Conceptualizing the Complexity—Managing Two Things Instead of One
In a classic LSM Tree, we conceptually think of one very large tree. The tree contains all KV pairs and is the data. This data might exist in a memtable or in various layers of SSTable segments.
Now take each KV pair in the tree, move the values into special regions on persistent storage, and replace those values with pointers to those regions. Conceptually, the database is no longer a tree, but the combination of a special LSM Tree and various regions of storage.
The following diagram (taken from Amazon's paper) is a high-level overview of ShardStore's architecture. An LSM tree still exists, but is now significantly smaller because it only contains pointers. This technique also enables a lot of flexibility with where the actual data can be placed, e.g. ShardStore might manage multiple physical SSD drives and selectively "spread out" the writes across them to promote hardware longevity.
To repeat the key point: our conceptual "database" now consists of two essential components, the value-less LSM Tree and all the persistent storage it references.
With this approach, the software has to keep the tree and the extents synchronized at all times: the tree, alongside all its pointers, must remain consistent with the chunks of data persisted in the extents.
In the diagram above, you’ll see that each shard is associated with N pointers. This is because ShardStore will actually break up your shards (which are already broken up pieces of your object) further into “chunks.” The chunks are what gets written into extents (the gray and blue squares).
This shows just how much S3 breaks down and distributes your data, i.e. you PUT an object, that object turns into shards spread across an AZ/Region, and those shards turn into chunks spread across extents.
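Extending the earlier toy sketch, a shard might be split into fixed-size chunks, each appended to some extent, with the tree keeping a list of pointers per shard. The chunk size, the spreading policy, and all names here are illustrative guesses rather than anything from the paper:

```python
CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB chunks

class Extent:
    """Toy append-only extent: a byte buffer whose length is the write pointer."""
    def __init__(self, extent_id):
        self.id = extent_id
        self.data = bytearray()

    def append(self, chunk: bytes) -> int:
        offset = len(self.data)
        self.data += chunk
        return offset

class ChunkedShardStore:
    def __init__(self, num_extents=4):
        self.extents = [Extent(i) for i in range(num_extents)]
        self.index = {}  # shard_id -> list of (extent_id, offset, length) pointers

    def put_shard(self, shard_id: str, shard: bytes):
        pointers = []
        for i in range(0, len(shard), CHUNK_SIZE):
            chunk = shard[i:i + CHUNK_SIZE]
            extent = self.extents[len(pointers) % len(self.extents)]  # naive spreading
            offset = extent.append(chunk)
            pointers.append((extent.id, offset, len(chunk)))
        self.index[shard_id] = pointers

    def get_shard(self, shard_id: str) -> bytes:
        return b"".join(
            bytes(self.extents[eid].data[off:off + length])
            for eid, off, length in self.index[shard_id]
        )
```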
If anything in the tree changes, a corresponding extent would need to be updated. Similarly, if a background process were to clean up an extent, then something in the original LSM Tree would need to be updated.
If you compare this to the classic LSM Tree, you may notice the delta in complexity. When a classic LSM Tree goes through background compaction, it's just the merging of SSTable segments; no memory or special pointers need to be touched. When you write data, you edit the tree in the memtable and that is the only way an SSTable can be updated.
Now think about how merging and compaction might work with a ShardStore LSM Tree—it’s not as simple as just combining a few immutable files on disk.
Range Queries Still Work
One of the defining characteristics of classic LSM Trees is that all keys and values are in sorted order. To read a value, or range of values, we locate a particular SSTable (usually with some sparse index) and can perform a sequential scan (potentially across many layers of files) to get our content. This is not extremely fast, but it’s also not extremely slow as HDDs are quite good at sequential scans; most of the time spent on reads is for the seek.
As we mentioned earlier, ShardStore’s LSM tree contains just pointers; the actual data isn’t there anymore. Getting a range of key/values is no longer a simple sequential scan of one file; it now becomes potentially a bunch of random access queries into persistent storage. This would perform extremely poorly on HDDs.
The WiscKey paper mentions that when random access reads are issued concurrently on an SSD, the aggregate throughput can actually match sequential throughput. Said another way—SSDs are really good at random access reads.
This is one of the main reasons why it’s likely ShardStore’s value-less LSM Trees are purpose-built for SSDs. The internal parallelism built into SSD hardware can support this data structure’s access patterns whereas this setup would be completely untenable on spinning disks.
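A range read over a value-less tree might look like this sketch: walk the sorted keys to gather pointers, then fan the point reads out concurrently so the SSD's internal parallelism can absorb them. The thread pool and data layout are placeholders for whatever I/O path a real store would use:

```python
from concurrent.futures import ThreadPoolExecutor

def range_get(index, extents, start_key, end_key, max_parallel_reads=32):
    """index: key -> (extent_id, offset, length); extents: extent_id -> bytes-like."""
    # 1) The cheap part: scan the small, sorted index for the pointers in range.
    pointers = [ptr for key, ptr in sorted(index.items()) if start_key <= key < end_key]

    # 2) The part HDDs would hate: many random reads. Issuing them concurrently
    #    lets an SSD service them from multiple flash channels at once.
    def read_chunk(ptr):
        extent_id, offset, length = ptr
        return bytes(extents[extent_id][offset:offset + length])

    with ThreadPoolExecutor(max_workers=max_parallel_reads) as pool:
        return list(pool.map(read_chunk, pointers))
```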
Reclaiming Space
In classic LSM Trees, the process of reclaiming space is achieved via background tasks which merge and compact SSTables. This process is functionally simple—it’s a pseudo merge-sort between files. The challenge with these tasks is that they may cause heavy amounts of write amplification and even get in the way of the database serving requests.
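For reference, that merge really is about as simple as it sounds. A toy newest-wins merge of sorted runs might look like this, with in-memory lists standing in for on-disk SSTable files:

```python
import heapq

def compact(sstables):
    """Merge sorted SSTable runs into one; sstables are ordered newest -> oldest.

    Each sstable is a list of (key, value) pairs already sorted by key."""
    tagged = (
        ((key, age, value) for key, value in table)
        for age, table in enumerate(sstables)
    )
    merged, last_key = [], None
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:          # first hit per key is the newest version
            merged.append((key, value))
            last_key = key
    return merged

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("b", 20), ("d", 40)]
print(compact([new, old]))  # [('a', 1), ('b', 20), ('c', 3), ('d', 40)]
```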
With ShardStore LSM Trees, we have one small LSM Tree that still goes through normal compaction and merge procedures. However, the actual bulk of space to be reclaimed will be in extents. ShardStore introduces a lightweight background “garbage collection” task that handles this.
In ShardStore, the extents are append-only. There is no way to immediately "delete" or "free up" a chunk in an extent after a shard is deleted. However, we do have a way to mark a chunk as invalid: for a DELETE request on a given shardID, the only thing that has to happen is that the shardID's entry in the LSM Tree is removed.
In the diagram above, the gray box in extent 18 is invalid, i.e. the shard that was referencing this chunk was deleted.
A reclamation task running on extent 18 will:
- Iterate through every single chunk from the beginning of the extent
- If the chunk is still referenced by the LSM Tree, it is still VALID: append it to another extent and update its pointer in the tree
- If the chunk is not referenced by the LSM Tree, it is INVALID and we continue on to the next chunk
- Once all chunks are evaluated, the write pointer for that extent is reset and the space in extent 18 is reclaimed.
A few supporting mechanisms are implied by this design:
- There must be a way to track the active "write pointer" for each extent.
- There must be a way to do reverse look-ups from chunks back to the LSM Tree.
- There may be a way to track which extents are "active" or given priority over others for new shards.
As you can see, this process is quite different than a classic compaction/merging process for an LSM Tree!
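Putting the steps above into a toy sketch, with plain dicts standing in for the tree and the extents (all names are mine, not ShardStore's):

```python
def reclaim_extent(index, extents, src_id, dst_id):
    """Toy reclamation pass: move still-referenced chunks out of extent `src_id`
    into extent `dst_id`, then reset `src_id`'s write pointer.

    index:   shard_id -> list of (extent_id, offset, length) pointers
    extents: extent_id -> bytearray of appended chunks
    """
    # A real implementation would need an efficient reverse lookup (chunk ->
    # referencing shard); this sketch simply scans the whole index instead.
    for shard_id, pointers in index.items():
        new_pointers = []
        for (eid, offset, length) in pointers:
            if eid != src_id:
                new_pointers.append((eid, offset, length))  # lives elsewhere, untouched
                continue
            # Still referenced by the tree, so the chunk is VALID: append it to
            # the destination extent and fix up its pointer in the tree.
            chunk = extents[src_id][offset:offset + length]
            new_offset = len(extents[dst_id])
            extents[dst_id] += chunk
            new_pointers.append((dst_id, new_offset, length))
        index[shard_id] = new_pointers

    # Any chunk nobody copied was INVALID (its shard was deleted); clearing the
    # buffer is the moral equivalent of resetting the extent's write pointer.
    extents[src_id] = bytearray()
```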
Crash Consistency
Finally, another major difference between these two designs is crash consistency. How is everything left in a consistent state after unexpected failures?
In classic LSM Trees, this responsibility falls to the Write-Ahead Log (WAL), which is updated at the start of every single operation. ShardStore and WiscKey state that they do not bother with a WAL and still manage to ensure crash consistency. I won't review the strategies here, but please check out the papers if you're curious.
Conclusion
S3 Express One Zone is a huge step forward for Amazon. S3 is no longer a big bucket full of image and video files, but is becoming the insanely reliable, durable, and performant hard drive for the Internet. Every day a new company pops up with a “serverless” platform that leverages S3 under the hood. Amazon S3—alongside YouTube, Netflix, WhatsApp, etc.—continues to strengthen its presence as one of the modern wonders of the computing world.