June 27, 2011 Swift in the Small
For the upcoming OpenStack meetup the theme is ‘Corporate IT’. This got me thinking about what a small-scale Object Storage (Swift) cluster would look like.
We had 80-100TB staging environments, but those can still be big entry points for some shops.
I wanted something small in the 10’s of TB range that would be useful or corporate IT or for web/app shops that for whatever reason, don’t use public clouds. There is a lot of great tooling available for object storage systems that private deployments can take advantage of. So the challenge was to design a Swift cluster that could start-out with a single node (4-16 TB) and expand up to 4 nodes (32-144 TB).
Why is this a challenge? — Zones
Swift is designed for large-scale deployments. The mechanisms for replication and data distribution are built on the concept that data is distributed across isolated failure boundaries. These isolated failure boundaries are called zones.
Unlike RAID systems, data isn’t chopped up and distributed throughout the system. Whole files are distributed throughout the system. Each copy of the data resides in a different zone.
As there are 3 copies of the data, at least 4 zones are required. Preferably 5 zones (so that 2 zones can fail).
Racks or Nodes as Zones
In the big clusters, failure boundaries can be separate racks with their own networking components.
In medium deployments, a physical node can represent a zone.
Drives as Zones
For smaller deployments with fewer then 4 nodes, drives need to be grouped together to form pseudo-failure boundaries. A grouping of drives is simply declared a zone.
Here is a scheme for starting small and growing the cluster bit-by-bit (well.. terabyte-by-terabyte).
For a single storage node the minimum configuration would have 4 drives for data + 1 boot drive.
If a single drive fails, it’s data will be replicated to the remaining 3 drives in the system.
The system would grow, 4-disks at at time (one in each zone) until the chassis was full.
2 Storage Nodes
The strategy here is to split the zones evenly across the two nodes.
The addition of an additional node does increases availability (assuming that load balancing is configured), but it does does not create a master-slave configuration. If one of the nodes is down ½ of your zones are unavailable.
The good news is that if one of the nodes is down (½ of your zones), data is still accessible. This is because because at least one of the zones will still up on the remaining node.
The bad news is that there is still a 1 in 2 chance that writes will fail because at least two of three zones need to be written to for the write to be considered successful.
3 Storage Nodes
The addition of a third node further enables distribution of zones across the nodes. Something strange is going on here by putting whole zones in each node, but breaking up zone 4 into thirds and distributing across the three nodes. This is done to enable smoother rebalancing when going to 4 nodes.
Again, if a single node is down, data will be available, but there will be a 1 if 5 chance that a write would fail.
4 Storage Nodes
The strategy of breaking up Zone 4 into thirds with 3 nodes, is to make this transition easier. The cluster can be configured with zone 4 entirely on that new server, then the remaining zones can slowly be rebalanced to fold-in the newly vacated drives on their node.
Now, if a single node fails, writes will be successful as at least two zones will be available.
Why Small-Scale Swift?
Using OpenStack Object Storage is a private-cloud alternative to S3, CloudFiles, etc. This enables private cloud builders to start out with a single machine their own data center and scale-up as their needs grow.
Why not use RAID?
Why not use a banana? :) It’s a different storage system, used for different purposes. Going with a private deployment of Object Storage gives something that looks and feels just like Rackspace Cloud Files. App developers don’t need to attach a volume to use the storage system and assets can be served directly to end users or to a CDN.
The bottom line is that a small deployment can transition smoothly into a larger deployment. The great thing about OpenStack being open-source software is that it gives us the freedom to build and design systems however we see fit.