June 27, 2011 Swift in the Small
For the upcoming OpenStack meetup the theme is ‘Corporate IT’. This got me thinking about what a small-scale Object Storage (Swift) cluster would look like.
At Cloudscaling, we have already done two of the early large-scale OpenStack Object Storage deployments outside of Rackspace. These deployments were for service providers at the petabyte scale.
We had 80-100TB staging environments, but even that is a big entry point for some shops.
I wanted something small, in the 10's of TB range, that would be useful for corporate IT or for web/app shops that, for whatever reason, don't use public clouds. There is a lot of great tooling available for object storage systems that private deployments can take advantage of. So the challenge was to design a Swift cluster that could start out with a single node (4-16 TB) and expand up to 4 nodes (32-144 TB).
Why is this a challenge?
Zones
Swift is designed for large-scale deployments. The mechanisms for replication and data distribution are built on the concept that data is distributed across isolated failure boundaries. These isolated failure boundaries are called zones.
Unlike in RAID systems, data isn't chopped up and striped throughout the system. Whole files are distributed throughout the system, and each copy of the data resides in a different zone.
As there are 3 copies of the data, at least 4 zones are required. Preferably 5 zones (so that 2 zones can fail).
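The write-quorum arithmetic behind these zone counts can be sketched in a few lines of Python (a simplified model for illustration, not Swift's actual implementation):

```python
def quorum(replicas):
    """Smallest number of replica writes that must succeed before a
    write is reported successful: a simple majority."""
    return replicas // 2 + 1

# With the standard 3 replicas, 2 of the 3 writes must land, so one
# zone can be down and writes can still succeed.
print(quorum(3))  # -> 2
```

With 3 replicas each needing its own zone, a 4th zone gives replication somewhere to rebuild after a failure, and a 5th lets the cluster ride out two failed zones.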
Racks or Nodes as Zones
In the big clusters, failure boundaries can be separate racks with their own networking components.
In medium deployments, a physical node can represent a zone.
Drives as Zones
For smaller deployments with fewer than 4 nodes, drives need to be grouped together to form pseudo-failure boundaries. A grouping of drives is simply declared a zone.
Here is a scheme for starting small and growing the cluster bit-by-bit (well.. terabyte-by-terabyte).
For a single storage node the minimum configuration would have 4 drives for data + 1 boot drive.
If a single drive fails, its data will be replicated to the remaining 3 drives in the system.
The system would grow four disks at a time (one in each zone) until the chassis was full.
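As an illustration, the ring for this single-node layout might be built with `swift-ring-builder`, treating each drive as its own zone. The IP address, port, device names, and partition power below are placeholder assumptions, not a prescribed configuration:

```shell
# Build an object ring: partition power 18, 3 replicas,
# 1 hour minimum between moves of a given partition.
swift-ring-builder object.builder create 18 3 1

# One drive per zone, all on the same physical node.
swift-ring-builder object.builder add z1-192.168.1.10:6010/sdb 100
swift-ring-builder object.builder add z2-192.168.1.10:6010/sdc 100
swift-ring-builder object.builder add z3-192.168.1.10:6010/sdd 100
swift-ring-builder object.builder add z4-192.168.1.10:6010/sde 100

# Distribute partitions across the four zones and write the ring file.
swift-ring-builder object.builder rebalance
```

Growing the chassis is then a matter of repeating the `add` step with one new drive in each zone and rebalancing again.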
2 Storage Nodes
The strategy here is to split the zones evenly across the two nodes.
Adding a second node does increase availability (assuming that load balancing is configured), but it does not create a master-slave configuration. If one of the nodes is down, half of your zones are unavailable.
The good news is that even with one node down (half of your zones), data is still accessible. This is because at least one of the zones holding each object will still be up on the remaining node.
The bad news is that there is still a 1 in 2 chance that writes will fail, because at least two of the three zones holding an object need to be written to for the write to be considered successful.
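That 1-in-2 figure can be checked by enumerating placements. This is a toy model assuming each object lands in 3 of the 4 zones, a majority-write quorum of 2, and that zones 1 and 2 live on the failed node:

```python
from itertools import combinations

ZONES = {1, 2, 3, 4}
DOWN = {1, 2}   # the failed node held zones 1 and 2
QUORUM = 2      # majority of 3 replicas must be written

# Every way an object's 3 replicas can be spread over the 4 zones.
placements = list(combinations(sorted(ZONES), 3))

# A write fails when fewer than QUORUM of its zones are still up.
failed = sum(1 for p in placements if len(set(p) - DOWN) < QUORUM)
print(failed, "of", len(placements), "placements fail")  # -> 2 of 4
```

Half of the four possible placements lose two of their three zones, matching the 1 in 2 chance above.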
3 Storage Nodes
The addition of a third node further distributes the zones across the nodes. Something strange is going on here: zones 1-3 each live whole on a node, but zone 4 is broken into thirds and distributed across the three nodes. This is done to enable smoother rebalancing when going to 4 nodes.
Again, if a single node is down, data will be available, but there will be a 1 in 5 chance that a write would fail.
4 Storage Nodes
The strategy of breaking up zone 4 into thirds at the 3-node stage is what makes this transition easier. The cluster can be reconfigured with zone 4 living entirely on the new server, and then the remaining zones can slowly be rebalanced to fold in the newly vacated drives on their nodes.
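With `swift-ring-builder`, that transition might look roughly like the fragment below. The IPs, ports, and device names are placeholders, and the exact drain sequence depends on your ring; this is a sketch of the idea, not a verified runbook:

```shell
# Add the new (4th) node's drives to zone 4...
swift-ring-builder object.builder add z4-192.168.1.13:6010/sdb 100
swift-ring-builder object.builder add z4-192.168.1.13:6010/sdc 100

# ...then drain zone 4's old thirds on the existing nodes by
# dropping their weight to 0, and rebalance.
swift-ring-builder object.builder set_weight z4-192.168.1.10:6010/sde 0
swift-ring-builder object.builder set_weight z4-192.168.1.11:6010/sde 0
swift-ring-builder object.builder set_weight z4-192.168.1.12:6010/sde 0
swift-ring-builder object.builder rebalance
```

Once replication settles, the drained drives can be removed from zone 4 and re-added to the zones local to their nodes.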
Now, if a single node fails, writes will be successful as at least two zones will be available.
Why Small-Scale Swift?
Using OpenStack Object Storage is a private-cloud alternative to S3, CloudFiles, etc. This enables private cloud builders to start out with a single machine in their own data center and scale up as their needs grow.
Why not use RAID?
Why not use a banana? :) It’s a different storage system, used for different purposes. Going with a private deployment of Object Storage gives something that looks and feels just like Rackspace Cloud Files. App developers don’t need to attach a volume to use the storage system and assets can be served directly to end users or to a CDN.
The bottom line is that a small deployment can transition smoothly into a larger deployment. The great thing about OpenStack being open-source software is that it gives us the freedom to build and design systems however we see fit.
- 8 comments
- Posted under Cloud, OpenStack
Permalink # lotlwc said
Awesome post. What sort of CPU/RAM specs do you use in these nodes? Are you using 2TB or 3TB SATA spindles? Do you have a target cost per GB per month when building one of these clusters?
Permalink # joearnold said
Of course the answer is "it depends". I tend to over-provision a bit on RAM/CPU, as these clusters are generally used for web-based workloads with a lot of concurrent requests. A good price/performance CPU (Xeon E5620) and 24/48 GB of RAM for a 36/48-drive chassis is what I've provisioned. You'll be able to get away with thinner provisioning, but expect extended recovery times when failures do occur.
At scale we target .50-.70 cents / GB with triplicate redundancy and a reasonable amount of front-end “head” units to serve requests.
Permalink # lotlwc said
Thanks for your response. You mention extended recovery times depending on the CPU/RAM config; do you see big differences here? For example, does the difference between 24 and 48 GB of RAM in the nodes mean a 50% recovery variance, or more like 10%? (Of course I'm not expecting an exact answer, just want to get a feel for resource allocation vs. performance.)
When you say you target .50-.70 cents per GB, do you mean capital buy price? So you would take that figure, multiply it by three (for redundancy), add overhead, and divide by 36 months to get rough monthly costs over three years?
Permalink # joearnold said
No. That’s .50-.70 cents a GB usable. Yes, that’s for capital buy price — it doesn’t include rent, power, transit, etc.
As to your first question, if you’re looking to squeeze RAM/CPU down, at least profile convergence times with the configuration. You can watch your cycle times on the replicator processes or the object-auditor (if you’re running it). Also, it depends on how active your object stores are and how many account/container/object processes are required to satisfy the number of requests. This is definitely an area that could use more benchmarking.
Permalink # SquareCows.com » Community Weekly Newsletter (June 24 – July 1) said
[…] Swift in the Small by Joe Arnold – http://joearnold.com/2011/06/27/swift-in-the-small/ […]
Permalink # lotlwc said
Thanks a lot for your reply. It's been useful info to get a bit of a picture before I start playing with this stuff.
Permalink # Nguyen Thinh said
Great post!
If I start with one node (4 HDDs) and the system grows, could I add one more node?
Permalink # joearnold said
Hi Nguyen,
Yes, what you would need to do is either add additional nodes, or break out machines into separate zones. Don’t forget to ‘squeeze’ data by adjusting their weights down to 0 for the drives you’re going to repurpose into other zones. Keep me posted.