What’s New in Icehouse Storage

The OpenStack 2014.1 (Icehouse) release introduces many important new features across the OpenStack storage services, including advanced block storage Quality of Service, a new API to support disaster recovery between OpenStack deployments, a new advanced multi-location strategy for the OpenStack Image service, and many improvements to authentication, replication, and metadata in OpenStack Object Storage.

Here is a sneak peek at the upcoming Icehouse release:

Block Storage (Cinder)
The Icehouse release includes many quality and compatibility improvements: improved block storage load distribution in the Cinder scheduler, replacement of the Simple and Chance schedulers with FilterScheduler, updated TaskFlow support in volume create, and support for quota delete. It also adds automated FC SAN zone and access control management for Fibre Channel volumes, which reduces pre-zoning complexity in cloud orchestration and prevents unrestricted fabric access.

Here is a closer look at some of the key new features in Block Storage:

Advanced Storage Quality of Service
Cinder supports multiple back-ends. By default, volumes are allocated across back-ends to balance allocated space. Cinder volume types give finer-grained control over where volumes are allocated. Each volume type contains a set of key-value pairs called extra specs. Volume types are typically defined in advance by the admin and can be managed using the cinder client. The extra spec keys are interpreted by the Cinder scheduler and used to make placement decisions according to the capabilities of the available back-ends.

Volume types can be used to provide users with different storage tiers that offer different performance levels (such as an HDD tier, a mixed HDD-SSD tier, or an SSD tier), different resiliency levels (a choice of RAID levels), and different features (such as compression).
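The matching of extra specs against back-end capabilities can be pictured with a small sketch. This is a simplified, hypothetical model and not Cinder's actual scheduler code: the function names, the capability keys (`tier`, `compression`), and the "most free space wins" rule are all assumptions made for illustration.

```python
# Hypothetical model of extra-spec matching; not Cinder's real FilterScheduler.

def backend_matches(extra_specs, capabilities):
    """Return True if the back-end reports every extra spec with the same value."""
    return all(capabilities.get(key) == value for key, value in extra_specs.items())


def pick_backend(extra_specs, backends):
    """Keep back-ends that satisfy the extra specs, then prefer the most free space."""
    candidates = [b for b in backends if backend_matches(extra_specs, b["capabilities"])]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b["free_gb"])["name"]


# Illustrative back-ends and volume types (made-up names and values).
BACKENDS = [
    {"name": "hdd-1", "free_gb": 900, "capabilities": {"tier": "hdd", "compression": "false"}},
    {"name": "ssd-1", "free_gb": 200, "capabilities": {"tier": "ssd", "compression": "true"}},
    {"name": "ssd-2", "free_gb": 350, "capabilities": {"tier": "ssd", "compression": "true"}},
]

VOLUME_TYPES = {
    "gold": {"tier": "ssd", "compression": "true"},
    "bronze": {"tier": "hdd"},
}
```

With this data, a "gold" volume would be placed on one of the SSD back-ends and a "bronze" volume on the HDD back-end; a type whose extra specs no back-end satisfies yields no placement at all.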

Users can then specify the tier they want when creating a volume. The Volume Retype capability added in the Icehouse release extends this functionality by allowing users to change a volume’s type after its creation. This is useful for changing quality of service settings (for example, for a volume that sees heavier usage over time and needs a higher service tier). It also supports changing volume type feature properties (such as marking a volume as compressed or uncompressed).

The new API allows vendor-supplied drivers to support migration when retyping. The migration is policy-based, and the policy is passed via scheduler hints.
When retyping, the scheduler checks whether the volume’s current host can accept the new type. If the current host is suitable, its manager is called, which in turn calls the driver to change the volume’s type.
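The retype decision described above can be sketched roughly as follows. This is an illustrative model only: the function names, the data layout, and the policy values `"never"` and `"on-demand"` are assumptions for the sketch, not Cinder's actual manager or driver interface.

```python
# Hypothetical sketch of the retype flow; not Cinder's real code.

def host_accepts(capabilities, extra_specs):
    """A host accepts a type if it reports every extra spec with the same value."""
    return all(capabilities.get(k) == v for k, v in extra_specs.items())


def retype(volume, new_type, hosts, migration_policy="never"):
    """Change a volume's type in place if possible, else migrate if policy allows."""
    specs = new_type["extra_specs"]

    # First choice: the current host can satisfy the new type.
    if host_accepts(hosts[volume["host"]], specs):
        volume["type"] = new_type["name"]
        return "retyped in place"

    # Otherwise migration is needed, which the policy must permit.
    if migration_policy != "on-demand":
        return "rejected"

    for name in sorted(hosts):
        if name != volume["host"] and host_accepts(hosts[name], specs):
            volume["host"] = name
            volume["type"] = new_type["name"]
            return "migrated"
    return "rejected"
```

In this model a volume on an HDD-only host that is retyped to an SSD type is rejected under the default policy and moved to a suitable host only when migration is explicitly allowed.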

A Step towards Disaster Recovery
An important disaster recovery building block was added to the Cinder Backup API in the Icehouse release. It allows recovery if your OpenStack cloud deployment goes up in flames, falls off a cliff, or suffers any other event that leaves the service state corrupted. Cinder Backup can already back up volume data. However, real disaster recovery between OpenStack deployments requires complete restoration of volumes to their original state, including the Cinder database metadata. The Cinder Backup API was extended in Icehouse to support this alongside the existing backup/restore API.

The new functionality includes:
1. Export and import of backup service metadata
2. Client support for exporting and importing backup service metadata
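The idea of exporting backup metadata from one deployment and importing it into another can be illustrated with a toy model. Everything here is an assumption made for illustration: the in-memory "database", the function names, and the base64-encoded JSON record format are hypothetical and do not describe the Cinder Backup API itself.

```python
import base64
import json

# Hypothetical model: each deployment's backup metadata lives in a plain dict.

def export_backup_record(db, backup_id):
    """Serialize one backup's metadata into an opaque, portable string."""
    payload = json.dumps(db[backup_id], sort_keys=True).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


def import_backup_record(db, encoded):
    """Recreate the metadata entry in another deployment's database."""
    record = json.loads(base64.b64decode(encoded).decode("utf-8"))
    db[record["id"]] = record
    return record["id"]


# Made-up example: the backup data already sits in shared backup storage,
# but only the source deployment's database knows how to find it.
source_db = {
    "backup-1": {"id": "backup-1", "volume_id": "vol-42", "container": "backups", "size_gb": 10}
}
recovery_db = {}
```

Once the exported record has been imported, the recovery deployment holds the same metadata as the source and could, in principle, locate and restore the backed-up data.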

This capability sets the scene for the next planned step in OpenStack disaster recovery: extending the Cinder API to support volume replication, which is currently in the works for the Juno release.

OpenStack Image Service (Glance)

The Icehouse release has many image service improvements that include:

  • Enhanced support for NFS servers as a back-end, allowing users to configure multiple NFS servers as a back-end using the filesystem store and to mount disks to a single directory.
  • Improved image size attribution. A new virtual_size attribute was added to resolve the confusion between the size of the uploaded file and the virtual size of the image: `size` refers to the file size and `virtual_size` to the image’s virtual size. The latter is useful only for some image types, such as qcow2. This change makes the values easier to consume from Nova, Cinder, and other tools that rely on them.
  • Better calculation of storage quotas
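The difference between `size` and `virtual_size` can be made concrete for qcow2, whose header stores the virtual disk size as a big-endian 64-bit integer at byte offset 24. The sketch below reads that field from raw bytes; the function names are hypothetical, and this is not a description of how Glance itself populates the attribute.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"  # first four bytes of every qcow2 image


def qcow2_virtual_size(header):
    """Return the virtual disk size in bytes, or None if this is not qcow2."""
    if len(header) < 32 or header[:4] != QCOW2_MAGIC:
        return None
    # Bytes 24..31 of the qcow2 header hold the virtual size (big-endian uint64).
    (virtual_size,) = struct.unpack(">Q", header[24:32])
    return virtual_size


def describe_image(data):
    """Report both attributes: bytes actually uploaded vs. disk size seen by a guest."""
    return {"size": len(data), "virtual_size": qcow2_virtual_size(data[:32])}
```

A sparse qcow2 file of a few hundred kilobytes can therefore report a virtual size of many gigabytes, which is the number a consumer needs when sizing a disk or volume for the image.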

Advanced Multi-Location Strategy
Support for multiple locations was introduced to Glance in the Havana release, enabling the image domain object to fetch data from multiple locations and allowing the Glance client to consume an image from multiple back-end stores. Starting with the Icehouse release, a new image location selection strategy was added to the Glance image service to select the best back-end storage. Another benefit is improved consumption performance: end users can consume images faster, both in terms of ‘download’ transport handling on the API server side and on the Glance client side, which obtains locations through the standard ‘direct URL’ interface. The new strategy selection functions are shared between the API server side and the client side.
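One way to picture a location selection strategy is as an ordering function over an image's locations. The sketch below ranks locations by a preferred list of URL schemes; the function name, the data layout, and the example URLs are assumptions for illustration rather than Glance's actual strategy modules.

```python
from urllib.parse import urlparse


def order_locations(locations, store_preference):
    """Sort locations so that preferred store types (URL schemes) come first.

    Locations whose scheme is not listed keep their original relative order
    and are placed after all preferred ones.
    """
    rank = {scheme: index for index, scheme in enumerate(store_preference)}
    return sorted(
        locations,
        key=lambda loc: rank.get(urlparse(loc["url"]).scheme, len(rank)),
    )


# Made-up locations for a single image.
LOCATIONS = [
    {"url": "rbd://pool/image-abc"},
    {"url": "http://mirror.example.com/image-abc"},
    {"url": "file:///var/lib/glance/images/image-abc"},
]
```

Because the same function can run on the API server and in the client, both sides would try the locations in the same order, which matches the idea of sharing the strategy code between them.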

OpenStack Object Storage (Swift)
Although the biggest Swift feature originally targeted for Icehouse (Storage Policies) is now planned to land only in Juno, there are many other improvements to authentication, replication, and metadata. Here is a closer look at some of the key new features you can expect to see in Swift with the Icehouse release:

Account-level ACLs and ACL format v2 (TempAuth)
Accounts now have a new privileged header to represent ACLs or any other form of account-level access control. The value of the header is a JSON dictionary string to be interpreted by the auth system.
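As a rough illustration, the JSON dictionary carried by the ACL header could be built like this. The header name `X-Account-Access-Control` and the keys `admin`, `read-write`, and `read-only` follow the TempAuth ACL format as I understand it, but treat them, along with the helper name and user names, as assumptions of this sketch.

```python
import json


def build_account_acl_header(admin=(), read_write=(), read_only=()):
    """Return a header dict whose value is a JSON-encoded account ACL."""
    acl = {}
    if admin:
        acl["admin"] = list(admin)
    if read_write:
        acl["read-write"] = list(read_write)
    if read_only:
        acl["read-only"] = list(read_only)
    # The auth system, not Swift core, is responsible for interpreting this value.
    return {"X-Account-Access-Control": json.dumps(acl, sort_keys=True)}


# Hypothetical users.
headers = build_account_acl_header(admin=["alice"], read_only=["bob", "carol"])
```

The resulting header would be sent with a request to the account by a sufficiently privileged user; how the levels map to permissions is up to the auth system in use.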

Container to Container Synchronization
Support for sync realms was added to simplify the configuration of container sync. Container sync mirrors all the contents of a container to another container through background synchronization, and a sync realm is a set of clusters that have agreed to allow container syncing with each other. Swift cluster operators configure their cluster to accept sync requests to and from other clusters, and users specify where to sync their container along with a secret synchronization key.

The realm key is the overall cluster-to-cluster key, used in combination with the external users’ key that they set in their containers’ X-Container-Sync-Key metadata header values. These keys are used to sign each request the container sync daemon makes and to validate each incoming container sync request. The swift-container-sync daemon sends updates to the remote container. It does this by scanning the local devices for container databases and checking for x-container-sync-to and x-container-sync-key metadata values. If they exist, rows newer than the last sync trigger PUTs or DELETEs to the other container.
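Signing a request with both keys can be sketched with an HMAC. The exact fields and their order in Swift's real signature may differ; the message layout, the function names, and the example values below are assumptions chosen to show the principle that a valid signature needs both the realm key and the user's container key.

```python
import hashlib
import hmac


def sign_sync_request(method, path, timestamp, nonce, realm_key, user_key):
    """Sign a sync request; the realm key is the HMAC key, the user key is in the message."""
    message = "\n".join([method, path, timestamp, nonce, user_key]).encode("utf-8")
    return hmac.new(realm_key.encode("utf-8"), message, hashlib.sha1).hexdigest()


def verify_sync_request(signature, method, path, timestamp, nonce, realm_key, user_key):
    """Recompute the signature on the receiving cluster and compare in constant time."""
    expected = sign_sync_request(method, path, timestamp, nonce, realm_key, user_key)
    return hmac.compare_digest(signature, expected)


# Made-up request details and keys.
sig = sign_sync_request(
    "PUT", "/v1/AUTH_test/mirror/object", "1400000000.00000", "nonce-1",
    realm_key="realm-secret", user_key="container-secret",
)
```

A receiving cluster that knows the realm key and reads the user key from the destination container's metadata can recompute the signature and reject any request that was altered or signed with the wrong keys.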

Object Replicator – Next Generation leveraging “SSYNC”
The Object Replicator in Swift encapsulates most of the logic and data needed by the object replication process. Icehouse introduces a new replicator implementation that is intended to replace good old RSYNC with back-end PUTs and DELETEs. The initial implementation of object replication simply performed an RSYNC to push data from a local partition to all remote servers it was expected to exist on. While this performed adequately at small scale, replication times skyrocketed once directory structures could no longer be held in RAM.

We now use a modification of this scheme in which a hash of the contents for each suffix directory is saved to a per-partition hashes file. The hash for a suffix directory is invalidated when the contents of that suffix directory are modified.
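The suffix-hash scheme can be modeled in a few lines. This is an in-memory sketch under stated assumptions: the class and function names are hypothetical, file names stand in for directory contents, and a plain dict stands in for the per-partition hashes file.

```python
import hashlib


class PartitionHashes:
    """Toy model of one partition: suffix directories plus cached hashes."""

    def __init__(self, suffixes):
        # suffixes maps a suffix directory name to the file names it contains.
        self.suffixes = {name: list(files) for name, files in suffixes.items()}
        self.hashes = {}  # stands in for the per-partition hashes file

    def add_file(self, suffix, filename):
        """Modify a suffix directory and invalidate its cached hash."""
        self.suffixes.setdefault(suffix, []).append(filename)
        self.hashes[suffix] = None

    def get_hashes(self):
        """Recompute only the hashes that are missing or invalidated."""
        for suffix, files in self.suffixes.items():
            if self.hashes.get(suffix) is None:
                digest = hashlib.md5()
                for name in sorted(files):
                    digest.update(name.encode("utf-8"))
                self.hashes[suffix] = digest.hexdigest()
        return dict(self.hashes)


def suffixes_to_sync(local, remote):
    """Return the suffix directories whose hashes differ from the remote copy."""
    remote_hashes = remote.get_hashes()
    return sorted(s for s, h in local.get_hashes().items() if remote_hashes.get(s) != h)
```

Comparing a handful of hashes per partition replaces a walk of the whole directory tree, so only the suffix directories that actually changed need to be pushed to the remote server.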

SSYNC is the new replication method. Its primary goals are reliability, correctness, and speed in syncing extremely large filesystems over fast, local network connections. With SSYNC, RSYNC is not used at all; instead, all-Swift code is used to transfer the objects. At first, SSYNC will just strive to emulate the RSYNC behavior. Once deemed stable, it will open the way for future improvements in replication, since it will be easy to add code in the replication path instead of trying to alter the RSYNC code base and distribute such modifications. For now, it is NOT RECOMMENDED to use SSYNC on production clusters. It is an experimental feature. In its current implementation it is probably going to be a bit slower than RSYNC, but if all goes according to plan it will end up much faster.

Other notable Icehouse improvements that were added include:

  • Swift proxy server discoverable capabilities, which allow clients to retrieve configuration information programmatically. The response includes information about the cluster and can be used by clients to determine which features the cluster supports.
  • Early quorum responses, which allow the proxy to respond to many types of requests as soon as it has a quorum.
  • Removal of the python-swiftclient dependency.
  • Support for system-level metadata on accounts and containers.
  • New swift-container-info and swift-account-info tools.
  • Various bug fixes, such as a fix for the ring-builder crash when a ring partition was assigned to deleted, zero-weighted, and normal devices.
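The early quorum idea can be sketched as follows. The simple-majority quorum rule and the function names are assumptions of this sketch, and the real proxy logic handles many more cases (timeouts, status classes, handoff nodes).

```python
def quorum_size(replica_count):
    """Simple majority: more than half of the replicas (an assumed rule)."""
    return replica_count // 2 + 1


def respond_early(statuses, replica_count):
    """Consume back-end statuses in arrival order and stop at the first quorum.

    Returns (status, responses_seen); status is None if no quorum was reached.
    """
    needed = quorum_size(replica_count)
    counts = {}
    seen = 0
    for status in statuses:
        seen += 1
        counts[status] = counts.get(status, 0) + 1
        if counts[status] >= needed:
            return status, seen  # answer the client without waiting for the rest
    return None, seen
```

With three replicas, two matching responses are enough, so a slow third storage node no longer delays the reply to the client.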

Other key Swift features that made good progress in Icehouse and will probably land in the Juno release include erasure-coded storage in addition to replicas, which will enable a cluster to reduce its overall storage footprint while maintaining a high degree of durability, and sharding of large containers. As containers grow, their performance suffers; sharding them (transparently to the user) would allow containers to grow without bound.
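The footprint argument for erasure coding comes down to simple arithmetic. The sketch below compares raw storage used per byte of user data; the function names and the example parameters (3 replicas, 10 data plus 4 parity fragments) are illustrative assumptions and not settings taken from Swift.

```python
def replication_overhead(replicas):
    """Raw bytes stored per byte of user data with full replicas."""
    return float(replicas)


def erasure_coding_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per byte of user data with erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments


# Illustrative comparison: triple replication vs. a 10+4 erasure code.
replica_cost = replication_overhead(3)
ec_cost = erasure_coding_overhead(10, 4)
```

Under these assumed parameters, erasure coding stores well under half the raw bytes that triple replication does, while still tolerating the loss of several fragments.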