Sunday, July 3, 2011

Cheap, fast and secure storage

This post is about how I protect important data at home. There are a lot of appliances out there with nice front ends, but most of them do not store data on ZFS or anything with the same level of data protection. Also, if you have anything but trivial space requirements they tend to get expensive fast, and they lack the option to easily enhance performance by, for example, adding an SSD as a file cache or adding other hardware. I've built my ZFS storage server around a quad-core AMD CPU, 8GB of ECC memory, a SAS HBA and a bunch of large SATA disks in hot-swap bays.

All my data now lives on this storage server, protected against accidental deletion, bit rot, disk failures and fire. I use NFS/CIFS for ordinary file data and iSCSI for my Aperture photo libraries and Time Machine backups.
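For anyone curious, the sharing side is mostly a matter of ZFS properties plus a zvol exported through COMSTAR. A minimal sketch, where the pool/dataset names and the 200G size are just examples of my own:

  zfs create tank/files
  zfs set sharenfs=on tank/files          # NFS export
  zfs set sharesmb=on tank/files          # CIFS export (needs the kernel SMB service enabled)

  zfs create -V 200G tank/aperture        # zvol backing the Aperture library
  sbdadm create-lu /dev/zvol/rdsk/tank/aperture
  stmfadm add-view <lu-guid>              # GUID printed by sbdadm; makes the LU visible
  itadm create-target                     # iSCSI target the Mac connects to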

All data is stored in one large raidz2 pool, so there are two parity disks, allowing any two disks to fail without data loss. Since ZFS checksums all data, I know it is intact when it is read and after the bi-weekly data scrubs. The most important reason for using raidz2 is that disks have become so big that there is a real risk of an unrecoverable read error during a resilver, when all data is read; the second parity disk makes data loss in that situation highly unlikely. Snapshots make it possible to do a quick rollback if any person or piece of software should damage or remove data, which is also very useful when transforming large amounts of data with an uncertain outcome. This keeps data safe from most user errors, disk errors and controller errors, but fire and major user errors or sabotage (zfs destroy -r) could still make me lose data.
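For reference, this whole scheme boils down to a handful of commands; the pool name, device names and snapshot names below are only illustrative:

  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  zpool scrub tank                        # kicked off every other week from cron
  zfs snapshot -r tank@2011-07-03         # cheap read-only snapshots
  zfs rollback tank/photos@2011-07-03     # quick recovery from a bad change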

To avoid the latter two scenarios I mark my most important datasets with a flag, and a script streams them using ZFS send/receive to external disks over eSATA. The disks are then moved to a second physical location. I can currently fit all critical data on one large SATA disk, which makes this cheap and easy; I exclude ISO images and virtual machine disks that I only use for testing. A full backup of the critical data takes about 3 hours today, limited by the backup disks, which can write data at about 80-90MB/s. By using incremental ZFS send the time goes down considerably, as only the delta between two snapshots needs to be transferred.
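The flag is simply a ZFS user property, so the script can find the marked datasets with zfs get. Roughly like this; the property name and pool names are made up for the example:

  zfs set se.home:backup=on tank/photos                 # mark a dataset for offsite backup
  zfs get -rH -s local -o name se.home:backup tank      # list all marked datasets
  zfs snapshot -r tank/photos@backup-2011-07-03
  zfs send -R tank/photos@backup-2011-07-03 | zfs receive -d backup1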

To be able to recover individual files, and to recover parts of the data even if the disk has errors, the streams are received into a zpool on the backup disk. By rotating several disks I always have at least one at another location, and it also gives me multiple backup versions. I considered storing encrypted ZFS streams on the disks instead, but then it is not possible to recover individual files, and if the stream is damaged it becomes useless.
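Since the backup disk holds a real pool, an incremental run is just another send/receive, and single files can be pulled straight out of the received snapshots. Again a sketch with invented names:

  zpool import backup1                                  # attach the eSATA backup disk
  zfs send -R -i tank/photos@backup-2011-06-01 tank/photos@backup-2011-07-03 \
      | zfs receive -d backup1
  ls /backup1/photos/.zfs/snapshot/backup-2011-07-03/   # browse and copy individual files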

In an ideal world I would have another node set up that receives the incremental ZFS streams over the network, but that is overkill for my current usage and I have no secondary site with good bandwidth (and another storage server).
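If I ever do, it would not be much more than piping the same incremental streams over ssh, something like this (the host and pool names are of course fictional):

  zfs send -R -i tank/photos@backup-2011-06-01 tank/photos@backup-2011-07-03 \
      | ssh backuphost zfs receive -d remotepool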

This setup gives me the following redundancy:
  • Integrity of all data is verified every two weeks
  • Data has several read-only snapshots from different times
  • Data is protected by two-disk-parity raidz2
  • Accessed data is always verified by checksums
  • Offsite backups allow disaster recovery
  • Backups are also checksummed
  • Memory is ECC protected to prevent data corruption
This is all fine, but there is still one single point of failure: if a serious bug were to creep into the ZFS code it could be replicated to all snapshots and pools. Given the amount of testing ZFS has gone through that seems unlikely, but here tape backups over NDMP would be of good use. Since I do not have any tape hardware, all important data is instead copied with rsync to a disk with an old-school filesystem once every other month.
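The rsync copy itself is nothing fancy; the source and destination paths here are just placeholders:

  rsync -a --delete /tank/photos/ /mnt/extdisk/photos/  # plain copy to a non-ZFS filesystem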

On top of this I also take advantage of other ZFS features: a cheap SSD is used as L2ARC to accelerate various workloads, and compression/de-duplication is, as always, available with ZFS. It is also possible to add new hardware to the setup without buying a different server or license: 10GbE, Fibre Channel, more SSD cache and more RAM for cache/dedup can easily be added. That would probably not be possible with a pre-built NAS appliance, or at least not as cheaply.
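Adding the cache device and turning these features on is a one-liner each; device and dataset names are again just examples:

  zpool add tank cache c5t0d0             # SSD as L2ARC
  zfs set compression=on tank/documents
  zfs set dedup=on tank/vmimages          # the dedup table wants plenty of RAM/L2ARC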

I am evaluating the beta of OpenIndiana 151 on the storage server after upgrading from the now-dead OpenSolaris distribution (I would not have tested a beta release without all these backups in place), and so far everything works fine. Solaris 11 Express can also be used, but that requires a license from Oracle costing about $1000/year; in return you get ZFS crypto and a few other ZFS features not available in the open ZFS code base.

The technical features are better than those of most storage appliances, but OpenIndiana/Solaris 11 Express provides no web-based administration. There are, however, add-ons such as napp-it, and commercial ZFS storage appliance software such as NexentaStor, which has a free community edition for up to 18TB of used storage.

I have worked on designing and implementing various similar solutions, from small office filers to larger data archives with 96 disks. I work part time as a consultant, so I am available to assist in similar projects.

2 comments:

piersdd said...

I'm curious about your opinion of the performance of Aperture libraries over iSCSI, with SSD L2ARC, on an 8-drive Z2 pool.
I set up a similar machine about a year ago using Nexenta, and a couple of SLC SSDs for a separate ZIL (slog). I had very high expectations for performance that, in the end, I do not think were met. Sadly I can't remember specific throughputs.
I'd also be very interested in what block size you would recommend for the HFS volume, and how you aligned it to the iSCSI block size.
Regards,
Piers
Tasmania, Australia

Henkis said...

I am not doing any heavy work in Aperture, but even untuned it feels snappy. If I do a batch conversion I get between ~40 and 75MB/s over gigabit Ethernet, and I know the network stacks could use some tuning.

I can see some small latency (fractions of a second) if I scroll fast in browse mode showing thousands of previews that have not been displayed before.

What kind of problems did you encounter: slow overall or just for specific tasks? How much memory did your storage node have?

Overall it feels even faster when the ARC and L2ARC are warm.