Data wrangling for dummies

BitDepth#936 for May 13

Over the last three weeks two large hard drives died in my 22TB archiving system. This four bay drive case will anchor my revamped backup regime. Image courtesy Other World Computing.

By the time you read this I’ll be hip deep in hard drives, wading my way through terabytes of data again, searching for that mythical grail, the perfect digital backup.

I’ve written before about the importance of backing up your digital files, which largely exist as abstract concepts until they are gone forever. But there is a growing number of people who find themselves managing large datasets with no prior sense of what’s at stake.

While the general profile of a computer user suggests someone who creates a dozen or more files in a word processor and perhaps opens or edits a PowerPoint or two, the reality is that people are increasingly banking their lives in unsafe digital deposit boxes.

Prefer your music to be digital rather than on CD? That’s half a terabyte gone there already for the average serious music listener who wants files with minimal compression.

Taking a few snaps with your phone and moving them to your computer? Dabbling in digital photography? Set aside another terabyte, or a major part thereof, for your growing photo collection.

Want to rip your movies to disk so you can view them where you want to? Listen to audiobooks? Love to hoard copies of your favourite podcast? That’s another half-terabyte right there.

It doesn’t take a professional or even a serious amateur to begin to fill a drive capable of holding two thousand gigabytes of data.

If you begin to seriously play around with 3D animation, music creation and production, high-end digital photography or video creation, you can quickly double or triple that need for space.

Mix into that the confusion that most users have about backup. I constantly meet people who have moved their data to an external drive or optical disc and deleted it from their systems. That isn’t a backup; it’s a transfer. A backup needs to exist and be verified as valid on two different storage systems.
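That verification step is the part most people skip. As a minimal sketch of what “verified as valid on two different storage systems” means in practice, the Python below (the function names are my own, not from any particular backup tool) checks that a copy exists and that its checksum matches the original:

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_backup(original, copy):
    """A copy only counts as a backup if both files exist and their
    contents are byte-for-byte identical."""
    original, copy = Path(original), Path(copy)
    return (original.exists() and copy.exists()
            and file_checksum(original) == file_checksum(copy))
```

A file that was moved rather than copied fails this test immediately: the original no longer exists, so there is only one instance, not two.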

Professionals don’t stop there. I’m not happy until critical files are in quadruple redundancy, and I’m currently planning an additional instance.

When you work with large datasets, you need to think about backup in tiers.

I use an incremental backup system that’s built into every Mac called Time Machine. Alternatives exist for PCs.

Time Machine continuously monitors your working drives and saves a copy of every changed document on a quite brisk schedule, hourly by default.

In my tower, it’s backing up a 240GB SSD, a 120GB SSD and two 2TB hard drives. To make all that fit on a 4TB drive, I’d created a large exclusion list, one I came to regret when the storage drive that holds my work in progress died at dawn two weeks ago.

The restoration took four hours, but then I had to face all the stuff I hadn’t set to back up, and realised that changes were called for.

Time Machine works with large datasets, but requires a commensurately larger pool to keep all its versions. My new Time Machine safety net is an 8TB striped RAID 0 volume built from two 4TB drives.

I plan backup in three tiers. The incremental backup is a safety net for all the stuff that’s active on the system.

At the next level is the nearline archive, normally on large hard drives on a fast connection that’s available at the flick of a switch. At this level I mirror (rather crudely, I must admit) all the data to two drives from two different manufacturers.
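That crude mirroring can be sketched in a few lines of Python. The mount points below are hypothetical stand-ins for my two nearline drives, and the copy logic is deliberately simple: anything missing or older on the destination gets copied over.

```python
import shutil
from pathlib import Path

# Hypothetical mount points; substitute your own source and mirror drives.
SOURCE = Path("/Volumes/Archive")
MIRRORS = [Path("/Volumes/MirrorA"), Path("/Volumes/MirrorB")]


def mirror_tree(source, dest):
    """Copy any file that is missing, or older, on the destination."""
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        dst = dest / src.relative_to(source)
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 preserves timestamps


if SOURCE.exists():
    for mirror in MIRRORS:
        mirror_tree(SOURCE, mirror)
```

Using drives from two different manufacturers, as the column suggests, hedges against a bad batch taking out both mirrors at once.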

Then there is the deep archive, where all the data comes to rest finally on hard drives and, soon, Blu-ray optical discs. Someday soon cloud storage, probably with CrashPlan, will join that list.

Consider backup tiers in terms of the ageing of data.

The first tier, or safety-net backup, meets the need for reasonably reliable backup of work in progress. From here you can recover work that’s minutes or hours old.

Most content creators will need their data nearby, and that’s what the second tier serves. In my case, I reference it for images that are more than two years old when projects or clients require them. Most of this work is months or years old.

For deep archive, off-location is best and can be done with slower media like cloud storage or even LTO tape. Deep archives may miss more recent work, but otherwise track data all the way back to the beginning of the backup process.

Some warnings. All media will die eventually. Hard drives crash, and enterprise-class drives seem to last no longer than consumer versions. Optical media becomes unreadable. Media formats become obsolete: remember Jaz, SyQuest, Zip?

Unremitting paranoia and distrust are your best friends in data management. Current best practices for data migration suggest that three years is the outer limit for trusting the reliability of a hard drive. Optical media runs to between five and ten years. So if you live by your media files, back up, then back up again, and migrate on a set schedule.
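A set migration schedule is easy to automate. The sketch below turns the column’s rough trust windows (three years for hard drives, five at the conservative end for optical media) into a simple due-date check; the function and the windows are illustrative, not a standard.

```python
from datetime import date, timedelta

# Rough trust windows from the column: ~3 years for hard drives,
# 5-10 years for optical media (use the conservative end).
TRUST_WINDOW = {
    "hard drive": timedelta(days=3 * 365),
    "optical": timedelta(days=5 * 365),
}


def migration_due(media_type, purchased, today=None):
    """True once a drive or disc has outlived its trust window."""
    today = today or date.today()
    return today - purchased >= TRUST_WINDOW[media_type]
```

Run against an inventory of drives and purchase dates, a check like this flags which media to migrate before, rather than after, they fail.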


Google on media reliability (PDF)

ZD Net discusses Backblaze’s media reliability claims with links to the source material.