The blog of dlaa.me

Search: backup

Back to backup [Revisiting and refreshing my approach to backups]

A topic that comes up from time to time is how people deal with backing up their data. It's one of those times again, so I'm sharing my technique for the benefit of anyone who's interested. My previous post on backup strategy was written over 11 years ago (!), and it's surprising how much is still relevant. This post expands on the original and includes my latest practices. Of course, we all have different priorities, restrictions, and levels of aversion to data loss, so what I describe here doesn't apply universally. Just take whatever's relevant and ignore the rest. :)

Considerations

Dependability - Hardware eventually fails. Things with moving parts tend to die sooner, but even solid-state drives have problems in the long run. Sometimes, the risk of failure is compounded because a storage device is so tightly integrated with the computer that it can become unusable due to failure of an unrelated component. And even if hardware doesn't break on its own, events like a lightning strike can take things out unexpectedly.

Accidental deletion or overwrite - As careful as one might be, every now and then it's possible to accidentally delete an important file. Or maybe just open it, make some inadvertent edits, and absentmindedly save them. This is a scenario many backup strategies don't account for; if an unwanted change to the original data is immediately replicated to a backup, it can be difficult to recover from the mistake.

Undetected corruption - A random hardware or software failure (or power loss) can invisibly result in a garbled file, directory, or disk. The challenge is that corruption can go undetected for a long time until something happens to need the relevant bits on disk. It's easy to propagate corruption to a backup copy when you don't realize it's present.

Local disaster - Fires and tornadoes are rare, but they can destroy everything in the house. Even with a perfect backup on a spare drive, if that drive was in the same place at the time of the disaster, the original and the backup are both lost. The odds of avoiding this problem improve slightly by keeping drives in different rooms, but a sufficiently serious calamity will not be deterred.

Large-scale disaster - The likelihood is quite low, but if a flood (as a timely example) comes through the area, it won't matter how many backups are stored around the house because all of them will be underwater. The only protection is to store things offsite, preferably far away.

Security/privacy - Most people won't be the target of industrial espionage, but a nosy house guest can be just as problematic - especially so because they have prolonged access to your data. If thieves steal a computer, they can read its drives. Bank records, tax documents, source code, etc., can all be compromised if precautions haven't been taken to encrypt the original and all backups.

Bandwidth - Many people use services that store data in the cloud. With a fast-enough connection to the Internet, this works well. But typical Internet plans have limited upstream bandwidth, meaning it can take hours to upload a single video. Things catch up over the course of a few days, but the latest data can be lost if service is abruptly cut off.

Validation - In order to be sure backups are successful, it's necessary to test them - regularly. Otherwise, you might not be as well prepared as you thought. Some backup approaches are easy to verify, others not so much.

Ease of recovery - In the event of a complete restore from backup, things have probably gone very wrong and any additional hardship is of little concern. That said, having immediate access to all data is a nice perk.

Cross-platform support - Whether you prefer Windows, Mac, or Linux, you'll probably stay with your OS of choice after catastrophe strikes. But it's nice to have a backup strategy where data can be recovered from a different platform. If your threat model includes a global virus that wipes out a particular operating system, cross-platform support is important.

Implementation

All drives storing data I care about are encrypted with BitLocker. This mitigates the risk of theft/tampering and means I can leave backup drives anywhere. Each backup drive is a 1 or 2 TB 2.5" external USB drive. These devices are nice because they don't need a separate power supply and they're inherently portable and resilient - especially when stored inside a waterproof Ziploc bag.

At the end of each day, I mirror the latest changes from the primary drive in my computer to a backup drive sitting next to it. This step protects against hardware failure (high risk), and this is the drive I'll take if I need to leave home in a hurry.

Every couple of months, I calculate a checksum for every file on the primary drive and save the results to that drive. Then I mirror to the backup, calculate checksums for everything on the backup, and confirm that all the checksums match. This verifies that every bit on both drives was written correctly and is still readable. Files that haven't changed also help detect bit rot, because their checksums should remain stable. Once everything is verified, I swap the backup drive with another just like it at a nearby location (ex: friend's house, safe deposit box, etc.). This protects against theft or fire (lower risk).
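
Any tool that can hash a tree of files works for this step; purely as an illustration, a verification pass might look something like the following Python sketch (the drive paths are made up, and SHA-256 is just one reasonable checksum choice). A real pass would also save the hashes to a file on each drive, as described above, so later runs can compare against earlier ones.

# Illustrative sketch only: hash every file under two roots and report mismatches.
# Unchanged files double as a bit-rot check, since their hashes should never change.
import hashlib
from pathlib import Path

def hash_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hash."""
    hashes = {}
    for file in (p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with open(file, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        hashes[str(file.relative_to(root))] = digest.hexdigest()
    return hashes

def verify(primary, backup):
    a, b = hash_tree(Path(primary)), hash_tree(Path(backup))
    for path in sorted(a.keys() | b.keys()):
        if a.get(path) != b.get(path):
            print("MISMATCH:", path)

verify("D:/", "E:/")  # hypothetical primary and backup drive roots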

Once a year, I swap the latest backup drive with another drive in a different geographic region (ex: relative's house). This protects against large-scale disaster like a flood (much lower risk), so the fact that it can be up to a year out of date is acceptable.

Because I've been doing this for a while, I also have a couple of spare drives that don't get updated regularly. These provide access to even older backups and can be used to recover outdated versions of a file in the event of persistent, undetected corruption or significant operator error.

Conclusion

This system addresses the above considerations in a way that works well for me. Extra redundancy provides peace of mind and requires very little effort most of the time. Cost is low and every step is under my control, so I don't need to worry about recurring fees or vendors going out of business. Recovery is simple and my family knows how to access backups if I'm not available. Remote backup drives consume no power, can live in the back of a drawer, and can be lost or destroyed without concern.

Nota bene

When I did the math on storage requirements a decade ago, I was concerned about capacity needs outpacing storage advances. But that didn't happen. In practice, the price of a drive with room for everything has stayed around (or below) $100. I buy a new drive every couple of years, meaning the amortized cost is minimal.

Tags: Technical

A brief bit 'bout backups [My current backup strategy]

I've seen a few references to backup strategies on blogs and discussion lists lately and thought I'd write a bit about the strategy I recently decided on and implemented. Of course, everyone has their own approach to file management, their own comfort level for security, and their own ideas about what's "best". That's life and I'm not going to try to persuade anyone that my way is better than their way - but I will outline my way in case it's useful for others, too. :)

The setup: My machine is running Windows 2003 Server and I try to keep as much unnecessary stuff off it as possible (no games, no P2P programs, no weird drivers, etc.). Along the same lines, all user accounts on the server are members of the restricted-access Users group, not the Administrators group. The machine has one hard drive for storing the operating system and all programs (60 GB) and another hard drive for storing all data (320 GB).

The data drive has a Mirror directory under which all data to be backed up is stored. The Mirror directory is ACLed to allow the Users group read/write access, and non-private subdirectories of it are shared out for read-only access by Users. I have an external USB 2.0 drive enclosure for backing up to (200 GB) that is normally powered off and that I mirror the Mirror directory to every couple of days or so. The external drive is ACLed to allow only members of the Backup Operators group to make changes.

My data consists of the usual personal stuff (email, source code, etc.), all digital photos I've ever taken, all digital video I've ever taken, sentimental stuff (like wedding videos, baby's ultrasound video, etc.), and some of my music collection in WMA Lossless format. Very little data changes day-to-day, so a simple tool like RoboCopy (free with the Windows 2003 Resource Kit) is more than enough to keep the backup directory in sync (use RoboCopy's /MIR switch to make this easy). Along with the rest of the data is a file that records the MD5 hash of every file in the backup.

As my data storage needs increase (which they do each time I take a picture or shoot a video!), I'll eventually buy a new large hard drive and swap it for the smaller of the two data drives currently in use. As long as my storage needs don't grow too rapidly, I'm figuring the cost of upgrading to be about $100 each year (that's the cost of a mid-sized drive like the 320 GB one I purchased a few months ago). I'm counting on storage capacity to continue increasing like it has so that I'll always be able to buy $100 drives when I need to increase the storage space.
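
In case RoboCopy (mentioned above) is unfamiliar: the /MIR switch simply makes the destination identical to the source - copy anything new or changed, delete anything that no longer exists on the source. Purely as an illustration (and nowhere near as capable as RoboCopy, which also handles attributes, ACLs, and retries), that idea looks something like the following Python sketch; the paths are made up:

# Illustrative sketch of what a /MIR-style mirror does; not a replacement for RoboCopy.
import shutil
from pathlib import Path

def mirror(source, destination):
    source, destination = Path(source), Path(destination)
    # Copy files that are new or have a newer modification time than the backup copy.
    for src in (p for p in source.rglob("*") if p.is_file()):
        dst = destination / src.relative_to(source)
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
    # Delete backup files that no longer exist on the source.
    for dst in (p for p in destination.rglob("*") if p.is_file()):
        if not (source / dst.relative_to(destination)).exists():
            dst.unlink()

mirror("D:/Mirror", "E:/Mirror")  # hypothetical source and backup paths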

Benefits provided by this approach:

  • All the data I care about is stored in two independent locations, so there's no single point of failure. (Duh, that's why it's a backup.)
  • Hard drive media doesn't suffer from the same "bit rot" problems that can render writable CDs/DVDs unreadable after just a couple of years.
  • The backup drive is completely separate from the primary drive, so if I ever make a mistake and delete something important, I can easily recover it from the backup. (Some RAID-based solutions immediately mirror all changes and therefore don't have this benefit.) Similarly, a destructive virus on my main machine can't immediately destroy all copies of any data.
  • I look over the list of changes whenever I perform the mirroring to the external drive, so I have an additional opportunity to catch accidental deletions, mysterious changes, etc.
  • I have immediate access to all of my data from any machine in my home. If I decide to look at old photos, I can access them just as easily as the photos I took yesterday.
  • All family members store their data under the Mirror directory (via appropriately ACLed shares), so everybody's data is automatically backed up.
  • In the event of a slow-moving catastrophe (ex: a flood) I can easily grab the external backup drive and take it with me wherever I go. All data will be accessible from any other Windows computer in the world.
  • The overall cost was minimal to set up (~$100) and should be minimal to maintain (~$100/year).
  • Data is separate from applications, so I can reinstall or upgrade the operating system whenever I want without worrying about the data itself.
  • User accounts have limited privileges and are therefore less likely to accidentally compromise the machine when reading email or surfing the web.
  • The MD5 hashes mean that it's easy to verify the contents of my backup drive and that I'll be able to detect data corruption problems if they ever happen.
  • The backup drive is ACLed so that I can't accidentally delete data on it.

Problems this approach does not solve:

  • Both drives are at the same physical location, so all data can be lost in the event of a sudden catastrophe (ex: fire, earthquake). Possible mitigation: Set up a third external drive (after the first upgrade) and keep that drive somewhere far away. It may not be big enough to hold everything, but I'm happy to exclude music from the offsite backup. Drawback: Inconvenience of updating the offsite drive.
  • "Old data" is lost quickly. For example: if I accidentally delete an important file, I need to detect that mistake at the time of the next mirroring or else that file is gone for good. Possible mitigation: Multiple backup drives at staged intervals (ex: 1 week, 1 month, 3 months). Drawback: Cost.
  • A thief who steals the computer or external drive might have access to personal data. Possible mitigation: Encryption. Drawback: Inconvenience of decrypting files to use them and/or backing up EFS keys.
  • This solution may not scale well if my data storage needs increase faster than storage technology does. Possible mitigation: Move to a different backup strategy. Drawback: That strategy will have its own problems.

I think this overview touches on pretty much all of the key points of this strategy. It's obviously not a perfect solution, but it meets most of my requirements and I'm pretty happy with how it's been working out so far. However, I'm always open to improvements - if you have any suggestions, I'd love to hear them!

Tags: Technical

"If you want to see the sunshine, you have to weather the storm." [A simple AppleScript script to backup Notes in iCloud]

As someone who likes to make backups of all my data, storing things "in the cloud" is a little concerning because I have no idea how well that data is backed up. So I try to periodically save copies of cloud data locally just to be safe. This can be easy or hard depending on how that data is represented and how amenable the provider is to supporting this workflow. In the case of Apple Notes, it's easy to find questionable or for-pay solutions, but there's a lot out there I don't really trust. You might think Apple's own suggestion would be the best, but you might be wrong because it's rather impractical for someone with more than a handful of notes.

As a new macOS user, I'm still discovering a lot about the platform, but something that seems well suited to this problem is AppleScript and Script Editor, both of which are part of the OS and enable the creation of scripts that interact with applications and data. This Gist by James Thigpen proves the concept of using AppleScript to back up Notes, but of course I wanted to do some things a little differently, so I made my own script.

Notes:

  • This script enumerates all notes and appends them to a single HTML document that it puts on the clipboard for you to paste into the editor of your choice and save somewhere safe, email to yourself, or whatever.
  • The script does very little formatting other than separating notes from each other with a horizontal rule and adding a heading for each folder name. The output should be legible in text form and look reasonable in a web browser, but it won't win any design awards.
  • The contents of each note are apparently stored by Notes as HTML; this script adds only the few additional tags necessary for everything to render properly in a browser.
  • If your OS language is set to something other than English, you may need to customize the hardcoded folder name, "Recently Deleted". (Or remove that conditional and you can back up deleted notes, too!)
  • This was my first experience with AppleScript and I'm keeping an open mind, but I will say that I did not immediately fall in love with it.

-- Build one HTML document containing every note, then put it on the clipboard
set result to "<html><head><meta charset='utf-8'/></head><body>" & linefeed & linefeed
tell application "Notes"
  tell account "iCloud"
    repeat with myFolder in folders
      -- Skip deleted notes; this folder name is localized, so adjust it for non-English systems
      if name of myFolder is not in ("Recently Deleted") then
        -- Add a heading for the folder, then each note's (HTML) body separated by horizontal rules
        set result to result & "<h1>" & name of myFolder & "</h1>"
        set result to result & "<hr/>" & linefeed & linefeed
        repeat with myNote in notes in myFolder
          set result to result & body of myNote
          set result to result & linefeed & "<hr/>" & linefeed & linefeed
        end repeat
      end if
    end repeat
  end tell
end tell
set result to result & "</body></html>"
set the clipboard to result

"You're gonna need a bigger boat." [A brief look at data storage requirements in today's world]

I've previously blogged about my data storage/backup strategy. Briefly, I've got one big drive in my home server that stores all the data my family cares about: mostly music, pictures, and videos (with a little bit of other stuff for good measure). To protect the data, I've got another equally big external drive that I connect occasionally and use for backups by simply mirroring the content of the internal drive.

As things stand today, the internal drive is 320GB and the external drive is 300GB, but I've hit the wall and am almost out of space to add new files. Looking at hard drive prices these days, the sweet spot (measured in $/GB) seems to be with 500GB drives at about $140 (PATA or SATA). Any smaller than that and the delta from 300GB isn't enough to be interesting - any larger than that and the cost really goes up.

I was already prepared to buy a new drive every year or so to allow for growth, so I was curious if getting a 500GB drive now would do the trick. I wrote a quick program to look at every file I back up and tally up the size according to the date the file was created. The C# program walks the whole directory tree, sums the sizes by date, and writes out a simple CSV file with the results. The idea here is to chart the rate at which I'm adding data in order to predict when I'd run out of space next. (Yes, it's easy to come up with more sophisticated heuristics, but this is really just a back-of-the-envelope calculation and doesn't need to be perfect to be meaningful.)
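
The program itself is trivial; as an illustration only (not the original C# source, and with made-up paths), the same tally could be done with a short Python script:

# Illustrative sketch: walk a tree, group file sizes by creation date, write a cumulative-size CSV.
import csv
from collections import defaultdict
from datetime import date
from pathlib import Path

def tally(root, out_csv):
    sizes = defaultdict(int)
    for file in (p for p in Path(root).rglob("*") if p.is_file()):
        stat = file.stat()
        # st_ctime is the creation time on Windows (metadata-change time on most other systems).
        sizes[date.fromtimestamp(stat.st_ctime)] += stat.st_size
    total = 0
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "cumulative_gb"])
        for day in sorted(sizes):
            total += sizes[day]
            writer.writerow([day.isoformat(), round(total / 1024**3, 2)])

tally("D:/Mirror", "sizes.csv")  # hypothetical data root and output file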

Last night I opened the CSV file in Excel and charted the data. The resulting chart looks like this:

[Chart: Data Storage Space (GB) over time]

The blue line represents the cumulative size of the data I had at each point in time (horizontal axis) measured in GB (vertical axis) - you can see that I'm just above 300GB today. The red line is Excel's exponential trend line for the same data - it matches the blue line almost perfectly, so it seems pretty safe to say that my data storage needs are increasing exponentially. I was kind of afraid of that, because it means the 500GB drives I've been considering are likely to fill up within the next 8 months!

Clearly, I need to be prepared to spend more on hard drives than I'd initially planned to - or else I'm going to need to significantly change how I do things. I've got some ideas I'm still considering, but charting this data was a good wake-up call that drive capacity isn't increasing as rapidly as I might like. :)

I think that data storage and backup are issues that will affect all of us pretty soon (if they're not already). Backing up to DVDs doesn't scale well once you need more than 10 or so DVDs, and backing up over the network just doesn't seem practical when you're talking about numbers this large. Even ignoring the need to back up, simply storing all the data you have is rapidly becoming an issue. With downloadable HD movie/TV content becoming popular, high megapixel still/video cameras being commonplace, and fast Internet connections becoming the norm, it seems to me that content is outpacing storage right now.

Here's hoping for a quantum leap in storage technology!

Updated on 2007/03/14: I've just posted the source code for the program I wrote to gather this data.

Tags: Technical