Exponential backup rotation, part 2

27.06.2015 17:45

I've been writing previously about Chris Forno's log2rotate algorithm. After using my Python implementation for a while, one problem became apparent: the algorithm is quite bad at handling missing backups.

Missing backups happen in practice. For instance, a host might be down or without network connectivity at the time the daily backup Cron job is supposed to run. So every once in a while, a daily backup might be skipped. Unfortunately, log2rotate makes a strong assumption that new backups always come in a regular time series. The worst case happens if a backup that is selected by the algorithm for long-term storage is missing. This can double the age of restored files after a data loss.

Chris' original implementation does a simple check: if an expected backup is not there, it simply throws a fatal error. You can ignore the error with the --unsafe switch, but then it makes no promises as to what the resulting backup series will be.

My initial Python port of the algorithm basically mimicked this behavior (although with a bit smarter check that avoided some false alarms). Now, I've added another step in the algorithm that selects which backups to keep. I still calculate the pattern of retained backups using Chris' algorithm. However after that, I do a search for a backup that actually exists, within some limited time interval.

Most recent backup age versus time of data loss.

The graph above demonstrates this improvement. It shows the age of the most recent backup in a case that data loss occurred at some past point in time. On the horizontal axis, present day is day 100 and day 0 is the day the oldest backup was made.

The black dashed line shows the ideal case - using log2rotate algorithm and no missing backups. For example, if we wanted to restore a file that we know was lost on day 60 (40 days ago), the age of the most recent retained backup that still contains the file (created on day 32) is 28 days.

The blue line shows the worst case for a single missing backup using the original algorithm, but overriding errors with --unsafe. For restoring a file that was lost on day 60, the worst case is if backup on day 32 is missing. In that case, the next most recent backup is from day 0, which is 60 days old.

Finally, the red line shows the situation with the fuzzy matching algorithm. In this case, if backup on day 32 is missing, the algorithm retains the backup on day 31. So when restoring data lost on day 60, the most recent backup is only one day older compared to when no backups are missing.

The updated code is on GitHub. Documentation is somewhat lacking at the moment, but the algorithm is pretty well tested (backups are always a bit of a sensitive thing). If I see some interest, I might clean it up and push it to PyPi.

Posted by Tomaž | Categories: Code

Add a new comment

(No HTML tags allowed. Separate paragraphs with a blank line.)