Long-term AFS Archives with bup

Overview

On Chicago, we run a program called bup for all our long-term archives. It uses git as a content-addressable block store, allowing for efficient storage of slowly-changing contents like homedirs and service backups. Or at least, that’s the theory.

In any case, the basic automation pipeline is something like this:

  • On a relatively frequent schedule, AFS volumes are released to Chicago.
  • On a less frequent schedule, these volumes are dumped (with vos dump), processed, and archived using bup split.

The automation is overseen by AFS BOS.
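
As a rough sketch of the second step, the dump-and-archive stage could look like the function below. This is not the contents of dump-to-bup.sh: the processing between the dump and bup split is left out, and the function name, its arguments, and the choice of the volume name as the default bup branch name are all assumptions made for illustration.

```shell
# Hypothetical helper, not the real dump-to-bup.sh.
# Streams a dump of one volume straight into the bup store.
dump_volume_to_bup() {
    vol=$1                  # volume name or numeric ID
    name=${2:-$vol}         # bup branch name; defaulting to the volume name is an assumption
    # With no -file argument, vos dump writes the dump to standard output.
    vos dump -id "$vol" | bup -d "${BUP_DIR:-/z/bup}" split -n "$name"
}
```

Called as dump_volume_to_bup user.alice, this would leave the dump on a branch named user.alice in the store; the real pipeline may name its branches differently.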

So where are the backups?

The bup store is in /z/bup/ on Chicago. The ${BUP_DIR} environment variable is set to this path in the few scripts on Chicago that need it.

Looking at or restoring an archive file

Because bup uses git, git’s tools may be used to explore a little bit.

  • git branch -a will show you the names of all the archives stored internally to the bup store.
  • git log will show you the timestamps at which archives were taken.
  • git’s so-called “approxidate” framework can be of some assistance; git log --since and --until filter on the date stamps of the commits and can help narrow your search. The syntax for dated refs, ${BRANCH}@{${DATE_STRING}}, also works, but note that it consults the reflog timestamp rather than the commit timestamp. Many early snapshots are therefore resolved incorrectly.
  • git verify-pack and git fsck can be used to sanity-check the backing store, if that is ever needed. We should also try to be good about keeping copies of the older packs offline as well as online.
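
Since dated refs go by the reflog, one way to pick a snapshot by commit time is to filter the "hash time" output of git log directly. The helper below is a minimal sketch of that idea; its name is hypothetical, and it assumes input in the form printed by git log --pretty=tformat:'%H %ct'.

```shell
# Hypothetical helper: read "<hash> <commit-epoch>" lines on stdin and print
# the hash of the newest commit made at or before the cutoff, which is given
# as seconds since the epoch. Exits non-zero if no commit qualifies.
snapshot_at_or_before() {
    awk -v cutoff="$1" '
        ($2 + 0) <= (cutoff + 0) && (best == "" || ($2 + 0) > best_time) {
            best = $1
            best_time = $2 + 0
        }
        END { if (best != "") print best; else exit 1 }
    '
}
```

With GNU date, something like the following would then name the snapshot to hand to bup join:

  GIT_DIR=/z/bup git log --pretty=tformat:'%H %ct' ${BRANCH} | \
    snapshot_at_or_before "$(date -d '2014-03-01' +%s)"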

Ultimately, though, what you presumably care about is the output of a bup join command.

Restoring from archive without nuking an existing volume

If a user has asked for an emergency restore but does not want their home directory clobbered, consider creating, mounting, and restoring a new volume for them. Something like

vos create chicago.acm.jhu.edu viceps recover.$USER
fs mkm ~$USER/acmsys/recover recover.$USER.readonly
bup -d ${BUP_DIR} join ${GIT_REF_NAME_OR_HASH} | \
  vos restore chicago.acm.jhu.edu viceps recover.$USER -readonly

And then vos remove the volume when the user has gotten their files back.

Inserting a dump into the archive

file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/dump-to-bup.sh (on Chicago, ~root/bin/dump-to-bup.sh) knows how to push a volume into the archive. The simplest thing, if you want to run it by hand, is to run:

echo ${VOLUMENAME} ${VOLUMENAME} | dump-to-bup.sh

Yes, the name should be given twice: dump-to-bup expects to be reading the output of vos listvol, where the second column is normally an integer volume ID. If the second column is not an integer ID, the name is resolved instead.
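
If you have several volumes to push by hand, the doubled-name input can be generated rather than typed. The helper below is a small sketch with a hypothetical name; that dump-to-bup.sh accepts more than one line at a time is an assumption, inferred only from its expecting vos listvol output.

```shell
# Hypothetical helper: turn a list of volume names (one per line on stdin)
# into two-column lines, repeating the name in the ID column.
as_listvol_lines() {
    while read -r vol; do
        [ -n "$vol" ] || continue       # skip blank lines
        printf '%s %s\n' "$vol" "$vol"
    done
}
```

Usage would then be along the lines of as_listvol_lines < names.txt | dump-to-bup.sh, where names.txt is a file you have prepared yourself.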

file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/dump-all-to-bup.sh (on Chicago, ~root/bin/dump-all-to-bup.sh) knows how to loop over the entire archive partition and invoke dump-to-bup appropriately.

Repacking Bup Packs

At least nine out of ten of the packs we write are very, very tiny: users’ home directories do not change that quickly. Left to its own devices, bup will therefore fill up the ${BUP_DIR}/objects/pack directory with many, many tiny files, which is, of course, detrimental to performance. On the other hand, full repacks of huge repositories take forever, so we compromise by repacking all “small” pack files together at the end of every night’s dump. Since we have on the order of thousands of volumes, this will not create a huge number of files that need to be dealt with. See file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/bup-tools.sh for details.

Eventually, that approach might be a little heavy-handed (asking bup to rebuild all its data structures), but for the moment, those steps are entirely dominated by the git repack itself.
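
One way to get a "small packs only" repack out of stock git is to mark the big packs with .keep files, which git repack leaves alone by default. The sketch below illustrates that technique; it is not taken from bup-tools.sh, and the function name and the size threshold are hypothetical.

```shell
# Hypothetical helper, not taken from bup-tools.sh.
# Mark every pack of at least $2 bytes in directory $1 with a .keep file,
# so that a later "git repack -a -d" merges only the smaller packs.
keep_big_packs() {
    dir=$1
    limit=$2
    for pack in "$dir"/*.pack; do
        [ -f "$pack" ] || continue          # the glob matched nothing
        size=$(($(wc -c < "$pack")))        # arithmetic strips any padding from wc
        if [ "$size" -ge "$limit" ]; then
            touch "${pack%.pack}.keep"
        fi
    done
}
```

A nightly run might then look like this, with the 256 MiB cutoff chosen purely as an example:

  keep_big_packs "${BUP_DIR}/objects/pack" $((256 * 1024 * 1024))
  git --git-dir="${BUP_DIR}" repack -a -d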

Note

We have also found that disabling git’s delta coding saves a huge amount of time. It makes our archives a little bigger than they might otherwise be, but it means that our repacks do not take hours on end. This is accomplished by:

git config --local pack.window 0
git config --local pack.depth  1

Mirroring Some Or All Of The Bup Archive

Of course, one answer is just to rsync the entirety of the bup archive somewhere else. Our repacking game above means there will be a little slop, but hopefully not too much – once things are committed to “big” packs, they won’t ever move again.

You can also use git and maintain your own local pack structure. If you want to read the archives through bup after doing this, you may need to have bup recreate its midx and bloom files, but that’s straightforward.

Creating a git repository and creating a remote section with something like

[remote "chicago"]
      url = root@chicago.acm.jhu.edu:/z/bup
      fetch = +refs/heads/root/*:refs/remotes/chicago/root/*
      fetch = +refs/heads/service/*:refs/remotes/chicago/service/*
      fetch = +refs/heads/group/readonly:refs/remotes/chicago/group/readonly
      fetch = +refs/heads/group/acm-museum.readonly:refs/remotes/chicago/group/acm-museum.readonly
      fetch = +refs/heads/group/admins.readonly:refs/remotes/chicago/group/admins.readonly
      fetch = +refs/heads/group/admins.pub.readonly:refs/remotes/chicago/group/admins.pub.readonly
      fetch = +refs/heads/group/officers.readonly:refs/remotes/chicago/group/officers.readonly
      fetch = +refs/heads/group/officers.pub.readonly:refs/remotes/chicago/group/officers.pub.readonly
      fetch = +refs/heads/mirror/readonly:refs/remotes/chicago/mirror/readonly
      fetch = +refs/heads/user/readonly:refs/remotes/chicago/user/readonly

will allow you to selectively archive parts of the system. Isn’t that neat? You can always add another fetch = line and run git fetch chicago to bring more things over.
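
Adding a fetch line can also be done from the command line instead of editing the config by hand. The helper below is a small sketch, with a hypothetical name, that builds a refspec in the same shape as the lines above.

```shell
# Hypothetical helper: build a fetch refspec for one branch name or pattern,
# mapping refs/heads/<name> on chicago to refs/remotes/chicago/<name> locally.
chicago_fetch_spec() {
    printf '+refs/heads/%s:refs/remotes/chicago/%s\n' "$1" "$1"
}
```

For example, from inside the mirror repository:

  git config --add remote.chicago.fetch "$(chicago_fetch_spec 'group/officers.readonly')"
  git fetch chicago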

Extracting Every Revision Of A Volume

This is basically a loop around the restore procedure above. You can extract the hash and time of every revision of a branch with:

GIT_DIR=/z/bup git log --pretty=tformat:'%H %ct' ${BRANCH} > hashtimes

for example. Then maybe something like

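# Scratch volume, mounted at ./v
vos create chicago.acm.jhu.edu viceps ${TMP_VOL_NAME}
fs mkm v ${TMP_VOL_NAME}
exec 3<hashtimes
while read -r -u 3 hash time; do
  echo "$time"
  # The store path is spelled out: GIT_DIR is not set in this shell
  bup -d /z/bup join "$hash" | \
     vos restore chicago.acm.jhu.edu viceps ${TMP_VOL_NAME} -overwrite full
  # Keep a copy of this revision in a directory named after its commit time
  cp -a v "$time"
done
exec 3<&-
fs rmm v
rm hashtimes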