Long-term AFS Archives with bup¶
Overview¶
On Chicago, we run a program called bup for all our long-term archives. It uses git as a content-addressable block store, allowing for efficient storage of slowly-changing contents like homedirs and service backups. Or at least, that’s the theory.
In any case, the basic automation pipeline is something like this:
- On a relatively frequent schedule, AFS volumes are released to Chicago.
- On a less frequent schedule, these volumes are dumped (with vos dump), processed, and archived using bup split.
The automation is overseen by AFS BOS.
So where are the backups?¶
The bup store is in /z/bup/ on chicago. The ${BUP_DIR} environment variable is set to this in a few scripts on chicago that need it.
Looking at or restoring an archive file¶
Because bup uses git, git’s tools may be used to explore a little bit.
- git branch -a will show you the names of all the archives stored internally to the bup store.
- git log will show you the timestamps at which archives were taken.
- git’s so-called “approxidate” framework can be of some assistance; git log --since and --until will look at the date stamps of the commits and can help narrow your search. While the syntax for dated refs, ${BRANCH}@{${DATE_STRING}}, is present, note that it uses the reflog timestamp rather than the commit timestamp. Many early snapshots therefore resolve to the wrong date when addressed this way.
- git verify-pack and git fsck can be used to sanity-check the backing store, if that is ever needed. We should try to be good about keeping the older packs off-line as well as on-line.
However, ultimately, what you care about, I assume, is the output of a
bup join command.
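As a rough sketch of the exploration steps above (not the site's own tooling), the two helpers below wrap git log with a date window and bup join. The function names list_snapshots and extract_dump are made up for illustration, and the BUP_DIR default assumes the /z/bup path mentioned earlier.

```shell
# Sketch only: fall back to the store path named above if BUP_DIR is unset.
BUP_DIR="${BUP_DIR:-/z/bup}"

# list_snapshots BRANCH SINCE UNTIL
# Print "<commit hash> <commit date>" for each snapshot of BRANCH whose
# commit timestamp falls inside the window, newest first.
list_snapshots() {
    git --git-dir="$BUP_DIR" log --since="$2" --until="$3" \
        --pretty=tformat:'%H %ci' "$1"
}

# extract_dump REF OUTFILE
# Write the archived data stored under REF (a branch name or commit hash)
# to OUTFILE via bup join.
extract_dump() {
    bup -d "$BUP_DIR" join "$1" > "$2"
}
```

Something like list_snapshots "${BRANCH}" '2 weeks ago' 'yesterday' would then narrow the search, and extract_dump with one of the printed hashes would write that snapshot to a file of your choosing.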
Restoring from archive without nuking an existing volume¶
If a user has asked for an emergency restore but does not want their home directory clobbered, consider creating, mounting, and restoring a new volume for them. Something like
vos create chicago.acm.jhu.edu viceps recover.$USER
fs mkm ~$USER/acmsys/recover recover.$USER.readonly
bup -d ${BUP_DIR} join ${GIT_REF_NAME_OR_HASH} | \
  vos restore chicago.acm.jhu.edu viceps recover.$USER -readonly
And then vos remove the volume (and fs rmm its mount point) when the user has
gotten their files back.
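The cleanup step might look something like the sketch below. The function name cleanup_recovery is hypothetical, the mount point is passed in explicitly because ~$USER does not expand inside a function, and the server, partition, and recover.$USER names are simply copied from the commands above.

```shell
# cleanup_recovery USER MOUNTPOINT
# Remove the temporary mount point first, then the scratch volume itself.
# Assumes the volume was created as recover.USER on chicago, as above.
cleanup_recovery() {
    fs rmm "$2" &&
    vos remove chicago.acm.jhu.edu viceps "recover.$1"
}
```

It would be called with the user's name and the full path of the recover mount point that was created with fs mkm.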
Inserting a dump into the archive¶
file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/dump-to-bup.sh (on
Chicago, ~root/bin/dump-to-bup.sh) knows how to push a volume into the archive.
The simplest thing, if you want to run it by hand, is to run:
echo ${VOLUMENAME} ${VOLUMENAME} | dump-to-bup.sh
Yes, the name should be given twice: dump-to-bup expects to be reading
from the output of vos listvol, but the name will be resolved if an integer
ID is not provided in the second column.
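If several volumes need to be pushed by hand, a small helper can generate the doubled-name lines. This is only a sketch: volume_lines is a made-up name, and it assumes dump-to-bup.sh will accept more than one line on standard input, which seems likely given that it is written to read vos listvol output but is not confirmed here.

```shell
# volume_lines NAME...
# Print one "NAME NAME" line per volume name, the two-column shape that
# dump-to-bup.sh is described as reading.
volume_lines() {
    for vol in "$@"; do
        printf '%s %s\n' "$vol" "$vol"
    done
}
```

The output would then be piped into the script, for example volume_lines ${VOLUMENAME} | dump-to-bup.sh.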
file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/dump-all-to-bup.sh (on
Chicago, ~root/bin/dump-all-to-bup.sh) knows how to loop over the entire
archive partition and invoke dump-to-bup appropriately.
Repacking Bup Packs¶
Nine out of ten, or more, of the packs we write are very, very tiny: users’ home
directories do not change that quickly.  As such, left to its own devices, bup
will fill up the ${BUP_DIR}/objects/pack directory with many, many tiny
files, which is, of course, detrimental to performance.  On the other hand,
full repacks of huge repositories take forever, so…  we compromise by
repacking all “small” pack files together at the end of every night’s dump.
Since we have on the order of thousands of volumes, this will not create a huge
number of files that need to be dealt with.  See
file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/bup-tools.sh for
details.
Eventually, that approach might be a little heavy-handed (asking bup to rebuild all its data structures), but for the moment, those steps are entirely dominated by the git repack itself.
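To get a feel for how many “small” packs are lying around, something like the following sketch can list them. The function name small_packs and the byte threshold are assumptions for illustration; the real cutoff and the repacking logic live in bup-tools.sh.

```shell
# small_packs STORE MAX_BYTES
# Print every pack file under STORE/objects/pack that is smaller than
# MAX_BYTES bytes. The "c" suffix makes find compare sizes in bytes.
small_packs() {
    find "$1/objects/pack" -type f -name '*.pack' -size "-${2}c"
}
```

For instance, small_packs ${BUP_DIR} 1048576 | wc -l counts the packs under one mebibyte.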
Note
We have also found that disabling git’s delta coding saves a great deal of time. It makes our archives a little bigger than they might otherwise be, but it means that our repacks do not take hours on end. This is accomplished by:
git config --local pack.window 0
git config --local pack.depth  1
Mirroring Some Or All Of The Bup Archive¶
Of course, one answer is just to rsync the entirety of the bup archive somewhere else. Our repacking game above means there will be a little slop, but hopefully not too much – once things are committed to “big” packs, they won’t ever move again.
You can also use git and maintain your own local pack structure. To read the archives through bup after doing this, you may need to have bup recreate its midx and bloom files, but that’s straightforward.
Creating a git repository and creating a remote section with something like
[remote "chicago"]
      url = root@chicago.acm.jhu.edu:/z/bup
      fetch = +refs/heads/root/*:refs/remotes/chicago/root/*
      fetch = +refs/heads/service/*:refs/remotes/chicago/service/*
      fetch = +refs/heads/group/readonly:refs/remotes/chicago/group/readonly
      fetch = +refs/heads/group/acm-museum.readonly:refs/remotes/chicago/group/acm-museum.readonly
      fetch = +refs/heads/group/admins.readonly:refs/remotes/chicago/group/admins.readonly
      fetch = +refs/heads/group/admins.pub.readonly:refs/remotes/chicago/group/admins.pub.readonly
      fetch = +refs/heads/group/officers.readonly:refs/remotes/chicago/group/officers.readonly
      fetch = +refs/heads/group/officers.pub.readonly:refs/remotes/chicago/group/officers.pub.readonly
      fetch = +refs/heads/mirror/readonly:refs/remotes/chicago/mirror/readonly
      fetch = +refs/heads/user/readonly:refs/remotes/chicago/user/readonly
will allow you to selectively archive parts of the system.  Isn’t that neat?
You can always add another fetch = line and run git fetch chicago to
bring more things over.
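Rather than editing the config file by hand, the extra fetch line can be added with git config. The sketch below assumes it is run inside the mirror repository and that the remote is named chicago as in the section above; add_chicago_fetch is a made-up helper name.

```shell
# add_chicago_fetch BRANCH
# Append a fetch refspec for BRANCH to the "chicago" remote, using the same
# refs/remotes/chicago/ layout as the lines above, then fetch it.
add_chicago_fetch() {
    git config --add remote.chicago.fetch \
        "+refs/heads/$1:refs/remotes/chicago/$1" &&
    git fetch chicago
}
```

Using --add matters here: without it, git config would try to replace the existing value rather than append another one, and it refuses to do that when the key already has several values.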
Extracting Every Revision Of A Volume¶
Basically a loop around the restoration game above. You can extract the hash and time of every revision of a branch with:
GIT_DIR=/z/bup git log --pretty=tformat:'%H %ct' ${BRANCH} > hashtimes
for example. Then maybe something like
vos create chicago.acm.jhu.edu viceps ${TMP_VOL_NAME}
fs mkm v ${TMP_VOL_NAME}
exec 3<hashtimes
while read -u 3 hash time; do \
  echo $time;
  # GIT_DIR was only set for the git log command above, so name the store here
  bup -d /z/bup join $hash | \
     vos restore chicago.acm.jhu.edu viceps ${TMP_VOL_NAME} -overwrite full
  cp -a v $time;
done
exec 3<&-
fs rmm v
rm hashtimes