Creating Backups for ACM Services

The ACM systems are remarkably complex, with lots of moving parts everywhere. It is basically hopeless to stand the whole thing up again without guidance, which is tragic, because in my (nwf) tenure here I have seen essentially three complete reconstructions of the ACM computing environment. Every time, services get lost because the institutional memory has graduated and gone off to the Real World (TM).

So: the philosophy of the current round of systems architects, I think, is that we are aiming for quick restore, rather than quick de novo deployment. Part of that effort is this very collection of documents that you’re now reading, and part of it is system-wide autonomic backups of system configuration files.

Many of the more interesting machines back themselves up into AFS; run ls /afs/acm.jhu.edu/service/*/backup.sh with your admin hat on to get an idea of who’s playing this game right now. Typically, those scripts are simple rsyncs into /afs/acm.jhu.edu/service/.../snapshot and invoked by the host’s root crontab with timelimit and k5start. All of those volumes should be replicated to archival partitions and then stashed into the archive. (See Long-term AFS Archives with bup.) So even if the world burns, most of the configuration, even historical configuration, can be found in the archives.
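
To see at a glance who is playing and whether a snapshot actually sits next to each script, the ls above can be stretched into a small loop. This is only a sketch: the function name list_backup_services is made up for illustration and is not an existing ACM script, and it assumes each service keeps backup.sh and snapshot side by side as described above.

```shell
#!/bin/sh
# Hypothetical helper (not an existing ACM script): list every service
# directory under a root that has a backup.sh, and report whether a
# snapshot directory sits next to it.
# usage: list_backup_services [SERVICE_ROOT]
list_backup_services() {
    root=${1:-/afs/acm.jhu.edu/service}
    for script in "$root"/*/backup.sh; do
        # With no matches the glob stays literal; skip that case.
        [ -f "$script" ] || continue
        dir=${script%/backup.sh}
        if [ -d "$dir/snapshot" ]; then
            echo "${dir##*/} snapshot-present"
        else
            echo "${dir##*/} snapshot-missing"
        fi
    done
}
```

Called with no argument (and with your admin hat on), it looks under /afs/acm.jhu.edu/service; pass another directory to try it somewhere harmless first.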

Not everything lives in the filesystem (which is, of course, a bug in its own right, but that’s another battle). Our databases, for example, are dumped nightly to files; those files are then replicated using AFS and, in turn, stashed into the archive. It’s a long path, but things get there!
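
This page does not say which dump tool each database uses, so what follows is only a sketch of the file-handling half of a nightly dump: run some dump command, write its output to a temporary file, and rename that into place only if the command succeeded, so a later rsync is unlikely to pick up a half-written dump. The function name dump_to_file and the paths in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a dump command and store its standard output
# at DEST, replacing the previous dump only when the command succeeds.
# usage: dump_to_file DEST COMMAND [ARGS...]
dump_to_file() {
    dest=$1
    shift
    partial="$dest.tmp.$$"
    if "$@" > "$partial"; then
        # Rename within the same directory, so the old dump stays in
        # place until the new one is complete.
        mv -f "$partial" "$dest"
    else
        status=$?
        # Leave the previous dump untouched and clean up the partial file.
        rm -f "$partial"
        return "$status"
    fi
}
```

For a PostgreSQL database (an assumption; substitute whatever your service runs), a nightly job might call dump_to_file /var/backups/mydb.sql pg_dump mydb and then list that file in the service's backup.sh.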

Warning

PLEASE, PLEASE, PLEASE, if you have stood up a complex service, play this game! Later sysadmins will love you for it.

Roughly, the steps necessary, while wearing your admin hat, are to:

  • Generate a host keytab for your machine if it doesn’t have one already, and place it in your machine’s /etc/krb5.keytab.

  • Run /afs/acm.jhu.edu/group/admins.pub/scripts/new-afs-service-volume ${YOUR_SERVICE} for your machine or service. This script will automagically handle preparing the volume for regular replication by the archive machine’s automation!

  • Grant your host’s principal rlidw (read, lookup, insert, delete, write) rights on the service directory:

    fs sa /afs/acm.jhu.edu/service/${YOUR_SERVICE} rcmd.${YOUR_HOST} rlidw
    
  • Optionally, adjust the quota of your service volume.

  • Create a backup.sh file along these lines:

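    #!/bin/sh
    # Mirror the files worth keeping into this service's AFS snapshot.
    #   -rl         recurse into directories and copy symlinks as symlinks
    #   -c          compare files by checksum rather than time and size
    #   --delete    remove snapshot files that are gone from a copied directory
    #   --relative  keep the full source path (etc/..., var/...) under snapshot/
    
    rsync -rl -c --delete --relative -vv \
            \
            /etc/${IMPORTANT_FILE}     \
            /var/${OTHER_IMPORTANT_DIRECTORY} \
            \
            /afs/acm.jhu.edu/service/${YOUR_SERVICE}/snapshot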
    
  • Add the backup to cron’s automation: as root, run “crontab -e” and insert the following as a single line:

    @daily timelimit -q k5start -f /etc/krb5.keytab -U -t -- /afs/acm.jhu.edu/service/${YOUR_SERVICE}/backup.sh

Warning

However the crontab entry above happens to be wrapped on your screen, crontab does not allow a command to be split across lines. The whole entry must appear on one line!
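
One way to avoid accidentally splitting the entry is to generate it rather than paste it. The following is a minimal sketch: the function name backup_cron_line is hypothetical, the command it prints is the one from the step list above, and the restriction on service names is an assumption made here so that no whitespace or newline can sneak into the line.

```shell
#!/bin/sh
# Hypothetical helper: print the crontab entry for a service as exactly
# one line, using the command from the step list above.
# usage: backup_cron_line SERVICE
backup_cron_line() {
    service=$1
    # Assumption for this sketch: service names contain only letters,
    # digits, dots, underscores and hyphens. Anything else is refused.
    case $service in
        ''|*[!A-Za-z0-9._-]*)
            echo "backup_cron_line: bad service name" >&2
            return 1
            ;;
    esac
    printf '%s\n' "@daily timelimit -q k5start -f /etc/krb5.keytab -U -t -- /afs/acm.jhu.edu/service/${service}/backup.sh"
}
```

The output can be pasted into “crontab -e” as-is. On cron implementations whose crontab accepts - for standard input, something like ( crontab -l 2>/dev/null; backup_cron_line "${YOUR_SERVICE}" ) | crontab - run as root should append it, but check what your host's crontab supports first.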

If you need help or inspiration, Chicago and Magellan are likely the two most elaborate setups to date.
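
Before asking for help, it can be worth ruling out the boring failures. The sketch below checks two things the steps above depend on: that the keytab is readable, and that backup.sh exists and is executable (cron runs it directly, so it needs its execute bit). The function name check_backup_setup is hypothetical, and the paths are parameters so it can be tried outside the real cell.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the steps above.
# usage: check_backup_setup SERVICE_DIR [KEYTAB]
# Prints one line per problem found; succeeds only if there are none.
check_backup_setup() {
    service_dir=$1
    keytab=${2:-/etc/krb5.keytab}
    problems=0
    if [ ! -r "$keytab" ]; then
        echo "missing or unreadable keytab: $keytab"
        problems=$((problems + 1))
    fi
    if [ ! -d "$service_dir" ]; then
        echo "no such service directory: $service_dir"
        problems=$((problems + 1))
    elif [ ! -x "$service_dir/backup.sh" ]; then
        echo "backup.sh missing or not executable in $service_dir"
        problems=$((problems + 1))
    fi
    [ "$problems" -eq 0 ]
}
```

Run as root on the host in question, for example check_backup_setup /afs/acm.jhu.edu/service/${YOUR_SERVICE}. Reading the AFS directory needs tokens, so outside of k5start the second check may fail for reasons that have nothing to do with your script.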