Random Simple Things that have Worked in the Past and are Likely to Work Again

A Sunfire Goes Down

As of 6/27/2016.

Note: one of the sunfires can’t boot with all of its disks in. If that sunfire fails, pull all the disks except the SSD (which is a pain to get back in) and the boot drives (drives 1 and 2), boot the machine, put the disks back in once it’s up, and then follow the instructions below.

  1. SSH into magellan and run bmc sunfire0-bmc chassis power cycle.
  2. SSH into the sunfire, probably through magellan. Run zfs mount -a if the zpools aren’t already mounted, and then start ceph (through SysV init or systemd, depending on the sunfire); see the sketch after this list.
  3. Monitor ceph health (on any ceph monitor: crimea, magellan, or gomes) to make sure ceph comes back up properly.
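
A rough sketch of the whole sequence (sunfire0 and sunfire0-bmc stand in for whichever sunfire went down; use the ceph start command that matches that machine’s init system):

    # From magellan: power cycle the sunfire through its BMC.
    bmc sunfire0-bmc chassis power cycle

    # Once it's back, SSH in (through magellan) and mount the zpools if needed.
    ssh sunfire0
    sudo zfs mount -a

    # Start ceph with whichever init system the sunfire uses.
    sudo systemctl start ceph          # systemd sunfires
    sudo /etc/init.d/ceph start        # SysV init sunfires

    # From any ceph monitor (crimea, magellan, or gomes), watch recovery.
    ceph health
    ceph -w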

The Website is Reachable, but everything 403s or 404s

As of 7/6/2016.

Restart web.vm.
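
How exactly web.vm gets restarted depends on where it’s hosted. A minimal sketch, assuming it’s an OpenStack instance named web (the instance name is a guess; adjust to the real one):

    # Try a soft reboot first; fall back to a hard reboot if the guest is wedged.
    openstack server reboot web
    openstack server reboot --hard web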

Mail server fails IMAP requests

As of 7/21/2016.

Run sudo journalctl -u dovecot on crimea.acm.jhu.edu.

If it says a connection to acmsys/Maildir timed out, then there’s a problem with the AFS maildir servers on chicago.

First things first, check the ZFS status: run zpool status. If that reports a problem, debug the zpool.

To restart the maildir server, run /etc/init.d/openafs-fileserver restart. If it takes longer than ~10 minutes, something else is wrong; try restarting chicago.
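
A rough sketch of that sequence (the dovecot check runs on crimea; the rest presumably on chicago):

    # On crimea.acm.jhu.edu: look for IMAP timeouts against acmsys/Maildir.
    sudo journalctl -u dovecot

    # On chicago: make sure the zpool backing the maildirs is healthy.
    sudo zpool status

    # If ZFS looks fine, restart the AFS fileserver. If this takes more than
    # ~10 minutes, something else is wrong; try rebooting chicago instead.
    sudo /etc/init.d/openafs-fileserver restart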

Echidna’s AFS servers died

As of 9/19/2016.

Reboot echidna.

You can’t do ceph things with cinder (like create/delete volumes)

As of 9/25/2016.

Restart cinder-volume on gomes.
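
A minimal sketch, assuming gomes runs cinder-volume under systemd (the unit may be named openstack-cinder-volume on RPM-based installs):

    # On gomes: restart the cinder volume service and confirm it came back up.
    sudo systemctl restart cinder-volume
    sudo systemctl status cinder-volume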

Ceph won’t start on a sunfire due to permission errors

As of 2/28/2017.

Run chown -R ceph:ceph /var/run/ceph, then try again.

See http://tracker.ceph.com/issues/15553 for more info.
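
Put together, the fix looks roughly like this (the restart command depends on how ceph is managed on that sunfire):

    # Fix ownership of ceph's runtime directory, then restart ceph.
    sudo chown -R ceph:ceph /var/run/ceph
    sudo systemctl restart ceph        # or: sudo /etc/init.d/ceph restart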

A ceph mon is down after a restart

As of 3/11/2017.

Run systemctl restart ceph. The issue is that our ceph config is served out of AFS, so ceph has an implicit dependency on AFS that systemd doesn’t know about (this should be fixed at some point; a sketch of one way to encode the dependency is below). Anyway, by the time you SSH into the machine to manually restart ceph, openafs-client should be up, so simply restarting ceph should just work.
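
One way to make that dependency explicit would be a systemd drop-in on each machine that runs a ceph mon; a sketch, assuming the units are named ceph.service and openafs-client.service:

    # Tell systemd that ceph needs the AFS client (and thus the ceph config) first.
    sudo mkdir -p /etc/systemd/system/ceph.service.d
    printf '[Unit]\nAfter=openafs-client.service\nRequires=openafs-client.service\n' \
        | sudo tee /etc/systemd/system/ceph.service.d/afs-dependency.conf
    sudo systemctl daemon-reload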

OpenStack VMs won’t be deleted, and they just hang

As of 3/11/2017.

Reboot gomes.

You Can’t Delete OpenStack VMs (they’re stuck in the deleting state)

As of 4/12/2017.

SSH to the compute node that the instance was running on, and restart the nova compute daemon (nova-compute) on it.
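
A sketch of that, assuming admin credentials are loaded and the compute daemon’s unit is named nova-compute (it may be openstack-nova-compute on RPM-based installs):

    # Find which compute node the stuck instance lives on (run as an admin).
    openstack server show <instance-id> -c OS-EXT-SRV-ATTR:host

    # On that compute node: restart the nova compute daemon, then retry the delete.
    sudo systemctl restart nova-compute
    openstack server delete <instance-id>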