Ceph Storage System

The ACM uses Ceph to manage its collection of storage. Ceph itself has extensive documentation.

Configuration

The global parameters of our ceph cluster are available for perusal in file:///afs/acm.jhu.edu/readonly/group/admins.pub/ceph.conf . The various Ceph worker nodes should all have /etc/ceph/ceph.conf symlinked to that location. (As usual, be sure to use the RO mountpoint for the benefits of replication; see The Special Case of admins.pub for details.)

Note

This file should be kept up to date. Do note, however, that the setting mon_host gives the clients a list of possible mon locations; only one of them needs to be correct in order for an operation to succeed, since the mons maintain their own address lists in the monmap itself.
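
A quick sanity check on a worker node looks something like the following; the target is the RO AFS path given above, and the ln is only needed if the symlink has gone missing:

# Verify that /etc/ceph/ceph.conf points at the shared config in AFS,
# and (re)create the symlink if it does not.
ls -l /etc/ceph/ceph.conf
sudo ln -sfn /afs/acm.jhu.edu/readonly/group/admins.pub/ceph.conf /etc/ceph/ceph.conf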

Authentication

Ceph rolls its own authentication and authorization layers, which is a bit of a bummer, but it is what it is, I suppose. Thankfully, we don’t fiddle with this often. You can inspect the database with ceph auth list.

Warning

This list command displays secrets on the console, and there does not seem to be any alternative or option to suppress them!

Creating a New Ceph User

Run ceph auth add ${NAME}, optionally followed by repetitions of ${CAPTY} ${CAPVAL} pairs, e.g. mon "allow r" osd "allow class-read object_prefix rbd_children" or somesuch. ceph auth caps can be used to change a user's caps later.

The equivalent of extracting a keytab is done with ceph auth get-key.
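
A minimal sketch, using a hypothetical client.backup user and the example caps above (the pool name is just one used elsewhere on this page):

# Create a user with read access to the mons and the example OSD caps.
ceph auth add client.backup mon 'allow r' osd 'allow class-read object_prefix rbd_children'

# Adjust caps later; note that this replaces the full set of caps, so
# restate everything the user should have.
ceph auth caps client.backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes_2'

# The keytab-extraction equivalent: print just the secret key.
ceph auth get-key client.backup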

Maintenance Tasks

Getting Cluster Status

On magellan or the other ceph worker (mon or osd) nodes, ceph status will show you what's up on the storage cluster, and ceph -w will keep you up to date as events happen.

ceph pg dump_stuck is useful while Ceph is rebalancing itself.
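
For quick copy-paste, those commands in one place:

ceph status           # one-shot summary of cluster state
ceph -w               # the same, then keep streaming cluster events
ceph pg dump_stuck    # PGs that are stuck; handy while rebalancing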

Identifying RBD Objects

Mapping an RBD name to Ceph's internal object identifier prefix (used by the OSDs as part of the on-disk filenames) can be done in two ways. Sometimes this works:

VNAME=volume-5612246f-a1ec-4faa-9a41-ac33876805bc
POOL=volumes_2
rados -p ${POOL} get rbd_id.${VNAME} - | strings

And sometimes you need to use the “dot rbd” object instead:

VNAME=home-1
POOL=rbdafs-home
rados -p ${POOL} get ${VNAME}.rbd - | strings | grep rb

Note

I have no idea why one works and not the other; I suspect the latter is the older way of doing things?

Rebalancing Scrubs

You can see the days of the week on which the last round of deep-scrubs occurred with a command like the following (taken from http://cephnotes.ksperis.com/blog/2013/08/27/deep-scrub-distribution ):

for date in `ceph pg dump | grep active | awk '{print $22}'`; do
  date +%A -d $date;
done | sort | uniq -c

Re-balancing the ceph deep-scrub schedule may be done with something like the following. Note that this takes a week to run. “A week” comes from the fact that “osd deep scrub interval” is set, by default, to 1 week.

ceph pg dump | awk '/active/{ print $1 }' \
  | (while read i; do echo $i;
       ceph pg deep-scrub $i;
       sleep $((604800 / `ceph pg stat | sed -e 's/^.*: \([0-9]*\) pgs:.*$/\1/'`));
     done)
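
To confirm what the deep-scrub interval actually is on a given OSD, you can ask the daemon over its admin socket (run this on the host carrying that OSD; osd.0 is just an example):

# Dump the running config of one OSD and pick out the scrub interval.
ceph daemon osd.0 config show | grep deep_scrub_interval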

What’s On An OSD

Run:

ceph pg ls-by-osd osd.$X

Slowly Easing In or Out OSDs

Try some shell scripting:

# This script assumes oldschool weight 8 osds! No longer valid. Await
# useful scripts in admins.pub/scripts/ceph
await_ceph() { until ceph health | grep HEALTH_OK; do
  echo -n 'Waiting... '; date; sleep 15; done }

for w in `seq 1 8`; do
 await_ceph
 for osd in 2 3 4 5; do ceph osd crush reweight osd.${osd} $w; done
 sleep 30
done

Quickly Removing an OSD

Avoid double-computation of the new CRUSH map by running ceph osd crush rm ${OSD} rather than marking it out and then taking it down. This triggers only one round of backfilling and results in less data motion. Otherwise, follow the procedure in http://ceph.com/docs/v0.78/rados/operations/add-or-rm-osds/#removing-osds-manual
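
A sketch of the whole removal, assuming a hypothetical OSD number in ${N}; the crush rm is the only step that recomputes the CRUSH map, the rest is bookkeeping:

N=4   # hypothetical OSD number

# Stop the daemon on its host (sysvinit-style, as used elsewhere on this page).
sudo service ceph stop osd.${N}

# Remove it from the CRUSH map; this is the single CRUSH recomputation.
ceph osd crush rm osd.${N}

# Clean up its auth entry and release the OSD id.
ceph auth del osd.${N}
ceph osd rm ${N}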

Note

I assume that, even after a ceph osd crush rm, ceph will still allow the OSD in question to participate in its PGs' replication.

Note

The procedure for adding an OSD to the cluster as officially documented uses ceph osd crush add, which will only compute the CRUSH map once.
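
For the reverse direction, the single-recompute command in that official add procedure looks roughly like this; the weight and host bucket are placeholders, so adjust them to the actual hardware and host:

# Insert a freshly prepared OSD into the CRUSH map in one step.
ceph osd crush add osd.${N} 1.0 host=$(hostname -s)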

Creating a New Mon

Original reference at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

  • Work in a temporary directory, for starters:

    cd /tmp
    
  • Grab the monitors’ keyring, so the mon can authenticate to the others:

    ceph auth get mon. -o keyring
    
  • Grab the current mon map:

    ceph mon getmap -o monmap
    
  • Create the data store for the mon. This should create a directory and files at /var/lib/ceph/mon/ceph-$MON_ID:

    sudo ceph-mon -i $MON_ID --mkfs --monmap monmap --keyring keyring
    
  • Add the mon to the mon map:

    ceph mon add $MON_ID $IP_ADDR
    
  • Start the mon:

    sudo service ceph -a start mon.$MON_ID
    
  • Delete the temporary copy of the monitors’ keyring and the mon map:

    rm keyring monmap
    
  • Update ceph.conf.

Removing a Mon

Original reference at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

  • First, shut down the mon:

    sudo service ceph -a stop mon.$MON_ID
    
  • Remove it from the known set of mons:

    ceph mon remove $MON_ID
    
  • Delete its data store.

    rm -r /var/lib/ceph/mon/ceph-$MON_ID
    
  • Update ceph.conf.

ZFS and Ceph

Look at https://github.com/zfsonlinux/zfs/issues/4913#issuecomment-268182335

Basically, follow those instructions :) But be warned that they can break things, so maybe don't follow them blindly.

Miscellany

CERN also uses Ceph; pay attention to everything they have to say on the matter. Notably, this includes http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern (via http://ceph.com/cephdays/frankfurt/ ).