Ceph Storage System¶
The ACM uses Ceph to manage its collection of storage. Ceph itself has extensive documentation.
Configuration¶
The global parameters of our ceph cluster are available for perusal in file:///afs/acm.jhu.edu/readonly/group/admins.pub/ceph.conf. The various Ceph worker nodes should all have /etc/ceph/ceph.conf symlinked to that location. (As usual, be sure to use the RO mountpoint for the benefits of replication; see The Special Case of admins.pub for details.)
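If a worker node is missing the link, it can be (re)created by hand; a minimal sketch, assuming the AFS client is already running on the node:
# Point the node's Ceph configuration at the shared read-only copy in admins.pub
sudo ln -sf /afs/acm.jhu.edu/readonly/group/admins.pub/ceph.conf /etc/ceph/ceph.conf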
Note
This file should be kept up to date. Do note, however, that the setting mon_host gives the clients a list of possible mon locations; only one of them needs to be correct in order for an operation to succeed, as the mons maintain their own address lists in the monmap itself.
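For reference, the line in question looks something like the following (the addresses below are made-up placeholders; the real list lives in the shared ceph.conf):
[global]
# Hypothetical addresses; clients try entries until one of the mons answers
mon_host = 192.0.2.10, 192.0.2.11, 192.0.2.12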
Authentication¶
Ceph rolls its own authentication and authorization layers, which is a bit of a bummer, but it is what it is, I suppose. Thankfully, we don’t fiddle with this often. You can inspect the database with ceph auth list.
Warning
This list command displays secrets to the console, and there does not seem to be an option to suppress that!
Creating a New Ceph User¶
Run ceph auth add ${NAME} with optional repetitions of ${CAPTY} ${CAPVAL} following, e.g. mon "allow r" osd "allow class-read object_prefix rbd_children" or somesuch. ceph auth caps can be used to change a user’s caps later.
The equivalent of extracting a keytab is done with ceph auth get-key.
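A sketch of the whole dance, using a made-up client.example user (the caps and pool name here are illustrative only):
# Create a user with read access to the mons and limited OSD access
ceph auth add client.example mon 'allow r' osd 'allow class-read object_prefix rbd_children'
# Replace the user's caps later; note that this overwrites the whole set
ceph auth caps client.example mon 'allow r' osd 'allow rwx pool=volumes_2'
# Extract the secret, keytab-style, for installation on the client machine
ceph auth get-key client.example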
Maintenance Tasks¶
Getting Cluster Status¶
On magellan or the other ceph worker (mon or osd) nodes, commands like ceph status or ceph -w will show you what’s up on the storage cluster (and keep you up to date).
ceph pg dump_stuck is useful while Ceph is rebalancing itself.
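A couple of extra spot checks that tend to be useful (a sketch; the state argument to dump_stuck is optional):
# Expand on whatever the health summary is complaining about
ceph health detail
# Limit the stuck-pg listing to pgs stuck unclean during a rebalance
ceph pg dump_stuck unclean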
Identifying RBD Objects¶
Map RBD names to the Ceph internal object identifier prefix (used by the OSDs as part of the on-disk filenames). Sometimes this works:
VNAME=volume-5612246f-a1ec-4faa-9a41-ac33876805bc
POOL=volumes_2
rados -p ${POOL} get rbd_id.${VNAME} - | strings
And sometimes you need to use the “dot rbd” object instead:
VNAME=home-1
POOL=rbdafs-home
rados -p ${POOL} get ${VNAME}.rbd - | strings | grep rb
Note
I have no idea why one and not the other; I suspect the latter is the older way of doing things?
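As a hypothetical follow-up, once either command above has given you a block-name prefix, you can list the data objects carrying that prefix (the PREFIX value here is made up):
# PREFIX is whatever the previous command reported, e.g. an rb.0.* or rbd_data.* string
PREFIX=rb.0.75a7.238e1f29
rados -p ${POOL} ls | grep "^${PREFIX}" | head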
Rebalancing Scrubs¶
You can see the days of the week that the last round of deep-scrubs occurred on with a command like the following (taken from http://cephnotes.ksperis.com/blog/2013/08/27/deep-scrub-distribution )
for date in `ceph pg dump | grep active | awk '{print $22}'`; do
    date +%A -d $date
done | sort | uniq -c
Re-balancing the ceph deep-scrub schedule may be done with something like the following. Note that this takes a week to run. “A week” comes from the fact that “osd deep scrub interval” is set, by default, to 1 week.
ceph pg dump | awk '/active/{ print $1 }' \
    | (while read i; do
           echo $i
           ceph pg deep-scrub $i
           sleep $((604800 / `ceph pg stat | sed -e 's/^.*: \([0-9]*\) pgs:.*$/\1/'`))
       done)
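To confirm what the interval actually is on a given OSD, you can ask its admin socket; a sketch, assuming you are on the host carrying osd.0:
# Reports the value in seconds; 604800 is the one-week default
sudo ceph daemon osd.0 config get osd_deep_scrub_interval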
Slowly Easing In or Out OSDs¶
Try some shell scripting:
# This script assumes oldschool weight 8 osds! No longer valid. Await
# useful scripts in admins.pub/scripts/ceph
await_ceph() {
    until ceph health | grep HEALTH_OK; do
        echo -n 'Waiting... '; date; sleep 15
    done
}
for w in `seq 1 8`; do
    await_ceph
    for osd in 2 3 4 5; do ceph osd crush reweight osd.${osd} $w; done
    sleep 30
done
Quickly Removing an OSD¶
Avoid double-computation of the new CRUSH maps by running ceph osd rados
rm ${OSD}
rather than marking it out
and then taking it down
.
This will only trigger one round of backfilling and will result in less data
motion. Otherwise, follow the procedure in
http://ceph.com/docs/v0.78/rados/operations/add-or-rm-osds/#removing-osds-manual
Note
I assume that even after a ceph osd crush rm, Ceph will still allow the OSD in question to participate in the replication of its PGs.
Note
The procedure for adding an OSD to the cluster as officially documented uses ceph osd crush add, which will only compute the CRUSH map once.
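Putting that together, a sketch of the fast-path removal, with osd.7 standing in for the victim:
OSD=osd.7
# Removing it from the CRUSH map first triggers the single round of backfilling
ceph osd crush rm ${OSD}
# Once the cluster has settled, stop the daemon on its host and purge its identity
sudo service ceph stop ${OSD}
ceph auth del ${OSD}
ceph osd rm 7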
Creating a New Mon¶
Original reference at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
Work in a temporary directory, for starters:
cd /tmp
Grab the monitors’ keyring, so the mon can authenticate to the others:
ceph auth get mon. -o keyring
Grab the current mon map:
ceph mon getmap -o monmap
Create the data store for the mon. This should create a directory and files at /var/lib/ceph/mon/ceph-$MON_ID:
sudo ceph-mon -i $MON_ID --mkfs --monmap monmap --keyring keyring
Add the mon to the mon map:
ceph mon add $MON_ID $IP_ADDR
Start the mon:
sudo service ceph -a start mon.$MON_ID
Delete the temporary copy of the monitors’ keyring and the mon map:
rm keyring monmap
Update ceph.conf.
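The ceph.conf update is mostly a matter of adding a stanza for the new mon and extending mon_host in [global]; a minimal sketch, with $MON_ID and $IP_ADDR as above and $MON_HOSTNAME standing in for the machine's short hostname:
[mon.$MON_ID]
    host = $MON_HOSTNAME
    mon addr = $IP_ADDR:6789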
Removing a Mon¶
Original reference at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
First, shut down the mon:
sudo service ceph -a stop mon.$MON_ID
Remove it from the known set of mons:
ceph mon remove $MON_ID
Delete its data store.
rm -r /var/lib/ceph/mon/ceph-$MON_ID
Update ceph.conf.
ZFS and Ceph¶
Look at https://github.com/zfsonlinux/zfs/issues/4913#issuecomment-268182335
Basically, follow those instructions :) Be warned, though: they can break things, so maybe don’t follow them blindly.
Miscellany¶
CERN also uses Ceph; pay attention to everything they have to say on the matter. Notably, this includes http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern (via http://ceph.com/cephdays/frankfurt/ ).