Revision 11 (Nico Schottelius, 11/07/2018 05:02 PM) → Revision 12/73 (Nico Schottelius, 11/26/2018 11:37 AM)

h1. The ungleich ceph handbook 

 {{toc}} 

 

 h2. Status 

 This document is **IN PRODUCTION**. 

 

 h2. Introduction 

 This article describes the ungleich storage architecture, which is based on ceph. It covers our architecture as well as maintenance commands. Required for  

 h2. Communication guide 

 Usually when a disk fails, no customer communication is necessary, because ceph automatically compensates for the failure and rebalances the data. However, if multiple disks fail at the same time, I/O speed might be reduced and customer experience impacted. 

 For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned. 

 h2. Adding a new disk/ssd to the ceph cluster 

 h3. For Dell servers 

 First find the disk, then add it to the operating system: 

 <pre> 
 megacli -PDList -aALL    | grep -B16 -i unconfigur 

 # Sample output: 
 [19:46:50] server7.place6:~#    megacli -PDList -aALL    | grep -B16 -i unconfigur 
 Enclosure Device ID: N/A 
 Slot Number: 0 
 Enclosure position: N/A 
 Device Id: 0 
 WWN: 0000000000000000 
 Sequence Number: 1 
 Media Error Count: 0 
 Other Error Count: 0 
 Predictive Failure Count: 0 
 Last Predictive Failure Event Seq Number: 0 
 PD Type: SATA 

 Raw Size: 894.252 GB [0x6fc81ab0 Sectors] 
 Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors] 
 Coerced Size: 893.75 GB [0x6fb80000 Sectors] 
 Sector Size:    0 
 Firmware state: Unconfigured(good), Spun Up 
 </pre> 
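When several disks are added at once, reading the enclosure and slot out of the listing by hand gets tedious. The following is a minimal sketch, not part of ungleich-tools: the function name *list_unconfigured_disks* is hypothetical, and it assumes the field labels look exactly like the sample output above.

```shell
#!/bin/sh
# Print "enclosure:slot" for every physical disk whose firmware state
# contains "Unconfigured". Reads `megacli -PDList -aALL` output on stdin.
# Assumption: field labels match the sample output shown above.
list_unconfigured_disks() {
    awk -F':' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Remember the most recent enclosure and slot seen for a disk.
        /^[ \t]*Enclosure Device ID:/ { enclosure = trim($2) }
        /^[ \t]*Slot Number:/         { slot = trim($2) }
        # The firmware state line closes the record we care about.
        /^[ \t]*Firmware state:/ && tolower($2) ~ /unconfigured/ {
            print enclosure ":" slot
        }
    '
}
```

Possible usage: *megacli -PDList -aALL | list_unconfigured_disks*. For the sample output above this would print *N/A:0*.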

 Then add the disk to the OS: 

 <pre> 
 megacli -CfgLdAdd -r0 [enclosure:slot] -aX 

 # Sample call, if enclosure and slot are KNOWN (aka not N/A) 
 megacli -CfgLdAdd -r0 [32:0] -a0 

 # Sample call, if enclosure is N/A 
 megacli -CfgLdAdd -r0 [:0] -a0 
 </pre> 
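The two sample calls differ only in how the enclosure is written. A small sketch of that rule follows; the helper names *pd_spec* and *print_add_command* are hypothetical, and the second function only prints the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Build the [enclosure:slot] argument for megacli -CfgLdAdd.
# An enclosure reported as "N/A" is left empty, as in the sample calls above.
pd_spec() {
    enclosure=$1
    slot=$2
    if [ "$enclosure" = "N/A" ]; then
        enclosure=""
    fi
    printf '[%s:%s]\n' "$enclosure" "$slot"
}

# Print (do not run) the command that adds the disk as a single-disk RAID0.
# $1 = enclosure, $2 = slot, $3 = adapter number (defaults to 0).
print_add_command() {
    printf 'megacli -CfgLdAdd -r0 %s -a%s\n' "$(pd_spec "$1" "$2")" "${3:-0}"
}
```

For example, *print_add_command N/A 0* prints *megacli -CfgLdAdd -r0 [:0] -a0*.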

 

 h2. Moving a disk/ssd to another server 

 (needs to be described better) 

 Generally speaking: 

 * /opt/ungleich-tools/ceph-osd-stop-disable does the following: 
 ** Stops the osd and removes its monit configuration on the server you want to take the disk out of 
 ** Unmounts the disk 
 * Take disk out 
 * Discard preserved cache on the server you took it out of 
 ** using megacli 
 * Insert into new server 
 * Clear foreign configuration 
 ** using megacli 
 * Disk will now appear in the OS, ceph/udev will automatically start the OSD (!) 
 ** No creating of the osd required! 
 * Verify that the disk exists and that the osd is started 
 ** using *ps aux* 
 ** using *ceph osd tree* 
 * */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number 
 ** Creates the monit configuration file so that monit watches the OSD 
 ** Reload monit 
 * Verify monit using *monit status* 
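The verification step with *ceph osd tree* can be scripted. This is a sketch only: *osd_is_up* is a hypothetical helper, and it assumes that the status column directly follows the OSD name, as in the tree output shown elsewhere in this handbook.

```shell
#!/bin/sh
# Succeed if the given OSD (for example osd.41) is listed as "up".
# Reads `ceph osd tree` output on stdin.
# Assumption: the STATUS column directly follows the OSD name.
osd_is_up() {
    awk -v osd="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == osd && $(i + 1) == "up")
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Possible usage after inserting the disk: *ceph osd tree | osd_is_up osd.XX && echo "osd.XX is up"*.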

 

 h2. Removing a disk/ssd 

 To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced. 

 

 h2. Handling DOWN osds with filesystem errors 

 If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done: 

 * Login to any ceph monitor (cephX.placeY.ungleich.ch) 
 * Check **ceph -s**, find host using **ceph osd tree** 
 * Login to the affected host 
 * Run the following commands: 
 ** ls /var/lib/ceph/osd/ceph-XX 
 ** dmesg 
 * Create a new ticket in the datacenter light project 
 ** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch" 
 ** Add (partial) output of above commands 
 ** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster 
 ** Remove the physical disk from the host, check whether it is still under warranty, and if yes 
 *** Write a short letter to the vendor, including the technical details from above 
 *** Record when you sent it in 
 *** Put the ticket into status waiting 
 ** If there is no warranty, dispose of it 
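The diagnostic output for the ticket can be gathered in one go. The following sketch uses a hypothetical helper name (*osd_diagnostics*), and the limit of 50 *dmesg* lines is an arbitrary assumption that stands in for "partial output".

```shell
#!/bin/sh
# Print the diagnostics requested for the ticket. $1 is the numeric OSD id.
# Errors from ls are kept in the output, because with a broken filesystem
# the error message is the interesting part.
osd_diagnostics() {
    osd_id=$1
    echo "== ls /var/lib/ceph/osd/ceph-${osd_id} =="
    ls "/var/lib/ceph/osd/ceph-${osd_id}" 2>&1 || true
    # Assumption: the last 50 kernel messages are enough context.
    echo "== dmesg (last 50 lines) =="
    dmesg 2>&1 | tail -n 50
}
```

Possible usage on the affected host: *osd_diagnostics XX > /tmp/osd-XX.txt*, then paste the relevant parts of the file into the ticket.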



 h2. Change ceph speed for i/o recovery 

 By default we want to keep I/O recovery traffic low so that it does not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance. 

 The default configuration on our servers contains: 

 <pre> 
 [osd] 
 osd max backfills = 1 
 osd recovery max active = 1 
 osd recovery op priority = 2 
 </pre> 

 The important settings are *osd max backfills* and *osd recovery max active*. The priority is always kept low so that regular I/O takes precedence. 

 To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, run the following on any node with the admin keyring: 

 <pre> 
 # 'osd.*' is quoted so that the shell does not expand it as a filename pattern 
 ceph tell 'osd.*' injectargs '--osd-max-backfills Y' 
 ceph tell 'osd.*' injectargs '--osd-recovery-max-active X' 
 </pre> 

 where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles or triples the recovery performance, whereas Y=10 and X=10 increases recovery performance about fivefold. 
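Because the settings are meant to be tuned only temporarily, it helps to have one command to raise them and one to go back. The following is a sketch with hypothetical function names (*set_recovery_speed*, *reset_recovery_speed*). It wraps the two *injectargs* calls from above and assumes that the default value of 1 for both settings is still what the *[osd]* section contains.

```shell
#!/bin/sh
# Set both recovery knobs on all OSDs.
# $1 = osd max backfills (Y), $2 = osd recovery max active (X).
set_recovery_speed() {
    backfills=$1
    active=$2
    # Refuse anything that is not a plain non-negative integer.
    for value in "$backfills" "$active"; do
        case "$value" in
            ''|*[!0-9]*) echo "not a number: '$value'" >&2; return 1 ;;
        esac
    done
    # 'osd.*' is quoted so the shell does not expand it as a filename pattern.
    ceph tell 'osd.*' injectargs "--osd-max-backfills ${backfills}"
    ceph tell 'osd.*' injectargs "--osd-recovery-max-active ${active}"
}

# Go back to the defaults from the [osd] section (assumed to be 1 and 1).
reset_recovery_speed() {
    set_recovery_speed 1 1
}
```

Possible usage: *set_recovery_speed 5 5* while recovering from a multi-disk failure, followed by *reset_recovery_speed* once the cluster is healthy again.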

 

 h2. Debug scrub errors / inconsistent pg message 

 From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem. 

 If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/. 
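Finding the affected pgs can be scripted as well. This sketch assumes that *ceph health detail* reports them on lines of the form *pg 2.1f is active+clean+inconsistent, acting [3,7,12]*; the exact wording may differ between ceph releases. The function names are hypothetical, and the repair commands are only printed, not executed.

```shell
#!/bin/sh
# Print the ids of placement groups reported as inconsistent.
# Reads `ceph health detail` output on stdin.
# Assumption: affected lines start with "pg <id>" and mention "inconsistent".
inconsistent_pgs() {
    awk '$1 == "pg" && /inconsistent/ { print $2 }'
}

# Print (do not run) one repair command per inconsistent pg.
print_repair_commands() {
    inconsistent_pgs | while read -r pg; do
        echo "ceph pg repair ${pg}"
    done
}
```

Possible usage: *ceph health detail | print_repair_commands*, then review the printed commands before running them.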

 h2. Move servers into the osd tree 

 New servers have their buckets placed outside the **default root** and thus need to be moved inside. 
 Output might look as follows: 

 <pre> 
 [11:19:27] server5.place6:~# ceph osd tree 
 ID    CLASS     WEIGHT      TYPE NAME          STATUS REWEIGHT PRI-AFF  
  -3             0.87270 host server5                              
  41       ssd     0.87270       osd.41             up    1.00000 1.00000  
  -1           251.85580 root default                              
  -7            81.56271       host server2                          
   0 hdd-big     9.09511           osd.0          up    1.00000 1.00000  
   5 hdd-big     9.09511           osd.5          up    1.00000 1.00000  
 ... 
 </pre> 


 Use **ceph osd crush move serverX root=default** (where serverX is the new server), 
 which will move the bucket into the right place: 

 <pre> 
 [11:21:17] server5.place6:~# ceph osd crush move server5 root=default 
 moved item id -3 name 'server5' to location {root=default} in crush map 
 [11:32:12] server5.place6:~# ceph osd tree 
 ID    CLASS     WEIGHT      TYPE NAME          STATUS REWEIGHT PRI-AFF  
  -1           252.72850 root default                              
 ... 
  -3             0.87270       host server5                          
  41       ssd     0.87270           osd.41         up    1.00000 1.00000  


 </pre>
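Misplaced hosts can be spotted automatically. The sketch below relies on a loud assumption taken from the sample output above: hosts outside the root are listed *before* the first *root* line of *ceph osd tree*. If your tree is ordered differently, the helper will miss them. The function names are hypothetical, and the move commands are only printed.

```shell
#!/bin/sh
# Print the names of host buckets that appear before the first root bucket.
# Reads `ceph osd tree` output on stdin.
# Assumption: hosts outside the root are listed before the "root" line.
hosts_outside_root() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "root") in_root = 1
                if ($i == "host" && !in_root) print $(i + 1)
            }
        }
    '
}

# Print (do not run) the move command for each misplaced host.
print_move_commands() {
    hosts_outside_root | while read -r host; do
        echo "ceph osd crush move ${host} root=default"
    done
}
```

Possible usage: *ceph osd tree | print_move_commands*. For the first sample tree above this prints the same *ceph osd crush move server5 root=default* command that was run by hand.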