h1. The ungleich ceph handbook

h2. Status

This document is **WORK IN PROGRESS**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as the maintenance commands required for operating it.

h2. Communication guide

Usually no customer communication is necessary when a disk fails, as ceph automatically compensates for the loss and rebalances the data. However, when multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned.

h2. Adding a new disk/ssd to the ceph cluster

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~# megacli -PDList -aALL | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [enclosure:slot] -aX

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>
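
After the logical drive has been created, the new block device should show up in the operating system. A quick sanity check before handing the disk over to ceph (a minimal sketch; the device name depends on the server and is not fixed):

<pre>
# List the block devices known to the kernel; the freshly added drive
# should appear here with its expected size
lsblk

# Check the kernel log for the newly attached device
dmesg | tail
</pre>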

h2. Moving a disk/ssd to another server

h2. Removing a disk/ssd

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore and it is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the affected host using **ceph osd tree**
* Login to the affected host
* Run the following commands (see the example session after this list):
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add the (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether it is still under warranty, and if so
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of the disk
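
A sketch of such a session; osd id 71 and server7.place6 are placeholder values, not taken from a real incident:

<pre>
# On a monitor: overall health and the host that carries the broken OSD
ceph -s
ceph osd tree | grep -B5 'osd.71'

# On the affected host: check whether the OSD filesystem is still readable
ls /var/lib/ceph/osd/ceph-71
dmesg | tail -50

# Remove the OSD permanently from the cluster (ungleich helper script)
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 71
</pre>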

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low so that it does not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.
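
To see the value an OSD is currently running with, the admin socket can be queried on the host that carries the OSD (osd.0 is just an example id):

<pre>
# Run on the host where osd.0 lives
ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active
</pre>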

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases it roughly 5 times.
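
Once recovery has caught up, revert to the defaults from the configuration above so that regular customer I/O gets priority again:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>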

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a *ceph pg repair <number>* fixes the problem.
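
A minimal example of this workflow; the pg id 2.1a is a placeholder that has to be taken from the actual *ceph health detail* output:

<pre>
# Show which pgs are inconsistent
ceph health detail

# Trigger a repair of the affected pg (id taken from the output above)
ceph pg repair 2.1a
</pre>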

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.