h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required to operate it.

h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as the data is automatically compensated/rebalanced by ceph. However, in case multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~# megacli -PDList -aALL | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [enclosure position:slot] -aX   # X = adapter number (host adapter is 0, an additional array adapter is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking (a condensed sketch follows after this list):

* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes the monit configuration on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS, and ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*

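A condensed sketch of the whole sequence (the osd id 31 and the adapter number 0 are illustrative, and it is assumed that ceph-osd-stop-disable takes the osd id as its argument):

<pre>
# on the old server: stop the osd, remove the monit entry, umount the disk
/opt/ungleich-tools/ceph-osd-stop-disable 31
megacli -DiscardPreservedCache -Lall -a0        # after physically removing the disk

# on the new server, after inserting the disk
megacli -CfgForeign -Clear -a0
ceph osd tree                                   # osd.31 should reappear under the new host
/opt/ungleich-tools/monit-ceph-create-start osd.31
monit status
</pre>
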
h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

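A typical invocation looks as follows (the osd id 31 is illustrative; the script takes the osd id, as described further below):

<pre>
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 31
</pre>
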
h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

Example dmesg output after which you can proceed with the next steps:

<pre>
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c. Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected. Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(s)
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it, and if yes
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

134 | |||
135 | h2. Change ceph speed for i/o recovery |
||
136 | |||
137 | By default we want to keep I/O recovery traffic low to not impact customer experience. However when multiple disks fail at the same point, we might want to prioritise recover for data safety over performance. |
||
138 | |||
139 | The default configuration on our servers contains: |
||
140 | |||
141 | <pre> |
||
142 | [osd] |
||
143 | osd max backfills = 1 |
||
144 | osd recovery max active = 1 |
||
145 | osd recovery op priority = 2 |
||
146 | </pre> |
||
147 | |||
148 | The important settings are *osd max backfills* and *osd recovery max active*, the priority is always kept low so that regular I/O has priority. |
||
149 | |||
150 | To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can use on any node with the admin keyring: |
||
151 | |||
152 | <pre> |
||
153 | ceph tell osd.* injectargs '--osd-max-backfills Y' |
||
154 | ceph tell osd.* injectargs '--osd-recovery-max-active X' |
||
155 | </pre> |
||
156 | |||
157 | where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times. |
||
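For example, to temporarily speed up recovery and afterwards return to the defaults from the [osd] section above:

<pre>
# roughly 2-3x faster recovery
ceph tell osd.* injectargs '--osd-max-backfills 5'
ceph tell osd.* injectargs '--osd-recovery-max-active 5'

# back to the defaults once recovery has finished
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>
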
158 | |||
159 | h2. Debug scrub errors / inconsistent pg message |
||
160 | 6 | Nico Schottelius | |
161 | 1 | Nico Schottelius | From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem. |
162 | |||
If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-3 0.87270 host server5
41 ssd 0.87270 osd.41 up 1.00000 1.00000
-1 251.85580 root default
-7 81.56271 host server2
0 hdd-big 9.09511 osd.0 up 1.00000 1.00000
5 hdd-big 9.09511 osd.5 up 1.00000 1.00000
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 252.72850 root default
...
-3 0.87270 host server5
41 ssd 0.87270 osd.41 up 1.00000 1.00000
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore-based layout with 3 partitions.
In the second version of DCL, including OSD autodetection, we use a bluestore-based layout with 2 partitions.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami
0
root@server2:/opt/ungleich-tools# umount /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD (see the example after this list)
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

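For the OSD found above (whoami value 0), $osd_name would be osd.0:

<pre>
ceph osd crush remove osd.0
ceph osd rm osd.0
</pre>
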
Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (a scripted variant is sketched after this list)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS

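A non-interactive variant of the fdisk steps above (this assumes @sgdisk@ from the gdisk package is available; device name and class are illustrative):

<pre>
# wipe the old partition table and write a fresh, empty GPT
sgdisk -o /dev/sdk
# then create and start the new OSD
/opt/ungleich-tools/ceph-osd-create-start /dev/sdk hdd
</pre>
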
||
241 | 15 | Jin-Guk Kwon | |
242 | h2. How to fix unfound pg |
||
243 | |||
244 | refer to https://redmine.ungleich.ch/issues/6388 |
||
* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VM is running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert

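A condensed example of the sequence (the pool name, object/VM id and pg id are illustrative placeholders):

<pre>
ceph health detail                        # shows which pg has unfound objects, e.g. pg 2.31
ceph osd tree                             # find the server that carries the affected osd
virsh list                                # on that server: see which VMs run there
ceph osd map ssd-pool vm-disk-42          # map the VM image/object to a pg
ceph pg 2.31 mark_unfound_lost revert     # revert the unfound objects
</pre>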