h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as maintenance commands. Required for

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range (a quick check is sketched below)
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
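
To check where we stand, the overall raw usage and the per-OSD distribution can be read with the standard ceph tools:

<pre>
# Overall cluster usage; the %RAW USED column is the number we track
ceph df

# Per-OSD usage and PG count, useful to spot imbalance
ceph osd df tree
</pre>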
20 | |||
21 | h3. Phasing in new disks |
||
22 | |||
23 | * 24h performance test prior to using it |
||
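
One way to run such a test (a sketch only; the fio parameters below are assumptions, not our documented procedure, and the test DESTROYS DATA on the target device):

<pre>
# 24h mixed random read/write burn-in on an unused disk (destroys data on /dev/sdX)
fio --name=burnin --filename=/dev/sdX --rw=randrw --bs=4k --iodepth=32 \
    --ioengine=libaio --direct=1 --time_based --runtime=86400
</pre>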
24 | |||
25 | h3. Phasing in new servers |
||
26 | |||
27 | * 24h performance test with 1 ssd or 1 hdd (whatever is applicable) |
||
28 | |||
29 | |||
30 | 1 | Nico Schottelius | h2. Communication guide |
31 | |||
32 | Usually when disks fails no customer communication is necessary, as it is automatically compensated/rebalanced by ceph. However in case multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted. |
||
33 | |||
34 | For this reason communicate whenever I/O recovery settings are temporarily tuned. |
||
35 | |||
36 | 20 | Nico Schottelius | h2. Analysing |
37 | |||
38 | 21 | Nico Schottelius | h3. ceph osd df tree |
39 | 20 | Nico Schottelius | |
40 | Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced. |
||
41 | |||
42 | 22 | Nico Schottelius | h3. Find out the device of an OSD |
43 | |||
44 | Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located: |
||
45 | |||
46 | <pre> |
||
47 | |||
48 | [16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31 |
||
49 | /dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota) |
||
50 | </pre> |
||
51 | |||
52 | 57 | Nico Schottelius | h3. Show config |
53 | |||
54 | <pre> |
||
55 | ceph config dump |
||
56 | </pre> |
||
57 | |||
58 | 58 | Nico Schottelius | h3. Show backfill and recovery config |
59 | |||
60 | <pre> |
||
61 | ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills" |
||
62 | </pre> |
||
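
Individual values can also be queried directly (a sketch; assumes a release that supports @ceph config get@ and the @ceph tell ... config get@ interface):

<pre>
# Cluster-wide default for one option
ceph config get osd osd_max_backfills

# Value currently in effect on a specific OSD
ceph tell osd.0 config get osd_recovery_max_active
</pre>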
63 | |||
64 | 59 | Nico Schottelius | * See also: https://www.suse.com/support/kb/doc/?id=000019693 |
65 | |||
66 | 63 | Nico Schottelius | h3. Checking and clearing crash reports |
67 | |||
68 | If the cluster is reporting HEALTH_WARN and a recent crash such as: |
||
69 | |||
70 | <pre> |
||
71 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s |
||
72 | cluster: |
||
73 | id: ... |
||
74 | health: HEALTH_WARN |
||
75 | 1 daemons have recently crashed |
||
76 | </pre> |
||
77 | |||
78 | One can analyse it using |
||
79 | |||
80 | * List the crashes: @ceph crash ls@ |
||
81 | * Checkout the details: @ceph crash info <id>@ |
||
82 | |||
83 | To archive the error: |
||
84 | |||
85 | * To archive a specific report: @ceph crash archive <id>@ |
||
86 | * To archive all: @ceph crash archive-all@ |
||
87 | |||
88 | After archiving, the cluster health should return to HEALTH_OK: |
||
89 | |||
90 | <pre> |
||
91 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash ls |
||
92 | ID ENTITY NEW |
||
93 | 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc mon.c * |
||
94 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc |
||
95 | [rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s |
||
96 | cluster: |
||
97 | id: .. |
||
98 | health: HEALTH_OK |
||
99 | |||
100 | </pre> |
||
101 | |||

h2. Adding a new disk/ssd to the ceph cluster

Write the order / purchase date on the disk with a permanent marker.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big
-34   hdd-big         0         0     host server14~hdd-big
-38   hdd-big         0         0     host server15~hdd-big
-42   hdd-big  81.86153  78.28352     host server17~hdd-big
 36   hdd-big   9.09560   9.09560         osd.36
 59   hdd-big   9.09499   9.09499         osd.59
 60   hdd-big   9.09499   9.09499         osd.60
 68   hdd-big   9.09599   8.93999         osd.68
 69   hdd-big   9.09599   7.65999         osd.69
 70   hdd-big   9.09599   8.35899         osd.70
 71   hdd-big   9.09599   8.56000         osd.71
 72   hdd-big   9.09599   8.93700         osd.72
 73   hdd-big   9.09599   8.54199         osd.73
-46   hdd-big  90.94986  90.94986     host server18~hdd-big
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk, then add it to the operating system:

<pre>
megacli -PDList -aALL | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~# megacli -PDList -aALL | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X: host is 0, md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the osd for ssd/hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX (XXX = ssd or hdd-big)
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check rebalancing (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking (a condensed sketch of the commands follows after this list):

* //needs to be tested: disable recovery so data won't start moving while the osd is down
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stop the osd, remove monit on the server you want to take it out of
** umount the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
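
Putting the above together as a sketch (osd.31 is a hypothetical example id; we assume ceph-osd-stop-disable takes the osd id like its sibling ceph-osd-stop-remove-permanently):

<pre>
# On the old server
/opt/ungleich-tools/ceph-osd-stop-disable 31     # assumption: takes the osd id
megacli -DiscardPreservedCache -Lall -aAll

# ... physically move the disk to the new server ...

# On the new server
megacli -CfgForeign -Clear -aAll
ps aux | grep ceph-osd                           # OSD should be started by ceph/udev
ceph osd tree | grep osd.31                      # should appear under the new host
/opt/ungleich-tools/monit-ceph-create-start osd.31
monit status
</pre>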

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

Example dmesg output indicating a broken disk (after checking this, you can proceed to the next step):

<pre>
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c. Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected. Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check if there is warranty on it and if yes
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low so as not to impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.
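
For example, to get the moderate speed-up described above and revert afterwards:

<pre>
# Y=5 / X=5: roughly 2-3x recovery performance
ceph tell osd.* injectargs '--osd-max-backfills 5'
ceph tell osd.* injectargs '--osd-recovery-max-active 5'

# Back to our defaults once recovery is done
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>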
325 | |||
326 | h2. Debug scrub errors / inconsistent pg message |
||
327 | 6 | Nico Schottelius | |
328 | 1 | Nico Schottelius | From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem. |
329 | |||
330 | If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/. |
||
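
A typical session (the pg id 4.2a is a hypothetical example):

<pre>
ceph health detail
# ... pg 4.2a is active+clean+inconsistent, acting [7,12,31] ...
ceph pg repair 4.2a
</pre>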

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS    WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
-3           0.87270   host server5
41  ssd      0.87270       osd.41           up  1.00000 1.00000
-1         251.85580   root default
-7          81.56271       host server2
 0  hdd-big  9.09511           osd.0        up  1.00000 1.00000
 5  hdd-big  9.09511           osd.5        up  1.00000 1.00000
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS    WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
-1         252.72850   root default
...
-3           0.87270       host server5
41  ssd      0.87270           osd.41       up  1.00000 1.00000
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and read the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami
0
root@server2:/opt/ungleich-tools# umount /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD:
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the sketch below)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
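
A non-interactive equivalent of the fdisk steps (a sketch; double-check the device name before running, as this wipes the partition table):

<pre>
# "g" creates a new empty GPT label, "w" writes it to disk
printf 'g\nw\n' | fdisk /dev/sdX

# Then create the new OSD, e.g. with class hdd
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
</pre>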

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388 (a condensed sketch of the commands follows after this list):

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
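
Put together (a sketch; the pool name "hdd" and pg id 2.17 are hypothetical examples):

<pre>
ceph health detail                      # find the affected PG, e.g. 2.17
ceph osd tree                           # locate the OSD and its host
virsh list                              # on the host: see which VMs are affected
ceph osd map hdd <VMID>                 # check the pg map for the image
ceph pg 2.17 mark_unfound_lost revert   # revert unfound objects to their previous version
</pre>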

h2. Phasing out OSDs

* Either directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain it using @ceph osd crush reweight osd.XX 0@ (sketched below)
** Wait until the rebalance is done
** Then remove
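
The draining variant as a sketch (osd.17 is a hypothetical example):

<pre>
ceph osd crush reweight osd.17 0     # start draining; data moves off the OSD
watch -n 60 ceph -s                  # wait until rebalancing has finished
/opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently 17
</pre>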

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
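
To verify that per-image metrics are actually exported (a sketch; assumes the mgr prometheus module listens on its default port 9283 and that per-image metrics carry the @ceph_rbd_@ prefix):

<pre>
curl -s http://<active-mgr>:9283/metrics | grep '^ceph_rbd_'
</pre>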

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)
* Install the gateway:

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings:

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos \
  -d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le
</pre>

h2. Debugging ceph

<pre>
ceph status
ceph osd status
ceph osd df
ceph osd utilization
ceph osd pool stats
ceph osd tree
ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.
650 | |||
651 | 49 | Nico Schottelius | h2. Performance Tuning |
652 | |||
653 | * Ensure that the basic options for reducing rebalancing workload are set: |
||
654 | |||
655 | <pre> |
||
656 | osd max backfills = 1 |
||
657 | osd recovery max active = 1 |
||
658 | osd recovery op priority = 2 |
||
659 | </pre> |
||
660 | |||
661 | * Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high** |
||
662 | ** Requires OSD restart on change |
||
663 | |||
664 | 50 | Nico Schottelius | <pre> |
665 | ceph config set global osd_op_queue_cut_off high |
||
666 | </pre> |
||
667 | |||
668 | 51 | Nico Schottelius | <pre> |
669 | be sure to check your osd recovery sleep settings, there are several |
||
670 | depending on your underlying drives: |
||
671 | |||
672 | "osd_recovery_sleep": "0.000000", |
||
673 | "osd_recovery_sleep_hdd": "0.050000", |
||
674 | "osd_recovery_sleep_hybrid": "0.050000", |
||
675 | "osd_recovery_sleep_ssd": "0.050000", |
||
676 | |||
677 | Adjusting these will upwards will dramatically reduce IO, and take effect |
||
678 | immediately at the cost of slowing rebalance/recovery. |
||
679 | </pre> |
||
680 | |||
681 | 52 | Nico Schottelius | Reference settings from Frank Schilder: |
682 | |||
683 | <pre> |
||
684 | osd class:hdd advanced osd_recovery_sleep 0.050000 |
||
685 | osd class:rbd_data advanced osd_recovery_sleep 0.025000 |
||
686 | osd class:rbd_meta advanced osd_recovery_sleep 0.002500 |
||
687 | osd class:ssd advanced osd_recovery_sleep 0.002500 |
||
688 | osd advanced osd_recovery_sleep 0.050000 |
||
689 | |||
690 | osd class:hdd advanced osd_max_backfills 3 |
||
691 | osd class:rbd_data advanced osd_max_backfills 6 |
||
692 | osd class:rbd_meta advanced osd_max_backfills 12 |
||
693 | osd class:ssd advanced osd_max_backfills 12 |
||
694 | osd advanced osd_max_backfills 3 |
||
695 | |||
696 | osd class:hdd advanced osd_recovery_max_active 8 |
||
697 | osd class:rbd_data advanced osd_recovery_max_active 16 |
||
698 | osd class:rbd_meta advanced osd_recovery_max_active 32 |
||
699 | osd class:ssd advanced osd_recovery_max_active 32 |
||
700 | osd advanced osd_recovery_max_active 8 |
||
701 | </pre> |
||
702 | |||
703 | (have not yet been tested in our clusters) |
||
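
If we ever want to try them, class-specific values like the above can be set with config masks (a sketch; assumes a Ceph release that supports the @who/class:...@ mask syntax of @ceph config set@):

<pre>
# Example: slow down recovery on HDD OSDs only
ceph config set osd/class:hdd osd_recovery_sleep 0.05
</pre>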

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add to one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate for the loss of the server?

* Assume a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate for the loss of a server within U hours (a rough formula follows below).
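
As a formula (using the same approximations as the example in Approach 1, i.e. 1 TiB ~ 1000 GiB and the network as the bottleneck):

<pre>
U [h] = X [TiB] * 1000 / Z [GiB/s] / 3600
# e.g. X = 100 TiB, Z = 1.25 GiB/s:  100000 / 1.25 / 3600 ~ 22.2 h
</pre>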
718 | |||
719 | |||
720 | h4. Approach 1 |
||
721 | |||
722 | Then |
||
723 | |||
724 | Let's take an example: |
||
725 | |||
726 | * A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s. |
||
727 | * 100000/1.25 = 80000s = 22.22h |
||
728 | |||
729 | However, our logic assumes that we actually rebuild from the failed server, which... is failed. |
||
730 | |||
731 | h4. Approach 2: calculating with left servers |
||
732 | |||
733 | However we can apply our logic also to distribute |
||
734 | the rebuild over several servers that now pull in data from each other for rebuilding. |
||
735 | We need to *read* the data (100TiB) from other servers and distribute it to new OSDs. Assuming each server has a 10 Gbit/s |
||
736 | network connection. |
||
737 | |||
738 | Now the servers might need to *read* (get data from other osds) and *write) (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions. |
||
739 | |||
740 | However how fast can we actually read data from the disks? |
||
741 | |||
742 | * SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume |
||
743 | * HDDs are in the range of tenths of MB/s (depending on the work load, but 30-40 MB/s random reads seems realistic) |
||
744 | |||
745 | |||
746 | |||
747 | |||
748 | Further assumptions: |
||
749 | |||
750 | * Assuming further that each disk should be dedicated at least one CPU core. |
||
751 | 43 | Nico Schottelius | |
752 | h3. Disk/SSD speeds |
||
753 | |||
754 | 44 | Nico Schottelius | * Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8 |
755 | 43 | Nico Schottelius | * Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential |
756 | * Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential |
||
757 | * Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers) |
||
758 | 47 | Dominique Roux | |
759 | 48 | Dominique Roux | h3. Ceph theoretical fundament |
760 | 47 | Dominique Roux | |
761 | If you are very much into the theoretical fundament of Ceph check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf |