The ungleich ceph handbook » History » Version 63

Nico Schottelius, 09/19/2022 07:09 AM

h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required for operating it.
h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop adding disks once usage is below 70% again
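The policy above can be sketched as a tiny state check (a minimal sketch; the thresholds are from this handbook, the function name is made up for illustration):

```python
# Hypothetical sketch of the usage-monitoring policy above.
# Thresholds (72.5% trigger, 70% stop) come from this handbook.
def disks_needed(usage_percent: float, adding: bool = False) -> bool:
    """Return True while we should keep adding disks."""
    if adding:
        # once we started adding disks, keep going until usage drops below 70%
        return usage_percent >= 70.0
    # otherwise only start once the 72.5% trigger is reached
    return usage_percent >= 72.5
```

The two-threshold (hysteresis) design avoids flapping between "add disks" and "done" around a single boundary.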
h3. Phasing in new disks

* Run a 24h performance test prior to using a disk

h3. Phasing in new servers

* Run a 24h performance test with 1 ssd or 1 hdd (whichever is applicable)
h2. Communication guide

Usually no customer communication is necessary when a disk fails, as ceph automatically compensates for it and rebalances the data. However, if multiple disks fail at the same time, I/O speed might be reduced and thus the customer experience impacted.

For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned.
h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on each OSD. This is especially useful for seeing how well the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>
h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>

* See also: https://www.suse.com/support/kb/doc/?id=000019693
h3. Checking and clearing crash reports

If the cluster is reporting HEALTH_WARN and a recent crash such as:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
</pre>

one can analyse it using:

* List the crashes: @ceph crash ls@
* Check the details: @ceph crash info <id>@

To archive the error:

* To archive a specific report: @ceph crash archive <id>@
* To archive all: @ceph crash archive-all@

After archiving, the cluster health should return to HEALTH_OK:

<pre>
[rook@rook-ceph-tools-f569797b4-z4542 /]$  ceph crash ls
ID                                                                ENTITY  NEW  
2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc  mon.c    *   
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph crash archive 2022-09-19T04:33:19.378500Z_b2e26755-0712-41de-bf2b-b370dbe94ebc 
[rook@rook-ceph-tools-f569797b4-z4542 /]$ ceph -s
  cluster:
    id:     ..
    health: HEALTH_OK
</pre>
h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, in which order / on which date we bought it.

h3. Checking the shadow trees

To be able to spot differences in the weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:
<pre>
-16   hdd-big 653.03418           root default~hdd-big        
-34   hdd-big         0         0     host server14~hdd-big   
-38   hdd-big         0         0     host server15~hdd-big   
-42   hdd-big  81.86153  78.28352     host server17~hdd-big   
 36   hdd-big   9.09560   9.09560         osd.36              
 59   hdd-big   9.09499   9.09499         osd.59              
 60   hdd-big   9.09499   9.09499         osd.60              
 68   hdd-big   9.09599   8.93999         osd.68              
 69   hdd-big   9.09599   7.65999         osd.69              
 70   hdd-big   9.09599   8.35899         osd.70              
 71   hdd-big   9.09599   8.56000         osd.71              
 72   hdd-big   9.09599   8.93700         osd.72              
 73   hdd-big   9.09599   8.54199         osd.73              
-46   hdd-big  90.94986  90.94986     host server18~hdd-big   
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.
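A host bucket's crush weight is simply the sum of the weights of its OSDs, which can be double-checked against the sample output above:

```python
# The crush weight of a host bucket equals the sum of its OSD weights.
# These are the weights of server17's OSDs from the sample output above.
osd_weights = [9.09560, 9.09499, 9.09499,
               9.09599, 9.09599, 9.09599,
               9.09599, 9.09599, 9.09599]

host_weight = sum(osd_weights)
# matches the 81.86153 shown for host server17~hdd-big (up to rounding)
assert abs(host_weight - 81.86153) < 0.001
```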
h3. For Dell servers

First find the disk and then add it to the operating system.

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX  # X: host is 0, md-array is 1

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>
Then create a GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the osd for ssd/hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX  # second argument: ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, do so only after the rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>
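The misplaced percentage reported by @ceph -s@ is simply misplaced objects divided by total objects, which we can double-check against the sample output above:

```python
# Double-check the misplaced ratio from the sample "ceph -s" output above.
misplaced, total = 2248811, 49628409
percent = 100 * misplaced / total
assert round(percent, 3) == 4.531   # matches the "(4.531%)" reported by ceph
```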
h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //needs to be tested: disable recovery so data won't start to move while you have the osd down//
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes monit on the server you want to take it out of
** Umounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd name + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

<pre>
# Example: after checking the dmesg output, you can proceed to the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether there is warranty on it; if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of the disk
h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>
h2. Change ceph speed for i/o recovery

By default we want to keep the I/O recovery traffic low so as not to impact the customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can use the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases recovery performance about 5 times.
h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they were told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
The output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>
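The weights in the two tree listings above are consistent: after the move, the default root grows by exactly server5's weight, which we can sanity-check:

```python
# Sanity check on the crush weights in the sample output above: after
# moving server5 (weight 0.87270) under the default root, the root's
# weight grows by exactly that amount (251.85580 -> 252.72850).
before, server5, after = 251.85580, 0.87270, 252.72850
assert abs((before + server5) - after) < 1e-6
```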
h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and read the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD:
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
h2. How to fix an unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VM is running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
h2. Phasing out OSDs

* Either remove the OSD directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain it using @ceph osd crush reweight osd.XX 0@
** Wait until the rebalance is done
** Then remove it
h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular accounts on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>
h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate

<pre>
apk add ceph-radosgw
</pre>
h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le
</pre>
h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>
h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.
h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires an OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and take effect
immediately at the cost of slowing rebalance/recovery.
</pre>

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(These have not yet been tested in our clusters.)
h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this, let's have a look at 2 scenarios:

* How long does it take to compensate for the loss of the server?

* Assume a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate for the loss of a server within U hours.

h4. Approach 1

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h

However, this logic assumes that we actually rebuild from the failed server, which... is failed.
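The back-of-the-envelope calculation above can be written down as a small sketch (the formula, rebuild time = data / network speed, is from this section; the helper name is made up for illustration):

```python
# Naive rebuild-time estimate from this section: all of the lost
# server's data is pushed over a single network link.
def rebuild_hours(data_gib: float, net_gib_per_s: float) -> float:
    return data_gib / net_gib_per_s / 3600  # seconds -> hours

# The example above: 100 TiB = 100 000 GiB over 10 Gbit/s = 1.25 GiB/s
hours = rebuild_hours(100_000, 1.25)
assert round(hours, 2) == 22.22   # i.e. 80000 s
```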
h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distributing
the rebuild over the several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from the other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s
network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads)
* HDDs are in the range of tens of MB/s (depending on the workload, but 30-40 MB/s for random reads seems realistic)

Further assumptions:

* Assuming further that each disk should be dedicated at least one CPU core.
h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), with max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write only about 4MB/s RANDOM (need to verify this, even though 3 runs showed these numbers)
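Taking the measured HDD backfill speed above, a rough per-disk estimate of how long backfilling takes (a sketch; the disk size and speeds are the numbers from this section, the helper name is made up):

```python
# Rough per-disk backfill duration at the speeds measured above.
def backfill_hours(disk_tb: float, mb_per_s: float) -> float:
    return disk_tb * 1_000_000 / mb_per_s / 3600  # TB -> MB, s -> h

# A 10TB HDD backfilling at the measured 180-200MB/s:
fast = backfill_hours(10, 200)   # ~13.9 h
slow = backfill_hours(10, 180)   # ~15.4 h
```

In other words, even at full measured speed, refilling a single large HDD is a matter of half a day or more.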
h3. Ceph theoretical fundament

If you are very much into the theoretical fundament of Ceph, check out their "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf