The ungleich ceph handbook » History » Version 61

Nico Schottelius, 05/28/2022 05:42 PM

h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph, as well as the required maintenance commands.

h2. Processes

h3. Usage monitoring

* Usage should be kept somewhere in the 70-75% range
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
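
The thresholds above can be encoded in a small helper. This is only a sketch: @usage_action@ is a hypothetical function name (not an existing ungleich-tools script), and in practice the percentage would be read from @ceph df@ on a monitor.

```shell
# Sketch: classify a cluster usage percentage against the thresholds above.
# usage_action is a hypothetical helper, not an existing ungleich-tools script.
usage_action() {
    awk -v p="$1" 'BEGIN {
        if (p >= 72.5)   print "add disks"
        else if (p < 70) print "ok"
        else             print "watch"
    }'
}

# In practice, feed it the RAW USED percentage from `ceph df`:
usage_action 73
```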

h3. Phasing in new disks

* Run a 24h performance test prior to using a new disk

h3. Phasing in new servers

* Run a 24h performance test with 1 ssd or 1 hdd (whichever is applicable)

h2. Communication guide

Usually no customer communication is necessary when a disk fails, as ceph automatically compensates and rebalances. However, if multiple disks fail at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on each OSD. This is especially useful to see how well the OSDs are balanced.
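
As a sketch, the PG count per OSD can be pulled out of that output for quick comparison. The helper name @pgs_per_osd@ is our own, and the column positions (NAME last, PGS second to last) are an assumption based on the luminous output format; verify against your own cluster first.

```shell
# Sketch: print "osd.N <pg-count>" pairs from `ceph osd df tree` output.
# Assumes NAME is the last column and PGS the one before it; the column
# order varies between ceph releases, so verify before relying on it.
pgs_per_osd() {
    awk '$NF ~ /^osd\./ { print $NF, $(NF-1) }'
}

# Usage on a monitor:
# ceph osd df tree | pgs_per_osd | sort -k2 -n
```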

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h3. Show config

<pre>
ceph config dump
</pre>

h3. Show backfill and recovery config

<pre>
ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
</pre>

* See also: https://www.suse.com/support/kb/doc/?id=000019693

h2. Adding a new disk/ssd to the ceph cluster

Write the order / purchase date on the disk with a permanent marker.

h3. Checking the shadow trees

To be able to spot differences / weights of hosts, it can be very helpful to look at the crush shadow tree
using @ceph osd crush tree --show-shadow@:

<pre>
-16   hdd-big 653.03418           root default~hdd-big
-34   hdd-big         0         0     host server14~hdd-big
-38   hdd-big         0         0     host server15~hdd-big
-42   hdd-big  81.86153  78.28352     host server17~hdd-big
 36   hdd-big   9.09560   9.09560         osd.36
 59   hdd-big   9.09499   9.09499         osd.59
 60   hdd-big   9.09499   9.09499         osd.60
 68   hdd-big   9.09599   8.93999         osd.68
 69   hdd-big   9.09599   7.65999         osd.69
 70   hdd-big   9.09599   8.35899         osd.70
 71   hdd-big   9.09599   8.56000         osd.71
 72   hdd-big   9.09599   8.93700         osd.72
 73   hdd-big   9.09599   8.54199         osd.73
-46   hdd-big  90.94986  90.94986     host server18~hdd-big
...
</pre>

Here we can see that the weight of server17 for the class hdd-big is about 81, while that of server18 is about 90.
SSDs and other classes have their own shadow trees, too.

h3. For Dell servers

First find the disk, then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX   # X: host is 0, md-array is 1

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the osd, using the class ssd or hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX   # second argument: ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //needs to be tested:// disable recovery so data won't start to move while you have the osd down
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes monit on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -aAll@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -aAll@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creating of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*

h2. OSD related processes

h3. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

h3. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is highly likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

Example dmesg output indicating a broken filesystem:

<pre>
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether it is under warranty; if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of the disk

h3. [[Create new pool and place new osd]]

h3. Configuring auto repair on pgs

<pre>
ceph config set osd osd_scrub_auto_repair true
</pre>

Verify using:

<pre>
ceph config dump
</pre>

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases recovery performance 5 times.

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
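
To repair several inconsistent pgs in one go, the affected ids can be extracted from @ceph health detail@. A sketch, assuming the usual "pg X.Y is ... inconsistent" line format; the helper name @inconsistent_pgs@ is our own.

```shell
# Sketch: extract the ids of inconsistent pgs from `ceph health detail`.
# Assumes lines like "pg 4.6 is active+clean+inconsistent, acting [9,0,12]".
inconsistent_pgs() {
    awk '$1 == "pg" && /inconsistent/ { print $2 }'
}

# On a monitor, repair each of them:
# ceph health detail | inconsistent_pgs | while read -r pg; do
#     ceph pg repair "$pg"
# done
```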

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
 -3           0.87270 host server5
 41     ssd   0.87270     osd.41           up  1.00000 1.00000
 -1         251.85580 root default
 -7          81.56271     host server2
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1         252.72850 root default
...
 -3           0.87270     host server5
 41     ssd   0.87270         osd.41       up  1.00000 1.00000
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and read the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami
0
root@server2:/opt/ungleich-tools# umount /mnt/
</pre>

* Verify in the *ceph osd tree* that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert

h2. Phasing out OSDs

* Either remove directly via /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently
* Or first drain the OSD using @ceph osd crush reweight {name} {weight}@
** Wait until the rebalance is done
** Then remove it
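
The drain-and-wait variant can be scripted. The loop below is a sketch: @rebalance_active@ is our own heuristic grep over @ceph -s@ output, not an official interface, and osd.36 is a placeholder.

```shell
# Sketch: heuristic check whether ceph is still rebalancing, based on
# `ceph -s` output. Not an official interface - verify on your cluster.
rebalance_active() {
    grep -qE 'misplaced|degraded|backfill|recovering'
}

# Drain osd.36, wait for the rebalance, then remove it:
# ceph osd crush reweight osd.36 0
# while ceph -s | rebalance_active; do sleep 60; done
# /opt/ungleich-tools/ceph/ceph-osd-stop-remove-permanently 36
```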

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**.

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate
* Install the gateway:

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h3. How to list the version overview

This lists the versions of osds, mgrs and mons:

<pre>
ceph versions
</pre>

Listing the "features" of clients, osds, mgrs and mons can be done using @ceph features@:

<pre>
[15:32:20] red1.place5:~# ceph features
{
    "mon": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 5
        }
    ],
    "osd": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 44
        }
    ],
    "client": [
        {
            "features": "0x3ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3ffddff8ffacffff",
            "release": "luminous",
            "num": 18
        },
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 31
        }
    ],
    "mgr": [
        {
            "features": "0x3ffddff8ffecffff",
            "release": "luminous",
            "num": 4
        }
    ]
}
</pre>

h3. How to list the version of every OSD and every monitor

To list the version of each ceph OSD:

<pre>
ceph tell osd.* version
</pre>

To list the version of each ceph mon:

<pre>
ceph tell mon.* version
</pre>

The mgrs do not seem to support this command as of 14.2.21.

h2. Performance Tuning

* Ensure that the basic options for reducing rebalancing workload are set:

<pre>
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

* Ensure that "osd_op_queue_cut_off":https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off is set to **high**
** Requires OSD restart on change

<pre>
ceph config set global osd_op_queue_cut_off high
</pre>

<pre>
be sure to check your osd recovery sleep settings, there are several
depending on your underlying drives:

    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.050000",
    "osd_recovery_sleep_hybrid": "0.050000",
    "osd_recovery_sleep_ssd": "0.050000",

Adjusting these upwards will dramatically reduce IO, and takes effect
immediately at the cost of slowing rebalance/recovery.
</pre>

Reference settings from Frank Schilder:

<pre>
  osd       class:hdd      advanced osd_recovery_sleep                0.050000
  osd       class:rbd_data advanced osd_recovery_sleep                0.025000
  osd       class:rbd_meta advanced osd_recovery_sleep                0.002500
  osd       class:ssd      advanced osd_recovery_sleep                0.002500
  osd                      advanced osd_recovery_sleep                0.050000

  osd       class:hdd      advanced osd_max_backfills                 3
  osd       class:rbd_data advanced osd_max_backfills                 6
  osd       class:rbd_meta advanced osd_max_backfills                 12
  osd       class:ssd      advanced osd_max_backfills                 12
  osd                      advanced osd_max_backfills                 3

  osd       class:hdd      advanced osd_recovery_max_active           8
  osd       class:rbd_data advanced osd_recovery_max_active           16
  osd       class:rbd_meta advanced osd_recovery_max_active           32
  osd       class:ssd      advanced osd_recovery_max_active           32
  osd                      advanced osd_recovery_max_active           8
</pre>

(These have not yet been tested in our clusters.)

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this let's have a look at 2 scenarios:

* How long does it take to compensate the loss of the server?

* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000 GiB / 1.25 GiB/s = 80000 s = 22.22 h
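
The same arithmetic as a quick sanity check (numbers taken from the example above):

```shell
# 100 TiB = 100 000 GiB over a 10 Gbit/s (= 1.25 GiB/s) link:
awk 'BEGIN {
    s = 100000 / 1.25          # seconds to transfer everything
    printf "%.0f s = %.2f h\n", s, s / 3600
}'
# prints: 80000 s = 22.22 h
```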

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute
the rebuild over several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s
network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
* HDDs are in the range of tens of MB/s (depending on the workload, but 30-40 MB/s random reads seems realistic)

Further assumptions:

* Assuming further that each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (needs verification, even though 3 runs showed these numbers)

h3. Ceph theoretical foundations

If you are very much into the theoretical foundations of Ceph, check out the "paper":https://www3.nd.edu/~dthain/courses/cse40771/spring2007/papers/ceph.pdf