h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as maintenance commands. Required for

h2. Processes

h3. Usage monitoring

* Usage should be kept in the 70-75% range (see the commands below)
* If usage reaches 72.5%, we start reducing usage by adding disks
* We stop when usage is below 70%
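
A quick way to check where we are relative to these thresholds, run on any ceph monitor:

<pre>
ceph df            # overall raw usage in percent
ceph osd df tree   # usage and number of PGs per OSD
</pre>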

h3. Phasing in new disks

* 24h performance test prior to using it (see the fio sketch below)

h3. Phasing in new servers

* 24h performance test with 1 ssd or 1 hdd (whatever is applicable), see the fio sketch below
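
The exact test procedure is not standardised yet; below is a minimal sketch using fio for a 24h random-read load. The device name and all parameters are placeholders and need to be adapted to the disk or server under test:

<pre>
# non-destructive 24h random-read burn-in on a disk that is not yet in use
fio --name=burnin --filename=/dev/sdX --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --time_based --runtime=86400
</pre>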

h2. Communication guide

Usually when disks fail, no customer communication is necessary, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how well the OSDs are balanced.

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, in which order / on which date we bought it.

h3. For Dell servers

First find the disk, then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX   # X: adapter number - host is 0, md-array is 1

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD with the appropriate class (ssd or hdd-big):

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS is ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing. If you want to add another disk, do so only after rebalancing has finished:

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking (a condensed command sketch follows the list):

* (needs to be tested:) disable recovery so data won't start to move while you have the osd down
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes monit on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS and ceph/udev will automatically start the OSD (!)
** No creation of the osd is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
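
Condensed command sketch of the above. The osd id XX is a placeholder, the argument format of ceph-osd-stop-disable is assumed (check the script), and setting @noout@ is the usual ceph way of preventing data movement - which is exactly the part marked as untested above:

<pre>
# on a monitor, optionally prevent data movement while the osd is down (untested):
ceph osd set noout

# on the old server:
/opt/ungleich-tools/ceph-osd-stop-disable XX       # argument format assumed
megacli -DiscardPreservedCache -Lall -a0

# physically move the disk, then on the new server:
megacli -CfgForeign -Clear -a0
ps aux | grep ceph-osd                             # osd should be started automatically
ceph osd tree                                      # osd should appear under the new host
/opt/ungleich-tools/monit-ceph-create-start osd.XX
monit status

# on a monitor, if noout was set:
ceph osd unset noout
</pre>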

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
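
Usage sketch, where XX is the numeric osd id as shown by *ceph osd tree*:

<pre>
/opt/ungleich-tools/ceph-osd-stop-remove-permanently XX
</pre>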

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be done (a condensed command sketch follows at the end of this section):

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

<pre>
# Example dmesg output; after seeing messages like these, continue with the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add the (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether it is still under warranty, and if yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it
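
Condensed command sketch of the diagnostic and removal steps (the osd id XX and the hostnames are placeholders):

<pre>
# on a monitor:
ceph -s
ceph osd tree | grep osd.XX          # find the host that holds osd.XX

# on the affected host:
ls /var/lib/ceph/osd/ceph-XX         # typically fails or shows I/O errors
dmesg                                # look for filesystem errors as in the sample above
/opt/ungleich-tools/ceph-osd-stop-remove-permanently XX
</pre>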

h2. [[Create new pool and place new osd]]

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the op priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance about 5 times.
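
Once recovery has caught up, set the values back to the defaults shown above:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>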

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
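
In command form, where PGID is a placeholder for the pg reported by *ceph health detail*:

<pre>
ceph health detail        # shows which pgs are inconsistent
ceph pg repair PGID       # PGID as reported above
</pre>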

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket to the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore / 3-partition based layout.
In the second version of DCL, including OSD autodetection, we use a bluestore / 2-partition based layout.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD (see the condensed sketch below)
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
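
Condensed sketch of the recreation, where the device and the class are placeholders:

<pre>
fdisk /dev/sdX                                             # interactive: g (new empty GPT), w (write and quit)
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS   # CLASS: hdd, ssd, ...
</pre>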

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388. A condensed command sketch follows the list below.

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
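
Condensed command sketch, where the pool name, VMID and PGID are placeholders:

<pre>
ceph health detail                       # shows the pg with unfound objects
ceph osd tree                            # find the server that holds the affected osd
virsh list                               # VMs running on that server
ceph osd map POOL VMID                   # map the VM image to a pg
ceph pg PGID mark_unfound_lost revert    # revert the unfound objects
</pre>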

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**.

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However, it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>

h3. Setting up S3 object storage on Ceph

* Set up a gateway node with Alpine Linux
** Change to edge
** Enable the testing repository
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)

<pre>
apk add ceph-radosgw
</pre>
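
What the radosgw configuration could look like, as a minimal sketch only: the instance name (@client.rgw.s3@), the host name and the port are assumptions and not a description of our production setup; TLS can be terminated either by the frontend itself or by a proxy in front. The certificate paths are the ones produced in the letsencrypt section below.

<pre>
# /etc/ceph/ceph.conf (sketch)
[client.rgw.s3]
host = s3
rgw dns name = s3.ungleich.ch
rgw frontends = beast port=8080
# TLS via the letsencrypt certificate from the next section:
#   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
#   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
</pre>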

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this, let's have a look at two approaches to the following scenario:

* How long does it take to compensate the loss of the server?

* Assuming a server has X TiB storage in Y disks attached and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Then the rebuild time is roughly the amount of data divided by the network speed.

Let's take an example (see also the one-line calculation below):

* A server with @10 disks * 10 TiB@ = 100 TiB ≈ 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100 000 GiB / 1.25 GiB/s = 80 000 s ≈ 22.2 h
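
The same calculation as a shell one-liner, with the numbers from the example above:

<pre>
echo "scale=2; 100000 / 1.25 / 3600" | bc   # GiB / (GiB/s) / (s/h) -> 22.22 hours
</pre>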

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distributing the rebuild over several servers that now pull in data from each other for rebuilding. We need to *read* the data (100 TiB) from the other servers and distribute it to new OSDs, assuming each server has a 10 Gbit/s network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads) - let's assume
* HDDs are in the range of tens of MB/s (depending on the workload, but 30-40 MB/s random reads seems realistic)

Further assumptions:

* Assuming further that each disk should be dedicated at least one CPU core.

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequential
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s RANDOM (need to verify this even though 3 runs showed these numbers)