h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the required maintenance commands.

h2. Communication guide

Usually no customer communication is necessary when a disk fails, as ceph automatically compensates for and rebalances the lost data. However, if multiple disks fail at the same time, I/O speed might be reduced and customer experience might be impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.
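
For reference, a minimal sketch of what to look at (illustrative only; the exact column set varies with the ceph version):

<pre>
ceph osd df tree
# columns typically include ID, CLASS, WEIGHT, REWEIGHT, SIZE, USE, AVAIL, %USE, VAR, PGS and the TYPE NAME tree
# compare the PGS and %USE columns across OSDs of the same class to judge how well they are balanced
</pre>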

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk with a permanent marker in which order / on which date we bought it.

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX (X : host is 0. md-array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l

[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX

[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD, using the class ssd or hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS   # CLASS is ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing. If you want to add another disk, do so only after the rebalancing has finished:

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* //needs to be tested//: disable recovery so that data does not start to move while the osd is down (see the sketch after this list)
* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes the monit configuration on the server you want to take it out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert the disk into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS; ceph/udev will automatically start the OSD (!)
** No creation of the osd is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* Run */opt/ungleich-tools/monit-ceph-create-start osd.XX*, where osd.XX is the osd name + number
** This creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
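
An untested sketch of the recovery-disabling step mentioned above (these are standard ceph cluster flags; verify the procedure before relying on it):

<pre>
# before stopping the osd: prevent ceph from marking it out and from moving data
ceph osd set noout
ceph osd set norebalance

# ... stop the osd, move the disk, start it in the new server ...

# afterwards, re-enable normal behaviour
ceph osd unset norebalance
ceph osd unset noout
</pre>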

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
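
For example, to remove osd.31 (the osd id is only an example; the script lives in /opt/ungleich-tools as used elsewhere in this handbook):

<pre>
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 31
</pre>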

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk / ssd is broken. Steps that need to be taken:

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
<pre>
# Example: if dmesg shows messages like the following, you can go on to the next step
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add the (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether there is warranty on it; if yes:
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h2. [[Create new pool and place new osd]]

h2. Change ceph speed for i/o recovery

By default we want to keep the I/O recovery traffic low so that it does not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.
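
For example, to temporarily speed up recovery with the values from above and to go back to the defaults afterwards (a sketch; it simply uses the injectargs syntax shown above):

<pre>
# speed up recovery
ceph tell osd.* injectargs '--osd-max-backfills 5'
ceph tell osd.* injectargs '--osd-recovery-max-active 5'

# back to the defaults from ceph.conf once recovery has finished
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>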

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
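
A minimal sketch (the pg id below is hypothetical; take the real one from the health output):

<pre>
ceph health detail
# ... shows something like: pg 6.1ab is active+clean+inconsistent, acting [14,31,5]
ceph pg repair 6.1ab
</pre>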

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server), which will move the bucket to the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 

</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore based layout with 3 partitions.
In the second version of DCL, including OSD autodetection, we use a bluestore based layout with 2 partitions.

To convert, we delete the old OSD, clean the partitions and create a new osd:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD:
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
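
For the whoami value from the example above (0), $osd_name would be osd.0:

<pre>
ceph osd crush remove osd.0
ceph osd rm osd.0
</pre>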

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the sketch below)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
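
A minimal sketch of the whole sequence, assuming the disk is /dev/sdX and should get the class hdd:

<pre>
# inside fdisk: "g" creates a new empty GPT partition table, "w" writes it and exits
fdisk /dev/sdX

# then recreate the OSD with the ungleich-tools helper
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
</pre>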

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388.

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg (see the sketch after this list)
** ceph pg [PGID] mark_unfound_lost revert
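
A hypothetical walk-through (the pool name, object name and pg id below are made up; take the real ones from the commands above):

<pre>
ceph health detail                     # shows e.g. "pg 6.12 has 1 unfound objects"
ceph osd tree                          # find the server that hosts the affected osd
virsh list                             # see which VMs run on that server
ceph osd map hdd my-vm-disk-object     # map the object to its pg
ceph pg 6.12 mark_unfound_lost revert  # revert the unfound objects of that pg
</pre>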

h2. Enabling per image RBD statistics for prometheus

<pre>
[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"
</pre>
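
Afterwards the mgr prometheus module exposes per-image metrics. A quick check on the active mgr host (port 9283 is the module's default; the metric prefix should be verified against the running ceph version):

<pre>
curl -s http://localhost:9283/metrics | grep '^ceph_rbd_'
</pre>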

h2. S3 Object Storage

This section is **UNDER CONSTRUCTION**.

h3. Introduction

* See the "Red Hat manual":https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/overview-rgw
* The "ceph docs about object storage":https://docs.ceph.com/docs/mimic/radosgw/

h3. Architecture

* S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.
* s3 buckets are usually

h3. Authentication / Users

* Ceph *can* make use of LDAP as a backend
** However, it uses the clear text username+password as a token
** See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/
* We do not want users to store their regular account credentials on machines
* For this reason we use independent users / tokens, but with the same username as in LDAP

Creating a user:

<pre>
radosgw-admin user create --uid=USERNAME --display-name="Name of user"
</pre>

Listing users:

<pre>
radosgw-admin user list
</pre>

Deleting users and their storage:

<pre>
radosgw-admin user rm --uid=USERNAME --purge-data
</pre>
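
To look up an existing user, including its S3 access and secret key (standard radosgw-admin subcommand):

<pre>
radosgw-admin user info --uid=USERNAME
</pre>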

h3. Setting up S3 object storage on Ceph

* Setup a gateway node with Alpine Linux
** Change to edge (see the repository sketch below)
** Enable testing
* Update the firewall to allow access from this node to the ceph monitors
* Set up the wildcard DNS certificate (see the next section)
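
The edge/testing switch boils down to adjusting the apk repositories, roughly as follows (the mirror URL is the common Alpine default and may differ on our nodes):

<pre>
# /etc/apk/repositories
http://dl-cdn.alpinelinux.org/alpine/edge/main
http://dl-cdn.alpinelinux.org/alpine/edge/community
http://dl-cdn.alpinelinux.org/alpine/edge/testing

# then refresh the package index
apk update
</pre>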

<pre>
apk add ceph-radosgw
</pre>

h3. Wildcard DNS certificate from letsencrypt

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

* run certbot
* update DNS with the first token
* update DNS with the second token

Sample session:

<pre>
s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos 
-d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:

   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

</pre>
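
To check when the currently deployed certificate expires (plain openssl, using the path from the output above):

<pre>
openssl x509 -enddate -noout -in /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
</pre>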

h2. Debugging ceph

<pre>
    ceph status
    ceph osd status
    ceph osd df
    ceph osd utilization
    ceph osd pool stats
    ceph osd tree
    ceph pg stat
</pre>

h2. Ceph theory

h3. How much data per Server?

Q: How much data should we add into one server?
A: Not more than it can handle.

How much data can a server handle? For this, let's have a look at two scenarios:

* How long does it take to compensate the loss of the server?

* Assume a server has X TiB of storage in Y attached disks and a network speed of Z GiB/s.
* And our estimated rebuild goal is to compensate the loss of a server within U hours.

h4. Approach 1

Then the naive rebuild time is the total data (X TiB) divided by the network speed (Z GiB/s).

Let's take an example:

* A server with @10 disks * 10 TiB@ = 100 TiB = 100 000 GiB data. It is network connected with 10 Gbit = 1.25 GiB/s.
* 100000/1.25 = 80000s = 22.22h
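
The same arithmetic as a quick shell check (the numbers are the ones from the example above):

<pre>
echo "100000 / 1.25" | bc -l         # 80000 seconds
echo "100000 / 1.25 / 3600" | bc -l  # ~22.2 hours
</pre>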

However, our logic assumes that we actually rebuild from the failed server, which... is failed.

h4. Approach 2: calculating with the remaining servers

However, we can also apply our logic to distribute the rebuild over the several servers that now pull in data from each other for rebuilding.
We need to *read* the data (100 TiB) from the other servers and distribute it to the new OSDs, assuming each server has a 10 Gbit/s network connection.

Now the servers might need to *read* (get data from other osds) and *write* (send data to other osds). Luckily, networking is 10 Gbit/s duplex - i.e. in both directions.

However, how fast can we actually read data from the disks?

* SSDs are in the range of hundreds of MB/s (best case, not necessarily true for random reads)
* HDDs are in the range of tens of MB/s (depending on the work load, but 30-40 MB/s for random reads seems realistic)

Further assumptions:

* Each disk should be dedicated at least one CPU core.
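
A back-of-the-envelope sketch of the distributed rebuild, with made-up numbers (9 surviving servers is an assumption; the per-disk read speed is the HDD estimate from above):

<pre>
# 9 surviving servers * 10 HDDs each * ~35 MB/s random read (server count is made up)
echo "9 * 10 * 35 / 1024" | bc -l    # ~3.1 GiB/s aggregate read bandwidth
echo "100000 / 3.1 / 3600" | bc -l   # ~9 hours to re-read 100 000 GiB
# in this sketch the disks, not the 10 Gbit/s network links, are the bottleneck
</pre>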

h3. Disk/SSD speeds

* Tuning for #8473 showed that a 10TB HDD can write up to 180-200MB/s when backfilling (at about 70% cpu usage and 20% disk usage), with max backfills = 8
* Debugging SSD usage in #8461 showed SSDs can read about 470-520MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 170-280MB/s sequentially
* Debugging SSD usage in #8461 showed SSDs can write about 4MB/s for RANDOM writes (needs to be verified, even though 3 runs showed these numbers)