h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture that is based on ceph. It describes our architecture as well as the maintenance commands required for it.

h2. Communication guide

Usually, when a disk fails, no customer communication is necessary, as the failure is automatically compensated for and rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus the customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how well the OSDs are balanced.
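A trimmed example might look like this (illustrative only, the values below are made up; the VAR and PGS columns are the interesting ones for judging balance):

<pre>
# Illustrative sketch -- values are made up, exact columns vary by ceph version
[16:05:11] server2.place6:~# ceph osd df tree
ID CLASS   WEIGHT    REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
-1         251.85580        -  252TiB  180TiB 71.9TiB 71.43 1.00   - root default
 0 hdd-big   9.09511  1.00000 9.10TiB 6.51TiB 2.59TiB 71.56 1.00 203         osd.0
...
</pre>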

h3. Find out the device of an OSD

Use @mount | grep /var/lib/ceph/osd/ceph-OSDID@ on the server on which the OSD is located:

<pre>
[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
</pre>

h2. Adding a new disk/ssd to the ceph cluster

Write on the disk, with a permanent marker, the order / date on which we bought it.

h3. For Dell servers

First find the disk, then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [enclosure position:slot] -aX # X = adapter number (host is 0, array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

Then check the disk:

<pre>
fdisk -l
[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#
</pre>

Then create a fresh GPT partition table:

<pre>
/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX
[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......
</pre>

Then create the OSD, using class ssd or hdd-big:

<pre>
/opt/ungleich-tools/ceph-osd-create-start /dev/XXX CLASS # CLASS is ssd or hdd-big
[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14
osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#
</pre>

Then check the rebalancing (if you want to add another disk, do so only after rebalancing has finished):

<pre>
ceph -s
[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking (see the condensed sketch after this list):

* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stops the osd and removes the monit entry on the server you want to take the disk out of
** Unmounts the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli: @megacli -DiscardPreservedCache -Lall -a0@
* Insert it into the new server
* Clear the foreign configuration
** using megacli: @megacli -CfgForeign -Clear -a0@
* The disk will now appear in the OS, and ceph/udev will automatically start the OSD (!)
** No creation of the osd is required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where XX is the osd number
** Creates the monit configuration file so that monit watches the OSD
** Reloads monit
* Verify monit using *monit status*
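
Condensed into commands, the move looks roughly like this (a sketch, not a script: osd.XX, the device and the adapter number @-a0@ are placeholders, and the argument format of ceph-osd-stop-disable is assumed):

<pre>
# On the old server: stop the OSD, remove its monit entry, umount the disk
/opt/ungleich-tools/ceph-osd-stop-disable osd.XX   # argument form assumed

# Take the disk out, then discard the preserved cache on the old server
megacli -DiscardPreservedCache -Lall -a0

# Insert the disk into the new server and clear the foreign configuration
megacli -CfgForeign -Clear -a0

# ceph/udev starts the OSD automatically; verify, then let monit watch it
ps aux | grep ceph-osd
ceph osd tree
/opt/ungleich-tools/monit-ceph-create-start osd.XX
monit status
</pre>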

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
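
A minimal example, assuming the failed disk is osd.22 (hypothetical id):

<pre>
# Shuts down osd.22 (if still running) and removes it from the cluster;
# its data is rebalanced onto the remaining OSDs.
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 22

# Watch the rebalancing progress
ceph -s
</pre>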

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is highly likely that the disk / ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg

<pre>
# Example: if dmesg shows messages like the following, continue with the next steps
[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected.  Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(s)
</pre>

* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host, check whether there is warranty on it, and if yes
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the op priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases recovery performance about 5 times.
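
Once the recovery has finished, revert to the defaults shown above with the same mechanism:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
</pre>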

h2. Debug scrub errors / inconsistent pg message

From time to time, disks don't save what they are told to save. Ceph scrubbing detects these errors and switches the cluster to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.
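
A typical session, assuming pg 2.5 (hypothetical id) is the one reported as inconsistent:

<pre>
# Find the affected pg(s)
ceph health detail
# ... pg 2.5 is active+clean+inconsistent ...

# Instruct ceph to repair that pg
ceph pg repair 2.5
</pre>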

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
The output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore-based layout with 3 partitions.
In the second version of DCL, which includes OSD autodetection, we use a bluestore-based layout with 2 partitions.

To convert, we delete the old OSD, clean the partitions and create a new OSD:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and find the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD (see the sketch below):
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name
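
With the whoami output from above, $osd_name would be osd.0, so the deletion looks like this:

<pre>
# whoami said 0, so the osd name is osd.0
ceph osd crush remove osd.0
ceph osd rm osd.0
</pre>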

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table (see the interactive session after this list)
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
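
The partition table step as an interactive session (a sketch; /dev/sdX is your disk, the GUID will differ):

<pre>
[12:00:00] server2.place6:~# fdisk /dev/sdX

Command (m for help): g
Created a new GPT disklabel (GUID: ...).

Command (m for help): w
The partition table has been altered.

[12:00:30] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
</pre>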

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388 (see also the session sketch at the end of this list):

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VMs are running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
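
Put together, a session might look like this (a sketch; pool name, VMID and PGID are placeholders to fill in):

<pre>
ceph health detail                       # find the pg with unfound objects
ceph osd tree                            # locate the server that hosts the affected osd
virsh list                               # on that server: see which VMs are running
ceph osd map [osd pool] [VMID]           # check which pg the VM image maps to
ceph pg [PGID] mark_unfound_lost revert  # revert the pg to its last known state
</pre>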