The ungleich ceph handbook » History » Version 21

Nico Schottelius, 02/26/2019 02:17 PM

h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **IN PRODUCTION**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required to operate it.

h2. Communication guide

Usually, when a single disk fails, no customer communication is necessary, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus the customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Analysing

h3. ceph osd df tree

Using @ceph osd df tree@ you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.
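
To spot imbalance quickly, the PGS column can be sorted. A sketch, assuming the PGS value is the second-to-last field of each osd line (as in Luminous-era output; the column layout differs between ceph releases, so verify against the header line first):

```shell
# List osd lines sorted by PG count, highest first.
# Assumption: PGS is the second-to-last field; check the header of
# `ceph osd df tree` on your version before relying on this.
ceph osd df tree | awk '/osd\./ {print $(NF-1), $NF}' | sort -rn | head
```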

h2. Adding a new disk/ssd to the ceph cluster

h3. For Dell servers

First find the disk and then add it to the operating system:

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [enclosure:slot] -aX # X is the adapter number (host is 0, external array is 1)

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* /opt/ungleich-tools/ceph-osd-stop-disable does the following:
** Stop the osd and remove the monit configuration on the server you want to take the disk out of
** umount the disk
* Take the disk out
* Discard the preserved cache on the server you took it out of
** using megacli
* Insert it into the new server
* Clear the foreign configuration
** using megacli
* The disk will now appear in the OS; ceph/udev will automatically start the OSD (!)
** No creation of the osd required!
* Verify that the disk exists and that the osd is started
** using *ps aux*
** using *ceph osd tree*
* */opt/ungleich-tools/monit-ceph-create-start osd.XX* # where osd.XX is the osd + number
** Creates the monit configuration file so that monit watches the OSD
** Reload monit
* Verify monit using *monit status*
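
The two megacli steps above can be sketched as follows. This is a sketch: the virtual drive target @-L10@ and the adapter number @-a0@ are placeholders and must be taken from the output of the corresponding list commands:

```shell
# On the OLD server: list the preserved cache left behind by the
# removed disk, then discard it.
megacli -GetPreservedCacheList -a0
megacli -DiscardPreservedCache -L10 -a0   # replace L10 with the VD reported above

# On the NEW server: scan for and clear the foreign configuration
# so the inserted disk becomes usable.
megacli -CfgForeign -Scan -a0
megacli -CfgForeign -Clear -a0
```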

h2. Removing a disk/ssd

To permanently remove a failed disk from a cluster, use ***ceph-osd-stop-remove-permanently*** from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.
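
A typical invocation looks like this (sketch; 23 is a placeholder osd id, taken from *ceph osd tree*):

```shell
# Stop osd.23 and remove it from the cluster permanently;
# rebalancing of its data starts as soon as it is removed.
/opt/ungleich-tools/ceph-osd-stop-remove-permanently 23

# Afterwards, watch the rebalance progress:
ceph -s
```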

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is then highly likely that the disk/ssd is broken. Steps that need to be done:

* Login to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Login to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add the (partial) output of the above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether there is warranty on it; if yes:
*** Create a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it
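
The diagnostic step above can be run in one go (replace XX with the osd id from the monit alert):

```shell
# If the filesystem is broken, the ls usually fails or shows an
# empty/unreadable directory ...
ls -la /var/lib/ceph/osd/ceph-XX
# ... and the kernel log shows the corresponding I/O errors.
dmesg | tail -n 50
```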

h2. Change ceph speed for i/o recovery

By default we want to keep the I/O recovery traffic low so as not to impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*; the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and to change the *number of threads* used for recovery, we can run the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas Y=10 and X=10 increases recovery performance about 5 times.
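
Once the cluster is healthy again, revert to the defaults listed above so that regular I/O regains priority:

```shell
# Restore the conservative defaults after recovery has finished.
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
```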

h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a ***ceph pg repair <number>*** fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
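
A typical repair session might look like this (sketch; the pg id 2.10d is a placeholder taken from the health output):

```shell
# Find the affected pg, e.g. "pg 2.10d is active+clean+inconsistent"
ceph health detail
# Instruct the primary osd of that pg to repair it ...
ceph pg repair 2.10d
# ... and verify that the cluster returns to HEALTH_OK.
ceph -s
```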

h2. Move servers into the osd tree

New servers have their buckets placed outside the **default root** and thus need to be moved inside.
Output might look as follows:

<pre>
[11:19:27] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -3           0.87270 host server5                             
 41     ssd   0.87270     osd.41           up  1.00000 1.00000 
 -1         251.85580 root default                             
 -7          81.56271     host server2                         
  0 hdd-big   9.09511         osd.0        up  1.00000 1.00000 
  5 hdd-big   9.09511         osd.5        up  1.00000 1.00000 
...
</pre>

Use **ceph osd crush move serverX root=default** (where serverX is the new server),
which will move the bucket into the right place:

<pre>
[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID  CLASS   WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF 
 -1         252.72850 root default                             
...
 -3           0.87270     host server5                         
 41     ssd   0.87270         osd.41       up  1.00000 1.00000 
</pre>

h2. How to fix existing osds with wrong partition layout

In the first version of DCL we used a filestore-based layout with 3 partitions.
In the second version of DCL, including OSD autodetection, we use a bluestore-based layout with 2 partitions.

To convert, we delete the old OSD, clean the partitions and create a new OSD:

h3. Inactive OSD

If the OSD is *not active*, we can do the following:

* Find the OSD number: mount the partition and read the whoami file

<pre>
root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami 
0
root@server2:/opt/ungleich-tools# umount  /mnt/
</pre>

* Verify in the *ceph osd tree* output that the OSD is on that server
* Delete the OSD
** ceph osd crush remove $osd_name
** ceph osd rm $osd_name

Then continue below as described in "Recreating the OSD".

h3. Remove Active OSD

* Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD
* Then continue below as described in "Recreating the OSD".

h3. Recreating the OSD

* Create an empty partition table
** fdisk /dev/sdX
** g
** w
* Create a new OSD
** /opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS # use hdd, ssd, ... for the CLASS
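
The interactive fdisk steps above (g = create a new empty GPT label, w = write it) can also be scripted. A sketch, assuming /dev/sdX is the disk to be converted (this destroys all data on it):

```shell
# DANGER: wipes the partition table of /dev/sdX.
# 'g' creates a new empty GPT partition table, 'w' writes it to disk.
printf 'g\nw\n' | fdisk /dev/sdX

# Then recreate the OSD with the appropriate device class (e.g. hdd):
/opt/ungleich-tools/ceph-osd-create-start /dev/sdX hdd
```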

h2. How to fix unfound pg

Refer to https://redmine.ungleich.ch/issues/6388

* Check the health state
** ceph health detail
* Check which server has that osd
** ceph osd tree
* Check which VM is running on that server
** virsh list
* Check the pg map
** ceph osd map [osd pool] [VMID]
* Revert the pg
** ceph pg [PGID] mark_unfound_lost revert
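
Put together, a session might look like this (sketch; the pg id 2.4 and the bracketed pool/VMID values are placeholders):

```shell
ceph health detail                     # find the pg with unfound objects, e.g. 2.4
ceph osd tree                          # find the server holding the affected osd
virsh list                             # on that server: see which VMs are impacted
ceph osd map [osd pool] [VMID]         # map the VM image to its pg
ceph pg 2.4 mark_unfound_lost revert   # last resort: revert the unfound objects
```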