The ungleich ceph handbook » History » Version 4

Nico Schottelius, 10/23/2018 08:18 PM

h1. The ungleich ceph handbook

{{toc}}

h2. Status

This document is **WORK IN PROGRESS**.

h2. Introduction

This article describes the ungleich storage architecture, which is based on ceph. It covers our architecture as well as maintenance commands. Required for

h2. Communication guide

Usually when a disk fails, no customer communication is necessary, as ceph compensates for the failure and rebalances automatically. However, if multiple disks fail at the same time, I/O speed might be reduced and customer experience impacted.

For this reason, communicate whenever I/O recovery settings are temporarily tuned.

h2. Adding a new disk/ssd to the ceph cluster

h3. For Dell servers

First find the disk and then add it to the operating system

<pre>
megacli -PDList -aALL  | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~#  megacli -PDList -aALL  | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
</pre>

Then add the disk to the OS:

<pre>
megacli -CfgLdAdd -r0 [enclosure:slot] -aX

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0
</pre>

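The [enclosure:slot] argument can be derived from the PDList output instead of being read off by hand. The following is a minimal sketch, not part of the ungleich tooling: the function name pd_addresses is hypothetical, and it assumes the output has the same field names as the sample output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ungleich-tools): read "megacli -PDList -aALL"
# output on stdin and print one [enclosure:slot] address per disk whose
# firmware state is "Unconfigured(good)".
# Assumption: field names match the sample output shown in this handbook.
pd_addresses() {
    awk -F': *' '
        # An enclosure of N/A becomes the empty string, giving [:slot]
        /^Enclosure Device ID:/ { enc = ($2 == "N/A") ? "" : $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state: Unconfigured\(good\)/ { printf "[%s:%s]\n", enc, slot }
    '
}
```

Possible usage: run *megacli -PDList -aALL | pd_addresses* and pass the printed address to *megacli -CfgLdAdd -r0*. The adapter number (-aX) still has to be chosen by hand.
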
h2. Moving a disk/ssd to another server

(needs to be described better)

Generally speaking:

* Stop the OSD and remove its monit configuration on the server you want to take the disk out of
* Take the disk out
* Discard the preserved cache on the server you took it out of
* Insert the disk into the new server
* Clear the foreign configuration (megacli)
* The disk will now appear in the OS; ceph/udev will automatically start the OSD
* Create the monit configuration file so that monit watches the OSD
* Reload monit

h2. Removing a disk/ssd

h2. Handling DOWN osds with filesystem errors

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is very likely that the disk/ssd is broken. Steps that need to be done:

* Log in to any ceph monitor (cephX.placeY.ungleich.ch)
* Check **ceph -s**, find the host using **ceph osd tree**
* Log in to the affected host
* Run the following commands:
** ls /var/lib/ceph/osd/ceph-XX
** dmesg
* Create a new ticket in the datacenter light project
** Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
** Add (partial) output of above commands
** Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
** Remove the physical disk from the host and check whether it is still under warranty. If yes:
*** Write a short letter to the vendor, including the technical details from above
*** Record when you sent it in
*** Put the ticket into status waiting
** If there is no warranty, dispose of it

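The diagnostic part of the steps above can be collected in one go for pasting into the ticket. This is only a sketch: the function names ticket_subject and osd_diagnostics are hypothetical, and limiting dmesg to the last 50 lines is an assumption made here to keep the ticket readable.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ungleich-tools).

# Print the ticket subject for OSD id $1 on host $2.
ticket_subject() {
    printf 'Replace broken OSD.%s on %s\n' "$1" "$2"
}

# Print the output of the two diagnostic commands for OSD id $1.
# Errors are kept in the output, since a failing "ls" is itself the symptom.
osd_diagnostics() {
    osd_id=$1
    echo "== ls /var/lib/ceph/osd/ceph-${osd_id} =="
    ls "/var/lib/ceph/osd/ceph-${osd_id}" 2>&1 || true
    # Assumption: the last 50 kernel messages are enough for the ticket
    echo "== dmesg (last 50 lines) =="
    dmesg 2>&1 | tail -n 50 || true
}
```

For example, *ticket_subject 12 server7.place6.ungleich.ch* prints the subject line, and *osd_diagnostics 12* prints the text to attach. Neither function changes anything on the host.
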
h2. Change ceph speed for i/o recovery

By default we want to keep I/O recovery traffic low to not impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

<pre>
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2
</pre>

The important settings are *osd max backfills* and *osd recovery max active*. The priority is always kept low so that regular I/O has priority.

To adjust the number of backfills *per osd* and the *number of active recovery requests* per osd, we can use the following on any node with the admin keyring:

<pre>
ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'
</pre>

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.

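Since these settings are tuned temporarily and later reset, a small wrapper can help avoid typos. The following is a sketch under assumptions: the function name set_recovery_speed and the CEPH variable are hypothetical. It only wraps the two injectargs calls shown above and quotes osd.* so that the shell cannot expand it against file names in the current directory.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of ungleich-tools).
# CEPH can be set to "echo" for a dry run that only prints the arguments.
CEPH=${CEPH:-ceph}

# Usage: set_recovery_speed <max-backfills> <recovery-max-active>
set_recovery_speed() {
    backfills=$1
    active=$2
    # Accept only non-empty, purely numeric values
    for value in "$backfills" "$active"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "usage: set_recovery_speed <max-backfills> <recovery-max-active>" >&2
                return 1
                ;;
        esac
    done
    $CEPH tell 'osd.*' injectargs "--osd-max-backfills $backfills"
    $CEPH tell 'osd.*' injectargs "--osd-recovery-max-active $active"
}
```

With this, *set_recovery_speed 5 5* speeds up recovery and *set_recovery_speed 1 1* returns to the defaults from the configuration shown above.
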
h2. Debug scrub errors / inconsistent pg message

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use *ceph health detail* to find out which placement groups (*pgs*) are affected. Usually a *ceph pg repair <number>* fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
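
Finding the affected placement groups can be scripted. The sketch below uses the hypothetical function names inconsistent_pgs and repair_commands. It assumes that *ceph health detail* reports damaged pgs on lines of the form "pg <id> is <state>, acting [...]", which may differ between ceph releases. It prints the repair commands instead of running them, so that they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ungleich-tools).

# Read "ceph health detail" output on stdin and print the ids of all
# placement groups whose state contains "inconsistent".
# Assumption: such lines look like "pg 17.1c1 is active+clean+inconsistent, ...".
inconsistent_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}

# Print (do not run) one repair command per inconsistent pg.
repair_commands() {
    inconsistent_pgs | while read -r pg; do
        printf 'ceph pg repair %s\n' "$pg"
    done
}
```

Possible usage: *ceph health detail | repair_commands*, then run the printed commands one at a time.
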