.\"/*
.\" * Copyright (c) 2009-2010 Red Hat, Inc.
.\" *
.\" * All rights reserved.
.\" *
.\" * Author: Jan Friesse (jfriesse@redhat.com)
.\" * Author: Steven Dake (sdake@redhat.com)
.\" *
.\" * This software licensed under BSD license, the text of which follows:
.\" *
.\" * Redistribution and use in source and binary forms, with or without
.\" * modification, are permitted provided that the following conditions are met:
.\" *
.\" * - Redistributions of source code must retain the above copyright notice,
.\" *   this list of conditions and the following disclaimer.
.\" * - Redistributions in binary form must reproduce the above copyright notice,
.\" *   this list of conditions and the following disclaimer in the documentation
.\" *   and/or other materials provided with the distribution.
.\" * - Neither the name of the Red Hat, Inc. nor the names of its
.\" *   contributors may be used to endorse or promote products derived from this
.\" *   software without specific prior written permission.
.\" *
.\" * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
.\" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
.\" * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
.\" * THE POSSIBILITY OF SUCH DAMAGE.
.\" */
.TH "SAM_OVERVIEW" 3 "21/05/2010" "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"

.SH NAME
.P
sam_overview \- Overview of the Simple Availability Manager

.SH OVERVIEW
.P
The SAM library provides a tool to check the health of an application.
The main purpose of SAM is to restart a local process when it fails to respond
to a healthcheck request in a configured time interval.

.P
During \fBsam_initialize(3)\fR, a duplicate copy of the process is created using
the \fBfork(2)\fR system call.  This duplicate process contains the logic
for executing the SAM server.  The SAM server is responsible for requesting
healthchecks from the active process, and for controlling the lifecycle of the
active process when it fails.  If the active process fails to respond to the
healthcheck request sent by the SAM server, it is sent a user-configurable
signal (default SIGTERM) to request shutdown of the application.  After a
configured time interval, the process is forcibly killed with a SIGKILL signal.
Once the active process terminates, the SAM server creates a new active process.
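.P
The warn-then-kill sequence described above can be modelled with plain POSIX
calls.  The following is a minimal sketch, not the SAM server implementation;
the helper names \fIspawn_child\fR and \fIstop_child\fR and the 10 ms polling
step are assumptions made for illustration.
.nf
```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that only sleeps. If ignore_warn is
 * nonzero the child inherits SIG_IGN for warn_signal, so only SIGKILL
 * can stop it. Returns the child pid, or -1 on error. */
static pid_t spawn_child(int warn_signal, int ignore_warn)
{
    void (*old)(int) = SIG_DFL;
    pid_t pid;

    if (ignore_warn) {
        /* Set the disposition before fork so the child never sees a
         * window in which the warning signal could still kill it. */
        old = signal(warn_signal, SIG_IGN);
        if (old == SIG_ERR)
            return -1;
    }
    pid = fork();
    if (pid == 0) {
        for (;;)
            pause();
    }
    if (ignore_warn)
        signal(warn_signal, old);   /* restore the parent disposition */
    return pid;
}

/* Send warn_signal, wait up to grace_ms for the child to exit, then fall
 * back to SIGKILL. Returns the signal that ended the child, 0 if it
 * exited normally, or -1 on error. */
static int stop_child(pid_t pid, int warn_signal, int grace_ms)
{
    struct timespec step = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int status;
    int waited = 0;

    if (kill(pid, warn_signal) != 0)
        return -1;
    while (waited < grace_ms) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
        if (r < 0)
            return -1;
        nanosleep(&step, NULL);
        waited += 10;
    }
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```
.fi
.P
A supervisor built this way would call \fIstop_child\fR when a healthcheck is
missed and then fork a replacement process.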

.P
The Simple Availability Manager is meant to be used in conjunction with the
cpg service.  Used together, they make it possible to restart a cpg process
that fails healthchecking during operation.

.P
The main features of SAM include:

.RS
.IP \(bu 3
A configurable recovery policy.
.IP \(bu 3
A configurable time interval for health check operations.
.IP \(bu 3
A notification via signal before recovery action is taken.
.IP \(bu 3
A mechanism to indicate to the application the number of times an active
process has been created by the SAM server.
.IP \(bu 3
Both application driven health checking and event driven health checking.
.RE

.SH Initializing SAM
.P
The SAM library is initialized by \fBsam_initialize(3)\fR.
\fBsam_initialize(3)\fR may only be called once per process.  Calling it more
than once has undefined results and is neither recommended nor tested.

.SH Setting warning callback
.P
A user-configurable signal (default \fISIGTERM\fR) is sent to the application
when a recovery action is planned.  The application can install a handler with
\fBsignal(3)\fR to monitor for this signal.

.P
There are no special constraints on which SAM APIs may be called in a warning
callback.  After \fItime_interval\fR expires, a SIGKILL signal is sent to the
active process to force its termination.

.SH Registering the active process
.P
The active process is registered with SAM by calling \fBsam_register(3)\fR.
This function should be called only once per process.  After a recovery
action is taken, the new active process begins execution at the line of code
that follows the call to \fBsam_register(3)\fR in the user process.

.SH Enabling event driven healthchecking
.P
Two types of healthchecking are available to the user.  The first model is one
where the user application healthchecks during its normal operation.  It is
never requested to healthcheck, and if the active process doesn't respond within
the time interval, the process will be restarted.

.P
A more useful mechanism for healthchecking is event driven healthchecking.
Because this model is directed by the SAM server, it isn't necessary to guess
or add timers to the active process to signal that a healthcheck operation is
successful.  To use event driven healthchecking,
the \fBsam_hc_callback_register(3)\fR function should be executed.
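.P
The event driven model can be pictured as a poller that invokes an
application callback once per interval and reacts to the first failure.
The following is a sketch under stated assumptions, not the SAM server code:
the type \fIhc_callback_fn\fR, the function \fIfirst_failed_round\fR, the
example callback, and the convention that a return value of 0 means healthy
are all chosen here for illustration.
.nf
```c
#include <stddef.h>

/* Hypothetical callback type: returns 0 when healthy (an assumption). */
typedef int (*hc_callback_fn)(void);

/* Model of the polling side: call cb up to max_rounds times and return
 * the 1-based round in which it first reported a failure, or 0 if every
 * round was healthy. A real server would sleep for the configured
 * interval between rounds and would also treat a missing reply as a
 * failure. */
static int first_failed_round(hc_callback_fn cb, int max_rounds)
{
    int round;

    if (cb == NULL)
        return 0;
    for (round = 1; round <= max_rounds; round++) {
        if (cb() != 0)
            return round;
    }
    return 0;
}

/* Example application state and callback: the process counts as
 * unhealthy once its work queue grows past 100 entries. */
static int queue_depth = 0;

static int queue_hc(void)
{
    queue_depth += 40;          /* stand-in for work piling up */
    return queue_depth > 100;   /* nonzero means unhealthy */
}
```
.fi
.P
With this shape the application only answers the question "am I healthy
now?" and needs no timer of its own.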

.SH Quorum integration
.P
SAM has special policies (\fISAM_RECOVERY_POLICY_QUIT\fR and \fISAM_RECOVERY_POLICY_RESTART\fR)
for integration with the quorum service.  These policies change SAM behaviour in two respects:
.RS
.IP \(bu 3
A call to \fBsam_start(3)\fR blocks until corosync becomes quorate.
.IP \(bu 3
The user-selected recovery action is taken immediately after loss of quorum.
.RE

.SH Storing user data
.P
Sometimes an application needs to store data that survives from one instance
of the process to the next.  Files or databases can be used for this, but a
much simpler in-memory solution is provided by the \fBsam_data_store(3)\fR,
\fBsam_data_restore(3)\fR and \fBsam_data_getsize(3)\fR functions.

.SH Confdb integration
.P
SAM has a policy flag for confdb system integration (\fISAM_RECOVERY_POLICY_CONFDB\fR).
If a process is registered with this flag, a new confdb object PROCESS_NAME:PID is
created with the following keys:
.RS
.IP \(bu 3
\fIrecovery\fR - quit or restart, depending on the policy.
.IP \(bu 3
\fIpoll_period\fR - period of health checking, in milliseconds.
.IP \(bu 3
\fIlast_updated\fR - timestamp (in nanoseconds) of the last health check.
.IP \(bu 3
\fIstate\fR - state of the process (one of registered, started, failed, waiting for quorum).
.RE

.P
The object is automatically deleted if the process exits after health checking has been stopped.

.P
Confdb integration with the corosync watchdog can be used in an implicit or an explicit way.

.P
The implicit way is achieved by setting the recovery policy to QUIT and letting the process
exit while health checking is still started.  In that case the object is not deleted and the
corosync watchdog takes the required action.

.P
The explicit way is useful when the developer can handle some non-fatal failures of the application.
This mode is achieved by setting the policy to RESTART and using SAM in the same way as without confdb
integration.  If a real failure must be reported (for example, too many restarts in total or per second),
the application can call \fBsam_mark_failed(3)\fR and let the corosync watchdog take the required action.
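.P
One way to decide that "too many restarts" have happened is a small restart
budget.  The sketch below is an illustration only: \fIstruct restart_budget\fR,
\fIrestart_budget_exceeded\fR and the window size of 8 are hypothetical and not
part of SAM.  Where the function returns 1, an application using the explicit
mode would call \fBsam_mark_failed(3)\fR.
.nf
```c
#include <stddef.h>

#define RESTART_WINDOW 8   /* hypothetical: remember the last 8 restarts */

struct restart_budget {
    long stamps[RESTART_WINDOW];   /* restart times, in seconds */
    size_t count;                  /* total restarts recorded so far */
    size_t max_in_window;          /* allowed restarts per window; must be
                                      below RESTART_WINDOW to ever trigger */
    long window_sec;               /* length of the sliding window */
};

/* Record a restart at now_sec. Returns 1 if more than max_in_window
 * restarts, including this one, fall within the last window_sec seconds,
 * otherwise 0. */
static int restart_budget_exceeded(struct restart_budget *b, long now_sec)
{
    size_t i;
    size_t kept;
    size_t recent = 0;

    /* Overwrite the oldest slot once the ring is full. */
    b->stamps[b->count % RESTART_WINDOW] = now_sec;
    b->count++;

    kept = b->count < RESTART_WINDOW ? b->count : RESTART_WINDOW;
    for (i = 0; i < kept; i++) {
        if (now_sec - b->stamps[i] < b->window_sec)
            recent++;
    }
    return recent > b->max_in_window;
}
```
.fi
.P
The budget itself has to outlive each restart, so in practice it would be kept
in storage that survives between instances.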

.SH BUGS
.SH "SEE ALSO"
.BR sam_initialize (3),
.BR sam_data_getsize (3),
.BR sam_data_restore (3),
.BR sam_data_store (3),
.BR sam_finalize (3),
.BR sam_mark_failed (3),
.BR sam_start (3),
.BR sam_stop (3),
.BR sam_register (3),
.BR sam_warn_signal_set (3),
.BR sam_hc_send (3),
.BR sam_hc_callback_register (3)