1840 lines
68 KiB
HTML
1840 lines
68 KiB
HTML
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
|
|
"https://www.w3.org/TR/html4/loose.dtd">
|
|
|
|
<html>
|
|
|
|
<head>
|
|
|
|
<title>Postfix Queue Scheduler</title>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
|
<link rel='stylesheet' type='text/css' href='postfix-doc.css'>
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
|
|
Queue Scheduler</h1>
|
|
|
|
<hr>
|
|
|
|
<h2> Disclaimer </h2>
|
|
|
|
<p> Many of the <i>transport</i>-specific configuration parameters
|
|
discussed in this document will not show up in "postconf" command
|
|
output before Postfix version 2.9. This limitation applies to many
|
|
parameters whose name is a combination of a master.cf service name
|
|
such as "relay" and a built-in suffix such as
|
|
"_destination_concurrency_limit". </p>
|
|
|
|
<h2> Overview </h2>
|
|
|
|
<p> The queue manager is by far the most complex part of the Postfix
|
|
mail system. It schedules delivery of new mail, retries failed
|
|
deliveries at specific times, and removes mail from the queue after
|
|
the last delivery attempt. There are two major classes of mechanisms
|
|
that control the operation of the queue manager. </p>
|
|
|
|
<p> Topics covered by this document: </p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
|
|
with the number of concurrent deliveries to a specific destination,
|
|
including decisions on when to suspend deliveries after persistent
|
|
failures.
|
|
|
|
<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
|
|
the selection of email messages and recipients for a given destination.
|
|
|
|
<li> <a href="#credits"> Credits</a>, something this document would not be
|
|
complete without.
|
|
|
|
</ul>
|
|
|
|
<!--
|
|
|
|
<p> Once started, the qmgr(8) process runs until "postfix reload"
|
|
or "postfix stop". As a persistent process, the queue manager has
|
|
to meet strict requirements with respect to code correctness and
|
|
robustness. Unlike non-persistent daemon processes, the queue manager
|
|
cannot benefit from Postfix's process rejuvenation mechanism that
|
|
limit the impact from resource leaks and other coding errors
|
|
(translation: replacing a process after a short time covers up bugs
|
|
before they can become a problem). </p>
|
|
|
|
-->
|
|
|
|
<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
|
|
|
|
<p> The following sections document the Postfix 2.5 concurrency
|
|
scheduler, after a discussion of the limitations of the earlier
|
|
concurrency scheduler. This is followed by results of medium-concurrency
|
|
experiments, and a discussion of trade-offs between performance and
|
|
robustness. </p>
|
|
|
|
<p> The material is organized as follows: </p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
|
|
concurrency scheduler </a>
|
|
|
|
<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
|
|
concurrency feedback algorithm </a>
|
|
|
|
<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
|
|
destination" detection algorithm </a>
|
|
|
|
<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
|
|
concurrency scheduler </a>
|
|
|
|
<li> <a href="#concurrency_results"> Results for delivery to
|
|
concurrency limited servers </a>
|
|
|
|
<li> <a href="#concurrency_discussion"> Discussion of concurrency
|
|
limited server results </a>
|
|
|
|
<li> <a href="#concurrency_limitations"> Limitations of less-than-1
|
|
per delivery feedback </a>
|
|
|
|
<li> <a href="#concurrency_config"> Concurrency configuration
|
|
parameters </a>
|
|
|
|
</ul>
|
|
|
|
<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
|
|
concurrency scheduler </a> </h3>
|
|
|
|
<p> From the start, Postfix has used a simple but robust algorithm
|
|
where the per-destination delivery concurrency is decremented by 1
|
|
after delivery failed due to connection or handshake failure, and
|
|
incremented by 1 otherwise. Of course the concurrency is never
|
|
allowed to exceed the maximum per-destination concurrency limit.
|
|
And when a destination's concurrency level drops to zero, the
|
|
destination is declared "dead" and delivery is suspended. </p>
|
|
|
|
<p> Drawbacks of +/-1 concurrency feedback per delivery are: <p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> Overshoot due to exponential delivery concurrency growth
|
|
with each pseudo-cohort(*). This can be an issue with high-concurrency
|
|
channels. For example, with the default initial concurrency of 5,
|
|
concurrency would proceed over time as (5-10-20). </p>
|
|
|
|
<li> <p> Throttling down to zero concurrency after a single
|
|
pseudo-cohort(*) failure. This was especially an issue with
|
|
low-concurrency channels where a single failure could be sufficient
|
|
to mark a destination as "dead", causing the suspension of further
|
|
deliveries to the affected destination. </p>
|
|
|
|
</ul>
|
|
|
|
<p> (*) A pseudo-cohort is a number of delivery requests equal to
|
|
a destination's delivery concurrency. </p>
|
|
|
|
<p> The revised concurrency scheduler has a highly modular structure.
|
|
It uses separate mechanisms for per-destination concurrency control
|
|
and for "dead destination" detection. The concurrency control in
|
|
turn is built from two separate mechanisms: it supports less-than-1
|
|
feedback per delivery to allow for more gradual concurrency
|
|
adjustments, and it uses feedback hysteresis to suppress concurrency
|
|
oscillations. And instead of waiting for delivery concurrency to
|
|
throttle down to zero, a destination is declared "dead" after a
|
|
configurable number of pseudo-cohorts reports connection or handshake
|
|
failure. </p>
|
|
|
|
<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
|
|
concurrency feedback algorithm </a> </h3>
|
|
|
|
<p> We want to increment a destination's delivery concurrency when
|
|
some (not necessarily consecutive) number of deliveries complete
|
|
without connection or handshake failure. This is implemented with
|
|
positive feedback g(N) where N is the destination's delivery
|
|
concurrency. With g(N)=1 feedback per delivery, concurrency increases
|
|
by 1 after each positive feedback event; this gives us the old
|
|
scheduler's exponential growth in time. With g(N)=1/N feedback per
|
|
delivery, concurrency increases by 1 after an entire pseudo-cohort
|
|
N of positive feedback reports; this gives us linear growth in time.
|
|
Less-than-1 feedback per delivery and integer truncation naturally
|
|
give us hysteresis, so that transitions to larger concurrency happen
|
|
every 1/g(N) positive feedback events. </p>
|
|
|
|
<p> We want to decrement a destination's delivery concurrency when
|
|
some (not necessarily consecutive) number of deliveries complete
|
|
after connection or handshake failure. This is implemented with
|
|
negative feedback f(N) where N is the destination's delivery
|
|
concurrency. With f(N)=1 feedback per delivery, concurrency decreases
|
|
by 1 after each negative feedback event; this gives us the old
|
|
scheduler's behavior where concurrency is throttled down dramatically
|
|
after a single pseudo-cohort failure. With f(N)=1/N feedback per
|
|
delivery, concurrency backs off more gently. Again, less-than-1
|
|
feedback per delivery and integer truncation naturally give us
|
|
hysteresis, so that transitions to lower concurrency happen every
|
|
1/f(N) negative feedback events. </p>
|
|
|
|
<p> However, with negative feedback we introduce a subtle twist.
|
|
We "reverse" the negative hysteresis cycle so that the transition
|
|
to lower concurrency happens at the <b>beginning</b> of a sequence
|
|
of 1/f(N) negative feedback events. Otherwise, a correction for
|
|
overload would be made too late. This makes the choice of f(N)
|
|
relatively unimportant, as borne out by measurements later in this
|
|
document. </p>
|
|
|
|
<p> In summary, the main ingredients for the Postfix 2.5 concurrency
|
|
feedback algorithm are a) the option of less-than-1 positive feedback
|
|
per delivery to avoid overwhelming servers, b) the option of
|
|
less-than-1 negative feedback per delivery to avoid giving up too
|
|
fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
|
|
"reverse" hysteresis cycle for negative feedback, so that it can
|
|
correct for overload quickly. </p>
|
|
|
|
<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>
|
|
|
|
<p> We want to suspend deliveries to a specific destination after
|
|
some number of deliveries suffers connection or handshake failure.
|
|
The old scheduler declares a destination "dead" when negative (-1)
|
|
feedback throttles the delivery concurrency down to zero. With
|
|
less-than-1 feedback per delivery, this throttling down would
|
|
obviously take too long. We therefore have to separate "dead
|
|
destination" detection from concurrency feedback. This is implemented
|
|
by introducing the concept of pseudo-cohort failure. The Postfix
|
|
2.5 concurrency scheduler declares a destination "dead" after a
|
|
configurable number of pseudo-cohorts suffers from connection or
|
|
handshake failures. The old scheduler corresponds to the special
|
|
case where the pseudo-cohort failure limit is equal to 1. </p>
|
|
|
|
<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>
|
|
|
|
<p> The pseudo code shows how the ideas behind new concurrency
|
|
scheduler are implemented as of November 2007. The actual code can
|
|
be found in the module qmgr/qmgr_queue.c. </p>
|
|
|
|
<pre>
|
|
Types:
|
|
Each destination has one set of the following variables
|
|
int concurrency
|
|
double success
|
|
double failure
|
|
double fail_cohorts
|
|
|
|
Feedback functions:
|
|
N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
|
|
positive feedback: g(N) = x/N | x/sqrt(N) | x
|
|
negative feedback: f(N) = y/N | y/sqrt(N) | y
|
|
|
|
Initialization:
|
|
concurrency = initial_concurrency
|
|
success = 0
|
|
failure = 0
|
|
fail_cohorts = 0
|
|
|
|
After success:
|
|
fail_cohorts = 0
|
|
Be prepared for feedback > hysteresis, or rounding error
|
|
success += g(concurrency)
|
|
while (success >= 1) Hysteresis 1
|
|
concurrency += 1 Hysteresis 1
|
|
failure = 0
|
|
success -= 1 Hysteresis 1
|
|
Be prepared for overshoot
|
|
if (concurrency > concurrency limit)
|
|
concurrency = concurrency limit
|
|
|
|
Safety:
|
|
Don't apply positive feedback unless
|
|
concurrency < busy_refcount + init_dest_concurrency
|
|
otherwise negative feedback effect could be delayed
|
|
|
|
After failure:
|
|
if (concurrency > 0)
|
|
fail_cohorts += 1.0 / concurrency
|
|
if (fail_cohorts > cohort_failure_limit)
|
|
concurrency = 0
|
|
if (concurrency > 0)
|
|
Be prepared for feedback > hysteresis, rounding errors
|
|
failure -= f(concurrency)
|
|
while (failure < 0)
|
|
concurrency -= 1 Hysteresis 1
|
|
failure += 1 Hysteresis 1
|
|
success = 0
|
|
Be prepared for overshoot
|
|
if (concurrency < 1)
|
|
concurrency = 1
|
|
</pre>
|
|
|
|
<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>
|
|
|
|
<p> Discussions about the concurrency scheduler redesign started
|
|
early 2004, when the primary goal was to find alternatives that did
|
|
not exhibit exponential growth or rapid concurrency throttling. No
|
|
code was implemented until late 2007, when the primary concern had
|
|
shifted towards better handling of server concurrency limits. For
|
|
this reason we measure how well the new scheduler does this
|
|
job. The table below compares mail delivery performance of the old
|
|
+/-1 feedback per delivery with several less-than-1 feedback
|
|
functions, for different limited-concurrency server scenarios.
|
|
Measurements were done with a FreeBSD 6.2 client and with FreeBSD
|
|
6.2 and various Linux servers. </p>
|
|
|
|
<p> Server configuration: </p>
|
|
|
|
<ul> <li> The mail flow was slowed down with 1 second latency per
|
|
recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
|
|
to make results less dependent on hardware details, by avoiding
|
|
slow-downs by queue file I/O, logging I/O, and network I/O.
|
|
|
|
<li> Concurrency was limited by the server process limit
|
|
("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
|
|
= static:all"). Postfix was stopped and started after changing the
|
|
process limit, because the same number is also used as the backlog
|
|
argument to the listen(2) system call, and "postfix reload" does
|
|
not re-issue this call.
|
|
|
|
<li> Mail was discarded with "local_recipient_maps = static:all" and
|
|
"local_transport = discard". The discard action in access maps or
|
|
header/body checks
|
|
could not be used as it fails to update the in_flow_delay counters.
|
|
|
|
</ul>
|
|
|
|
<p> Client configuration: </p>
|
|
|
|
<ul>
|
|
|
|
<li> Queue file overhead was minimized by sending one message to a
|
|
virtual alias that expanded into 2000 different remote recipients.
|
|
All recipients were accounted for according to the maillog file.
|
|
The virtual_alias_expansion_limit setting was increased to avoid
|
|
complaints from the cleanup(8) server.
|
|
|
|
<li> The number of deliveries was maximized with
|
|
"smtp_destination_recipient_limit = 2". A smaller limit would cause
|
|
Postfix to schedule the concurrency per recipient instead of domain,
|
|
which is not what we want.
|
|
|
|
<li> Maximum concurrency was limited with
|
|
"smtp_destination_concurrency_limit = 20", and
|
|
initial_destination_concurrency was set to the same value.
|
|
|
|
<li> The positive and negative concurrency feedback hysteresis was
|
|
1. Concurrency was incremented by 1 at the END of 1/feedback steps
|
|
of positive feedback, and was decremented by 1 at the START of
|
|
1/feedback steps of negative feedback.
|
|
|
|
<li> The SMTP client used the default 30s SMTP connect timeout and
|
|
300s SMTP greeting timeout.
|
|
|
|
</ul>
|
|
|
|
<h4> Impact of the 30s SMTP connect timeout </h4>
|
|
|
|
<p> The first results are for a FreeBSD 6.2 server, where our
|
|
artificially low listen(2) backlog results in a very short kernel
|
|
queue for established connections. The table shows that all deferred
|
|
deliveries failed due to a 30s connection timeout, and none failed
|
|
due to a server greeting timeout. This measurement simulates what
|
|
happens when the server's connection queue is completely full under
|
|
load, and the TCP engine drops new connections. </p>
|
|
|
|
<blockquote>
|
|
|
|
<table>
|
|
|
|
<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
|
|
style</th> <th>connection<br> caching</th> <th>percentage<br>
|
|
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
|
|
<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
|
|
|
|
<tr> <td align="center" colspan="9"> <hr> </td> </tr>
|
|
|
|
<tr><td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/N</td> <td align="center">no</td> <td
|
|
align="center">9.9</td> <td align="center">19.4</td> <td
|
|
align="center">0.49</td> <td align="center">198</td> <td
|
|
align="center">-</td> </tr>
|
|
|
|
<tr><td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/N</td> <td align="center">yes</td> <td
|
|
align="center">10.3</td> <td align="center">19.4</td> <td
|
|
align="center">0.49</td> <td align="center">206</td> <td
|
|
align="center">-</td> </tr>
|
|
|
|
<tr><td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/sqrt(N)</td> <td align="center">no</td>
|
|
<td align="center">10.4</td> <td align="center">19.6</td> <td
|
|
align="center">0.59</td> <td align="center">208</td> <td
|
|
align="center">-</td> </tr>
|
|
|
|
<tr><td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/sqrt(N)</td> <td align="center">yes</td>
|
|
<td align="center">10.6</td> <td align="center">19.6</td> <td
|
|
align="center">0.61</td> <td align="center">212</td> <td
|
|
align="center">-</td> </tr>
|
|
|
|
<tr><td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1</td> <td align="center">no</td> <td
|
|
align="center">10.1</td> <td align="center">19.5</td> <td
|
|
align="center">1.29</td> <td align="center">202</td> <td
|
|
align="center">-</td> </tr>
|
|
|
|
<tr><td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1</td> <td align="center">yes</td> <td
|
|
align="center">10.8</td> <td align="center">19.3</td> <td
|
|
align="center">1.57</td> <td align="center">216</td> <td
|
|
align="center">-</td> </tr>
|
|
|
|
<tr> <td align="center" colspan="9"> <hr> </td> </tr>
|
|
|
|
</table>
|
|
|
|
<p> A busy server with a completely full connection queue. N is
|
|
the client delivery concurrency. Failed deliveries time out after
|
|
30s without completing the TCP handshake. See text for a discussion
|
|
of results. </p>
|
|
|
|
</blockquote>
|
|
|
|
<h4> Impact of the 300s SMTP greeting timeout </h4>
|
|
|
|
<p> The next table shows results for a Fedora Core 8 server (results
|
|
for RedHat 7.3 are identical). In this case, the artificially small
|
|
listen(2) backlog argument does not impact our measurement. The
|
|
table shows that practically all deferred deliveries fail after the
|
|
300s SMTP greeting timeout. As these timeouts were 10x longer than
|
|
with the first measurement, we increased the recipient count (and
|
|
thus the running time) by a factor of 10 to keep the results
|
|
comparable. The deferred mail percentages are a factor 10 lower
|
|
than with the first measurement, because the 1s per-recipient delay
|
|
was 1/300th of the greeting timeout instead of 1/30th of the
|
|
connection timeout. </p>
|
|
|
|
<blockquote>
|
|
|
|
<table>
|
|
|
|
<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
|
|
style</th> <th>connection<br> caching</th> <th>percentage<br>
|
|
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
|
|
<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
|
|
|
|
<tr> <td align="center" colspan="9"> <hr> </td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/N</td> <td align="center">no</td> <td
|
|
align="center">1.16</td> <td align="center">19.8</td> <td
|
|
align="center">0.37</td> <td align="center">-</td> <td
|
|
align="center">230</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/N</td> <td align="center">yes</td> <td
|
|
align="center">1.36</td> <td align="center">19.8</td> <td
|
|
align="center">0.36</td> <td align="center">-</td> <td
|
|
align="center">272</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/sqrt(N)</td> <td align="center">no</td>
|
|
<td align="center">1.21</td> <td align="center">19.9</td> <td
|
|
align="center">0.23</td> <td align="center">4</td> <td
|
|
align="center">238</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/sqrt(N)</td> <td align="center">yes</td>
|
|
<td align="center">1.36</td> <td align="center">20.0</td> <td
|
|
align="center">0.23</td> <td align="center">-</td> <td
|
|
align="center">272</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1</td> <td align="center">no</td> <td
|
|
align="center">1.18</td> <td align="center">20.0</td> <td
|
|
align="center">0.16</td> <td align="center">-</td> <td
|
|
align="center">236</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1</td> <td align="center">yes</td> <td
|
|
align="center">1.39</td> <td align="center">20.0</td> <td
|
|
align="center">0.16</td> <td align="center">-</td> <td
|
|
align="center">278</td> </tr>
|
|
|
|
<tr> <td align="center" colspan="9"> <hr> </td> </tr>
|
|
|
|
</table>
|
|
|
|
<p> A busy server with a non-full connection queue. N is the client
|
|
delivery concurrency. Failed deliveries complete at the TCP level,
|
|
but time out after 300s while waiting for the SMTP greeting. See
|
|
text for a discussion of results. </p>
|
|
|
|
</blockquote>
|
|
|
|
<h4> Impact of active server concurrency limiter </h4>
|
|
|
|
<p> The final concurrency-limited result shows what happens when
|
|
SMTP connections don't time out, but are rejected immediately with
|
|
the Postfix server's smtpd_client_connection_count_limit feature
|
|
(the server replies with a 421 status and disconnects immediately).
|
|
Similar results can be expected with concurrency limiting features
|
|
built into other MTAs or firewalls. For this measurement we specified
|
|
a server concurrency limit and a client initial destination concurrency
|
|
of 5, and a server process limit of 10; all other conditions were
|
|
the same as with the first measurement. The same result would be
|
|
obtained with a FreeBSD or Linux server, because the "pushing back"
|
|
is done entirely by the receiving side. </p>
|
|
|
|
<blockquote>
|
|
|
|
<table>
|
|
|
|
<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
|
|
style</th> <th>connection<br> caching</th> <th>percentage<br>
|
|
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
|
|
<th>theoretical<br>defer rate</th> </tr>
|
|
|
|
<tr> <td align="center" colspan="9"> <hr> </td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/N</td> <td align="center">no</td> <td
|
|
align="center">16.5</td> <td align="center">5.17</td> <td
|
|
align="center">0.38</td> <td align="center">1/6</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/N</td> <td align="center">yes</td> <td
|
|
align="center">16.5</td> <td align="center">5.17</td> <td
|
|
align="center">0.38</td> <td align="center">1/6</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/sqrt(N)</td> <td align="center">no</td>
|
|
<td align="center">24.5</td> <td align="center">5.28</td> <td
|
|
align="center">0.45</td> <td align="center">1/4</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1/sqrt(N)</td> <td align="center">yes</td>
|
|
<td align="center">24.3</td> <td align="center">5.28</td> <td
|
|
align="center">0.46</td> <td align="center">1/4</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1</td> <td align="center">no</td> <td
|
|
align="center">49.7</td> <td align="center">5.63</td> <td
|
|
align="center">0.67</td> <td align="center">1/2</td> </tr>
|
|
|
|
<tr> <td align="center">20</td> <td align="center">5</td> <td
|
|
align="center">1</td> <td align="center">yes</td> <td
|
|
align="center">49.7</td> <td align="center">5.68</td> <td
|
|
align="center">0.70</td> <td align="center">1/2</td> </tr>
|
|
|
|
<tr> <td align="center" colspan="9"> <hr> </td> </tr>
|
|
|
|
</table>
|
|
|
|
<p> A server with active per-client concurrency limiter that replies
|
|
with 421 and disconnects. N is the client delivery concurrency.
|
|
The theoretical defer rate is 1/(1+roundup(1/feedback)). This is
|
|
always 1/2 with the fixed +/-1 feedback per delivery; with the
|
|
concurrency-dependent feedback variants, the defer rate decreases
|
|
with increasing concurrency. See text for a discussion of results.
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>
|
|
|
|
<p> All results in the previous sections are based on the first
|
|
delivery runs only; they do not include any second etc. delivery
|
|
attempts. It's also worth noting that the measurements look at
|
|
steady-state behavior only. They don't show what happens when the
|
|
client starts sending at a much higher or lower concurrency.
|
|
</p>
|
|
|
|
<p> The first two examples show that the effect of feedback
|
|
is negligible when concurrency is limited due to congestion. This
|
|
is because the initial concurrency is already at the client's
|
|
concurrency maximum, and because there is 10-100 times more positive
|
|
than negative feedback. Under these conditions, it is no surprise
|
|
that the contribution from SMTP connection caching is also negligible.
|
|
</p>
|
|
|
|
<p> In the last example, the old +/-1 feedback per delivery will
|
|
defer 50% of the mail when confronted with an active (anvil-style)
|
|
server concurrency limit, where the server hangs up immediately
|
|
with a 421 status (a TCP-level RST would have the same result).
|
|
Less aggressive feedback mechanisms fare better than more aggressive
|
|
ones. Concurrency-dependent feedback fares even better at higher
|
|
concurrencies than shown here, but has limitations as discussed in
|
|
the next section. </p>
|
|
|
|
<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>
|
|
|
|
<p> Less-than-1 feedback is of interest primarily when sending large
|
|
amounts of mail to destinations with active concurrency limiters
|
|
(servers that reply with 421, or firewalls that send RST). When
|
|
sending small amounts of mail per destination, less-than-1 per-delivery
|
|
feedback won't have a noticeable effect on the per-destination
|
|
concurrency, because the number of deliveries to the same destination
|
|
is too small. You might just as well use zero per-delivery feedback
|
|
and stay with the initial per-destination concurrency. And when
|
|
mail deliveries fail due to congestion instead of active concurrency
|
|
limiters, the measurements above show that per-delivery feedback
|
|
has no effect. With large amounts of mail you might just as well
|
|
use zero per-delivery feedback and start with the maximal per-destination
|
|
concurrency. </p>
|
|
|
|
<p> The scheduler with less-than-1 concurrency
|
|
feedback per delivery solves a problem with servers that have active
|
|
concurrency limiters. This works only because feedback is handled
|
|
in a peculiar manner: positive feedback will increment the concurrency
|
|
by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
|
|
while negative feedback will decrement concurrency by 1 at the
|
|
<b>beginning</b> of such a sequence. This is how Postfix adjusts
|
|
quickly for overshoot without causing lots of mail to be deferred.
|
|
Without this difference in feedback treatment, less-than-1 feedback
|
|
per delivery would defer 50% of the mail, and would be no better
|
|
in this respect than the old +/-1 feedback per delivery. </p>
|
|
|
|
<p> Unfortunately, the same feature that corrects quickly for
|
|
concurrency overshoot also makes the scheduler more sensitive for
|
|
noisy negative feedback. The reason is that one lonely negative
|
|
feedback event has the same effect as a complete sequence of length
|
|
1/feedback: in both cases delivery concurrency is dropped by 1
|
|
immediately. As a worst-case scenario, consider multiple servers
|
|
behind a load balancer on a single IP address, and no backup MX
|
|
address. When 1 out of K servers fails to complete the SMTP handshake
|
|
or drops the connection, a scheduler with 1/N (N = concurrency)
|
|
feedback stops increasing its concurrency once it reaches a concurrency
|
|
level of about K, even though the good servers behind the load
|
|
balancer are perfectly capable of handling more traffic. </p>
|
|
|
|
<p> This noise problem gets worse as the amount of positive feedback
|
|
per delivery gets smaller. A compromise is to use fixed less-than-1
|
|
positive feedback values instead of concurrency-dependent positive
|
|
feedback. For example, to tolerate 1 of 4 bad servers in the above
|
|
load balancer scenario, use positive feedback of 1/4 per "good"
|
|
delivery (no connect or handshake error), and use an equal or smaller
|
|
amount of negative feedback per "bad" delivery. The downside of
|
|
using concurrency-independent feedback is that some of the old +/-1
|
|
feedback problems will return at large concurrencies. Sites that
|
|
must deliver mail at non-trivial per-destination concurrencies will
|
|
require special configuration. </p>
|
|
|
|
<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>
|
|
|
|
<p> The Postfix 2.5 concurrency scheduler is controlled with the
|
|
following configuration parameters, where "<i>transport</i>_foo"
|
|
provides a transport-specific parameter override. All parameter
|
|
default settings are compatible with earlier Postfix versions. </p>
|
|
|
|
<blockquote>
|
|
|
|
<table border="0">
|
|
|
|
<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
|
|
Description </th> </tr>
|
|
|
|
<tr> <td colspan="3"> <hr> </td> </tr>
|
|
|
|
<tr> <td> initial_destination_concurrency<br>
|
|
<i>transport</i>_initial_destination_concurrency </td> <td
|
|
align="center"> all<br> 2.5 </td> <td> Initial per-destination
|
|
delivery concurrency </td> </tr>
|
|
|
|
<tr> <td> default_destination_concurrency_limit<br>
|
|
<i>transport</i>_destination_concurrency_limit </td> <td align="center">
|
|
all<br> all </td> <td> Maximum per-destination delivery concurrency
|
|
</td> </tr>
|
|
|
|
<tr> <td> default_destination_concurrency_positive_feedback<br>
|
|
<i>transport</i>_destination_concurrency_positive_feedback </td>
|
|
<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
|
|
feedback amount, per delivery that does not fail with connection
|
|
or handshake failure </td> </tr>
|
|
|
|
<tr> <td> default_destination_concurrency_negative_feedback<br>
|
|
<i>transport</i>_destination_concurrency_negative_feedback </td>
|
|
<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
|
|
feedback amount, per delivery that fails with connection or handshake
|
|
failure </td> </tr>
|
|
|
|
<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
|
|
<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
|
|
<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
|
|
pseudo-cohorts after which a destination is declared "dead" and
|
|
delivery is suspended </td> </tr>
|
|
|
|
<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
|
|
2.5 </td> <td> Enable verbose logging of concurrency scheduler
|
|
activity </td> </tr>
|
|
|
|
<tr> <td colspan="3"> <hr> </td> </tr>
|
|
|
|
</table>
|
|
|
|
</blockquote>
|
|
|
|
<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
|
|
|
|
<p>
|
|
|
|
The following sections describe the new queue manager and its
|
|
preemptive scheduler algorithm. Note that the document was originally
|
|
written to describe the changes between the new queue manager (in
|
|
this text referred to as <tt>nqmgr</tt>, the name it was known by
|
|
before it became the default queue manager) and the old queue manager
|
|
(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
|
|
every so often.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
This document is divided into sections as follows:
|
|
|
|
</p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
|
|
nqmgr </a>
|
|
|
|
<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
|
|
up the message </a> - how it is assigned to transports, jobs, peers,
|
|
entries
|
|
|
|
<li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection
|
|
works </a>
|
|
|
|
<li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption
|
|
works </a> - what messages may be preempted and how and what messages
|
|
are chosen to preempt them
|
|
|
|
<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
|
|
limits affect the scheduling algorithm </a>
|
|
|
|
<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
|
|
limits </a>
|
|
|
|
</ul>
|
|
|
|
<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
|
|
nqmgr </a> </h3>
|
|
|
|
<p>
|
|
|
|
Let's start by recapitulating the structures and terms used when
|
|
referring to the queue manager and how it operates. Many of these are
|
|
partially described elsewhere, but it is nice to have a coherent
|
|
overview in one place:
|
|
|
|
</p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> Each message structure represents one mail message which
|
|
Postfix is to deliver. The message recipients specify to what
|
|
destinations is the message to be delivered and what transports are
|
|
going to be used for the delivery. </p>
|
|
|
|
<li> <p> Each recipient entry groups a batch of recipients of one
|
|
message which are all going to be delivered to the same destination
|
|
(and over the same transport).
|
|
</p>
|
|
|
|
<li> <p> Each transport structure groups everything what is going
|
|
to be delivered by delivery agents dedicated for that transport.
|
|
Each transport maintains a set of queues (describing the destinations
|
|
it shall talk to) and jobs (referencing the messages it shall
|
|
deliver). </p>
|
|
|
|
<li> <p> Each transport queue (not to be confused with the on-disk
|
|
"active" queue or "incoming" queue) groups everything what is going be
|
|
delivered to given destination (aka nexthop) by its transport. Each
|
|
queue belongs to one transport, so each destination may be referred
|
|
to by several queues, one for each transport. Each queue maintains
|
|
a list of all recipient entries (batches of message recipients)
|
|
which shall be delivered to given destination (the todo list), and
|
|
a list of recipient entries already being delivered by the delivery
|
|
agents (the busy list). </p>
|
|
|
|
<li> <p> Each queue corresponds to multiple peer structures. Each
|
|
peer structure is like the queue structure, belonging to one transport
|
|
and referencing one destination. The difference is that it lists
|
|
only the recipient entries which all originate from the same message,
|
|
unlike the queue structure, whose entries may originate from various
|
|
messages. For messages with few recipients, there is usually just
|
|
one recipient entry for each destination, resulting in one recipient
|
|
entry per peer. But for large mailing list messages the recipients
|
|
may need to be split to multiple recipient entries, in which case
|
|
the peer structure may list many entries for single destination.
|
|
</p>
|
|
|
|
<li> <p> Each transport job groups everything it takes to deliver
|
|
one message via its transport. Each job represents one message
|
|
within the context of the transport. The job belongs to one transport
|
|
and message, so each message may have multiple jobs, one for each
|
|
transport. The job groups all the peer structures, which describe
|
|
the destinations the job's message has to be delivered to. </p>
|
|
|
|
</ul>
|
|
|
|
<p>
|
|
|
|
The first four structures are common to both <tt>nqmgr</tt> and
|
|
<tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
These terms are used extensively in the text below, feel free to
|
|
look up the description above anytime you'll feel you have lost a
|
|
sense what is what.
|
|
|
|
</p>
|
|
|
|
<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
|
|
up the message </a> </h3>
|
|
|
|
<p>
|
|
|
|
Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
|
|
the following happens: It reads all necessary information from the
|
|
queue file as <tt>oqmgr</tt> does, and also reads as many recipients
|
|
as possible - more on that later, for now let's just pretend it
|
|
always reads all recipients.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Then it resolves the recipients as <tt>oqmgr</tt> does, which
|
|
means obtaining (address, nexthop, transport) triple for each
|
|
recipient. For each triple, it finds the transport; if it does not
|
|
exist yet, it instantiates it (unless it's dead). Within the
|
|
transport, it finds the destination queue for the given nexthop; if it
|
|
does not exist yet, it instantiates it (unless it's dead). The
|
|
triple is then bound to given destination queue. This happens in
|
|
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Then for each triple which was bound to some queue (and thus
|
|
transport), the program finds the job which represents the message
|
|
within that transport's context; if it does not exist yet, it
|
|
instantiates it. Within the job, it finds the peer which represents
|
|
the bound destination queue within this jobs context; if it does
|
|
not exist yet, it instantiates it. Finally, it stores the address
|
|
from the resolved triple to the recipient entry which is appended
|
|
to both the queue entry list and the peer entry list. The addresses
|
|
for the same nexthop are batched in the entries up to the
|
|
<i>transport</i>_destination_recipient_limit for that transport.
|
|
This happens in qmgr_message_assign(), and apart
|
|
from that it operates with job and peer structures, it is basically the
|
|
same as in <tt>oqmgr</tt>.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
When the job is instantiated, it is enqueued on the transport's job
|
|
list based on the time its message was picked up by <tt>nqmgr</tt>.
|
|
For first batch of recipients this means it is appended to the end
|
|
of the job list, but the ordering of the job list by the enqueue
|
|
time is important as we will see shortly.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
[Now you should have a pretty good idea what the state of the
|
|
<tt>nqmgr</tt> is after a couple of messages were picked up, and what the
|
|
relation is between all those job, peer, queue and entry structures.]
|
|
|
|
</p>
|
|
|
|
<h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection
|
|
works </a> </h3>
|
|
|
|
<p>
|
|
|
|
Having prepared all those above mentioned structures, the task of
|
|
the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
|
|
one at a time and pass them to the delivery agent for corresponding
|
|
transport. Now how does this work?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The first approximation of the new scheduling algorithm is like this:
|
|
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>
|
|
foreach transport (round-robin-by-transport)
|
|
do
|
|
if transport busy continue
|
|
if transport process limit reached continue
|
|
foreach transport's job (in the order of the transport's job list)
|
|
do
|
|
foreach job's peer (round-robin-by-destination)
|
|
if peer->queue->concurrency < peer->queue->window
|
|
return next peer entry.
|
|
done
|
|
done
|
|
done
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
|
|
Now what is the "order of the transport's job list"? As we know
|
|
already, the job list is by default kept in the order the message
|
|
was picked up by the <tt>nqmgr</tt>. So by default we get the
|
|
top-level round-robin transport, and within each transport we get
|
|
the FIFO message delivery. The round-robin of the peers by the
|
|
destination is perhaps of little importance in most real-life cases
|
|
(unless the <i>transport</i>_destination_recipient_limit is reached,
|
|
in one job there
|
|
is only one peer structure for each destination), but theoretically
|
|
it makes sure that even within single jobs, destinations are treated
|
|
fairly.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
[By now you should have a feeling you really know how the scheduler
|
|
works, except for the preemption, under ideal conditions - that is,
|
|
no recipient resource limits and no destination concurrency problems.]
|
|
|
|
</p>
|
|
|
|
<h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption
|
|
works </a> </h3>
|
|
|
|
<p>
|
|
|
|
As you might perhaps expect by now, the transport's job list does
|
|
not remain sorted by the job's message enqueue time all the time.
|
|
The most cool thing about <tt>nqmgr</tt> is not the simple FIFO
|
|
delivery, but that it is able to slip mail with little recipients
|
|
past the mailing-list bulk mail. This is what the job preemption
|
|
is about - shuffling the jobs on the transport's job list to get
|
|
the best message delivery rates. Now how is it achieved?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
First I have to tell you that there are in fact two job lists in
|
|
each transport. One is the scheduler's job list, which the scheduler
|
|
is free to play with, while the other one keeps the jobs always
|
|
listed in the order of the enqueue time and is used for recipient
|
|
pool management we will discuss later. For now, we will deal with
|
|
the scheduler's job list only.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
So, we have the job list, which is first ordered by the time the
|
|
jobs' messages were enqueued, oldest messages first, the most recently
|
|
picked one at the end. For now, let's assume that there are no
|
|
destination concurrency problems. Without preemption, we pick some
|
|
entry of the first (oldest) job on the queue, assign it to delivery
|
|
agent, pick another one from the same job, assign it again, and so
|
|
on, until all the entries are used and the job is delivered. We
|
|
would then move onto the next job and so on and on. Now how do we
|
|
manage to sneak in some entries from the recently added jobs when
|
|
the first job on the job list belongs to a message going to the
|
|
mailing-list and has thousands of recipient entries?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
|
|
the delivery time of that first job by some constant for free - it
|
|
is basically the same trick you might remember as "accumulation of
|
|
potential" from the amortized complexity lessons. For example,
|
|
instead of delivering the entries of the first job on the job list
|
|
every time a delivery agent becomes available, we can do it only
|
|
every second time. If you view the moments the delivery agent becomes
|
|
available on a timeline as "delivery slots", then instead of using
|
|
every delivery slot for the first job, we can use only every other
|
|
slot, and still the overall delivery efficiency of the first job
|
|
remains the same. So the delivery <tt>11112222</tt> becomes
|
|
<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
|
|
denotes the free slot). Now what do we do with free slots?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
As you might have guessed, we will use them for sneaking the mail
|
|
with little recipients in. For example, if we have one four-recipient
|
|
mail followed by four one recipients mail, the delivery sequence
|
|
(that is, the sequence in which the jobs are assigned to the
|
|
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
|
|
for sneaking in the single recipient mail, but how do we sneak in
|
|
the mail with more than one recipient? Say if we have one four-recipient
|
|
mail followed by two two-recipient mails?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The simple answer would be to use delivery sequence <tt>12121313</tt>.
|
|
But the problem is that this does not scale well. Imagine you have
|
|
mail with a thousand recipients followed by mail with a hundred recipients.
|
|
It is tempting to suggest the delivery sequence like <tt>121212....</tt>,
|
|
but alas! Imagine there arrives another mail with say ten recipients.
|
|
But there are no free slots anymore, so it can't slip by, not even
|
|
if it had only one recipient. It will be stuck until the
|
|
hundred-recipient mail is delivered, which really sucks.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
So, it becomes obvious that while inflating the message to get
|
|
free slots is a great idea, one has to be really careful of how the
|
|
free slots are assigned, otherwise one might corner himself. So,
|
|
how does <tt>nqmgr</tt> really use the free slots?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The key idea is that one does not have to generate the free slots
|
|
in a uniform way. The delivery sequence <tt>111...1</tt> is no
|
|
worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
|
|
entries are in the first case selected earlier than in the second
|
|
case, and none is selected later! So it is possible to first
|
|
"accumulate" the free delivery slots and then use them all at once.
|
|
It is even possible to accumulate some, then use them, then accumulate
|
|
some more and use them again, as in <tt>11..1.1</tt> .
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Let's get back to the one hundred recipient example. We now know
|
|
that we could first accumulate one hundred free slots, and only
|
|
after then to preempt the first job and sneak the one hundred
|
|
recipient mail in. Applying the algorithm recursively, we see the
|
|
hundred recipient job can accumulate ten free delivery slots, and
|
|
then we could preempt it and sneak in the ten-recipient mail...
|
|
Wait wait wait! Could we? Aren't we overinflating the original one
|
|
thousand recipient mail?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Well, despite the fact that it looks so at the first glance, another trick will
|
|
allow us to answer "no, we are not!". If we had said that we will
|
|
inflate the delivery time twice at maximum, and then we consider
|
|
every other slot as a free slot, then we would overinflate in case
|
|
of the recursive preemption. BUT! The trick is that if we use only
|
|
every n-th slot as a free slot for n>2, there is always some worst
|
|
inflation factor which we can guarantee not to be breached, even
|
|
if we apply the algorithm recursively. To be precise, if for every
|
|
k>1 normally used slots we accumulate one free delivery slot, than
|
|
the inflation factor is not worse than k/(k-1) no matter how many
|
|
recursive preemptions happen. And it's not worse than (k+1)/k if
|
|
only non-recursive preemption happens. Now, having got through the
|
|
theory and the related math, let's see how <tt>nqmgr</tt> implements
|
|
this.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Each job has so called "available delivery slot" counter. Each
|
|
transport has a <i>transport</i>_delivery_slot_cost parameter, which
|
|
defaults to default_delivery_slot_cost parameter which is set to 5
|
|
by default. This is the k from the paragraph above. Each time k
|
|
entries of the job are selected for delivery, this counter is
|
|
incremented by one. Once there are some slots accumulated, a job which
|
|
requires no more than that number of slots to be fully delivered
|
|
can preempt this job.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
[Well, the truth is, the counter is incremented every time an entry
|
|
is selected and it is divided by k when it is used.
|
|
But to understand, it's good enough to use
|
|
the above approximation of the truth.]
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
OK, so now we know the conditions which must be satisfied so one
|
|
job can preempt another one. But what job gets preempted, how do
|
|
we choose what job preempts it if there are several valid candidates,
|
|
and when does all this exactly happen?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The answer for the first part is simple. The job whose entry was
|
|
selected the last time is the so called current job. Normally, it is
|
|
the first job on the scheduler's job list, but destination concurrency
|
|
limits may change this as we will see later. It is always only the
|
|
current job which may get preempted.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Now for the second part. The current job has a certain amount of
|
|
recipient entries, and as such may accumulate at maximum some amount
|
|
of available delivery slots. It might have already accumulated some,
|
|
and perhaps even already used some when it was preempted before
|
|
(remember a job can be preempted several times). In either case,
|
|
we know how many are accumulated and how many are left to deliver,
|
|
so we know how many it may yet accumulate at maximum. Every other
|
|
job which may be delivered by less than that number of slots is a
|
|
valid candidate for preemption. How do we choose among them?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The answer is - the one with maximum enqueue_time/recipient_entry_count.
|
|
That is, the older the job is, the more we should try to deliver
|
|
it in order to get best message delivery rates. These rates are of
|
|
course subject to how many recipients the message has, therefore
|
|
the division by the recipient (entry) count. No one shall be surprised
|
|
that a message with n recipients takes n times longer to deliver than
|
|
a message with one recipient.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Now let's recap the previous two paragraphs. Isn't it too complicated?
|
|
Why don't the candidates come only among the jobs which can be
|
|
delivered within the number of slots the current job already
|
|
accumulated? Why do we need to estimate how much it has yet to
|
|
accumulate? If you found out the answer, congratulate yourself. If
|
|
we did it this simple way, we would always choose the candidate
|
|
with the fewest recipient entries. If there were enough single recipient
|
|
mails coming in, they would always slip by the bulk mail as soon
|
|
as possible, and the two or more recipients mail would never get
|
|
a chance, no matter how long they have been sitting around in the
|
|
job list.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
This candidate selection has an interesting implication - that when
|
|
we choose the best candidate for preemption (this is done in
|
|
qmgr_choose_candidate()), it may happen that we may not use it for
|
|
preemption immediately. This leads to an answer to the last part
|
|
of the original question - when does the preemption happen?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The preemption attempt happens every time next transport's recipient
|
|
entry is to be chosen for delivery. To avoid needless overhead, the
|
|
preemption is not attempted if the current job could never accumulate
|
|
more than <i>transport</i>_minimum_delivery_slots (defaults to
|
|
default_minimum_delivery_slots which defaults to 3). If there are
|
|
already enough accumulated slots to preempt the current job by the
|
|
chosen best candidate, it is done immediately. This basically means
|
|
that the candidate is moved in front of the current job on the
|
|
scheduler's job list and decreasing the accumulated slot counter
|
|
by the amount used by the candidate. If there are not enough slots...
|
|
well, I could say that nothing happens and the another preemption
|
|
is attempted the next time. But that's not the complete truth.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The truth is that it turns out that it is not really necessary to
|
|
wait until the jobs counter accumulates all the delivery slots in
|
|
advance. Say we have ten-recipient mail followed by two two-recipient
|
|
mails. If the preemption happened when enough delivery slots accumulate
|
|
(assuming slot cost 2), the delivery sequence becomes
|
|
<tt>11112211113311</tt>. Now what would we get if we would wait
|
|
only for 50% of the necessary slots to accumulate and we promise
|
|
we would wait for the remaining 50% later, after we get back
|
|
to the preempted job? If we use such a slot loan, the delivery sequence
|
|
becomes <tt>11221111331111</tt>. As we can see, it makes it not
|
|
considerably worse for the delivery of the ten-recipient mail, but
|
|
it allows the small messages to be delivered sooner.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The concept of these slot loans is where the
|
|
<i>transport</i>_delivery_slot_discount and
|
|
<i>transport</i>_delivery_slot_loan come from (they default to
|
|
default_delivery_slot_discount and default_delivery_slot_loan, whose
|
|
values are by default 50 and 3, respectively). The discount (resp.
|
|
loan) specifies how many percent (resp. how many slots) one "gets
|
|
in advance", when the number of slots required to deliver the best
|
|
candidate is compared with the number of slots the current slot had
|
|
accumulated so far.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
And that pretty much concludes this chapter.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
[Now you should have a feeling that you pretty much understand the
|
|
scheduler and the preemption, or at least that you will have
|
|
after you read the last chapter a couple more times. You shall clearly
|
|
see the job list and the preemption happening at its head, in ideal
|
|
delivery conditions. The feeling of understanding shall last until
|
|
you start wondering what happens if some of the jobs are blocked,
|
|
which you might eventually figure out correctly from what had been
|
|
said already. But I would be surprised if your mental image of the
|
|
scheduler's functionality is not completely shattered once you
|
|
start wondering how it works when not all recipients may be read
|
|
in-core. More on that later.]
|
|
|
|
</p>
|
|
|
|
<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
|
|
limits affect the scheduling algorithm </a> </h3>
|
|
|
|
<p>
|
|
|
|
The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
|
|
control as <tt>oqmgr</tt>. Now what happens when the destination
|
|
limits are reached and no more entries for that destination may be
|
|
selected by the scheduler?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
From the user's point of view it is all simple. If some of the peers
|
|
of a job can't be selected, those peers are simply skipped by the
|
|
entry selection algorithm (the pseudo-code described before) and
|
|
only the selectable ones are used. If none of the peers may be
|
|
selected, the job is declared a "blocker job". Blocker jobs are
|
|
skipped by the entry selection algorithm and they are also excluded
|
|
from the candidates for preemption of the current job. Thus the scheduler
|
|
effectively behaves as if the blocker jobs didn't exist on the job
|
|
list at all. As soon as at least one of the peers of a blocker job
|
|
becomes unblocked (that is, the delivery agent handling the delivery
|
|
of the recipient entry for the given destination successfully finishes),
|
|
the job's blocker status is removed and the job again participates
|
|
in all further scheduler actions normally.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
So the summary is that the users don't really have to be concerned
|
|
about the interaction of the destination limits and scheduling
|
|
algorithm. It works well on its own and there are no knobs they
|
|
would need to control it.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
From a programmer's point of view, the blocker jobs complicate the
|
|
scheduler quite a lot. Without them, the jobs on the job list would
|
|
be normally delivered in strict FIFO order. If the current job is
|
|
preempted, the job preempting it is completely delivered unless it
|
|
is preempted itself. Without blockers, the current job is thus
|
|
always either the first job on the job list, or the top of the stack
|
|
of jobs preempting the first job on the job list.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The visualization of the job list and the preemption stack without
|
|
blockers would be like this:
|
|
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>
|
|
first job-> 1--2--3--5--6--8--... <- job list
|
|
on job list |
|
|
4 <- preemption stack
|
|
|
|
|
current job-> 7
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
|
|
In the example above we see that job 1 was preempted by job 4 and
|
|
then job 4 was preempted by job 7. After job 7 is completed, remaining
|
|
entries of job 4 are selected, and once they are all selected, job
|
|
1 continues.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
As we see, it's all very clean and straightforward. Now how does
|
|
this change because of blockers?
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The answer is: a lot. Any job may become a blocker job at any time,
|
|
and also become a normal job again at any time. This has several
|
|
important implications:
|
|
|
|
</p>
|
|
|
|
<ol>
|
|
|
|
<li> <p>
|
|
|
|
The jobs may be completed in arbitrary order. For example, in the
|
|
example above, if the current job 7 becomes blocked, the next job
|
|
4 may complete before the job 7 becomes unblocked again. Or if both
|
|
7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
|
|
and is completed, then 2 is completed and only after that 4 becomes
|
|
unblocked and is completed... You get the idea.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
[Interesting side note: even when jobs are delivered out of order,
|
|
from a single destination's point of view the jobs are still delivered
|
|
in the expected order (that is, FIFO unless there was some preemption
|
|
involved). This is because whenever a destination queue becomes
|
|
unblocked (the destination limit allows selection of more recipient
|
|
entries for that destination), all jobs which have peers for that
|
|
destination are unblocked at once.]
|
|
|
|
</p>

<li> <p> The idea of the preemption stack at the head of the job list is
gone. That is, it must be possible to preempt any job on the job
list. For example, if jobs 7, 4, 1 and 2 in the example above all
become blocked, job 3 becomes the current job. And of course we do
not want preemption to be affected by whether some jobs happen to
be blocked or not. Therefore, if it turns out that job 3 might be
preempted by job 6, the implementation must make that possible. </p>

<li> <p> The idea of the linear preemption stack itself is gone. It's no
longer true that one job is always preempted by only one job at one
time (that is, directly preempted, not counting the recursively
nested jobs). For example, in the example above, job 1 is directly
preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
blocked, and job 4 is being delivered. If it accumulates enough
delivery slots, it is natural that it might be preempted, for example,
by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
same time. </p>

</ol>

<p> Now combine points 2) and 3) with point 1) again and you realize
that the relations on the once linear job list become pretty
complicated. If we extend the point 3) example: jobs 7 and 8 preempt
job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
huh? </p>

<p> If we illustrate the relations after the above-mentioned examples
(excluding those in point 1), the situation looks like this: </p>

<blockquote>
<pre>
                          v- parent

adoptive parent ->  1--2--3--5--...        &lt;- "stack" level 0
                    |     |
parent gone ->      ?     6                &lt;- "stack" level 1
                   / \
  children ->     7   8   ^- child         &lt;- "stack" level 2
                  ^- siblings
</pre>
</blockquote>

<p> Now how does <tt>nqmgr</tt> deal with all these complicated
relations? </p>

<p> Well, it maintains them all as described, but fortunately, all these
relations are necessary only for the purpose of proper counting of
available delivery slots. For the purpose of ordering the jobs for
entry selection, the original rule still applies: "the job preempting
the current job is moved in front of the current job on the job
list". So for entry selection purposes, the job relations remain
as simple as this: </p>

<blockquote>
<pre>
7--8--1--2--6--3--5--..   &lt;- scheduler's job list order
</pre>
</blockquote>

<p> The job list order and the preemption parent/child/siblings relations
are maintained separately. And because the selection works only
with the job list, you can happily forget about those complicated
relations unless you want to study the <tt>nqmgr</tt> sources. In
that case the text above might provide some helpful introduction
to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs
are simply ignored. </p>
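
<p> For illustration only, one way to express this separation in code
would be to give each job two independent sets of links - one for the
selection order, one for the slot-accounting relations. The structure
and field names below are made up for this sketch and are not the
actual <tt>nqmgr</tt> data structures. </p>

<blockquote>
<pre>
/* Illustrative sketch only - not the real qmgr structures. */

struct sketch_job {
    /* Selection order: the per-transport job list used by entry selection. */
    struct sketch_job *list_prev;
    struct sketch_job *list_next;

    /*
     * Slot accounting: preemption relations, which may differ from the
     * list order once blockers enter the picture.
     */
    struct sketch_job *parent;          /* job this one preempted; may be gone */
    struct sketch_job *children;        /* first job that preempted this one */
    struct sketch_job *sibling;         /* next child of the same parent */

    int     available_slots;            /* delivery slot counter described earlier */
};
</pre>
</blockquote>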

<p> [By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best you can do for the sake
of your mind's health, and you basically stick with the idea of how
the scheduler works in ideal conditions, when there are no blockers,
which is good enough.] </p>

<h3> <a name="nqmgr_memory"> Dealing with memory resource
limits </a> </h3>

<p> When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
that all recipients of all messages in the "active" queue are completely
read into memory. This is simply not true. There is an upper
bound on the amount of memory the <tt>nqmgr</tt> may use, and
therefore it must impose some limits on the information it may store
in memory at any given time. </p>

<p> First of all, not all messages may be read in-core at once. At any
time, at most qmgr_message_active_limit messages may be read in-core.
When read into memory, the messages are picked from the
"incoming" and "deferred" queues and moved to the "active" queue
(incoming having priority), so when the "active" queue already contains
qmgr_message_active_limit messages, the rest will have to wait until
(some of) the messages in the "active" queue are completely delivered
(or deferred). </p>
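
<p> For reference, this limit is an ordinary main.cf parameter. The
value shown below is only an illustration (it matches the compiled-in
default on typical installations; run "postconf -d
qmgr_message_active_limit" to see the default on your system): </p>

<blockquote>
<pre>
# main.cf - upper bound on the number of in-core ("active" queue) messages
qmgr_message_active_limit = 20000
</pre>
</blockquote>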

<p> Even with the limited number of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
recipients of all messages in the "active" queue, it could easily run
out of memory. Therefore there must be some upper bound on the
number of message recipients which are read into memory at the
same time. </p>

<p> Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the sole existence of the limits themselves
affects the <tt>nqmgr</tt> and its scheduler. </p>

<p> The message limit is straightforward - it just limits the size of
the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
message can preempt the current one. Messages not in the "active" queue
are simply not considered at all. </p>

<p> The recipient limit complicates more things. First of all, the
message reading code must support reading the recipients in batches,
which among other things means accessing the queue file several
times and continuing where the last recipient batch ended. This is
invoked by the scheduler whenever the current job has space for more
recipients, subject to the transport's refill_limit and refill_delay
parameters. It is also done any time when all in-core recipients of
the message are dealt with (which may also mean they were deferred)
but there are still more in the queue file. </p>

<p> The second complication is that with some recipients left unread
in the queue file, the scheduler can't operate with exact counts
of recipient entries. With unread recipients, it is not clear how
many recipient entries there will be, as they are subject to
per-destination grouping. It is not even clear to what transports
(and thus jobs) the recipients will be assigned. And with messages
coming from the "deferred" queue, it is not even clear how many unread
recipients are still to be delivered. This all means that the
scheduler must use only estimates of how many recipient entries
there will be. Fortunately, it is possible to estimate the minimum
and maximum correctly, so the scheduler can always err on the safe
side. Obviously, the better the estimates, the better the results, so
it is best when we are able to read all recipients in-core and turn
the estimates into exact counts, or at least try to read as many
as possible to make the estimates as accurate as possible. </p>

<p> The third complication is that it is no longer true that the scheduler
is done with a job once all of its in-core recipients are delivered.
It is possible that the job will be revived later, when another
batch of recipients is read in core. It is also possible that some
jobs will be created for the first time long after the first batch
of recipients was read in core. The <tt>nqmgr</tt> code must be
ready to handle all such situations. </p>

<p> And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. Now how does
it achieve it? </p>

<p> Perhaps the easiest solution would be to say that each message may
have at most X recipients stored in-core, but such a solution would
be poor for several reasons. With reasonable qmgr_message_active_limit
values, the X would have to be quite low to maintain a reasonable
memory footprint, and with a low X lots of things would not work well.
The <tt>nqmgr</tt> would have trouble using the
<i>transport</i>_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal as the recipient count
estimates would be inaccurate. The message queue file would have
to be accessed many times to read in more recipients again and
again. </p>

<p> Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots, which can be shared among all messages
in the most efficient manner. And as, whenever possible, we do not
want separate transports to compete for resources, it seems
appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice? </p>

<p> First we have to solve a little chicken-and-egg problem. If we want
to use the per-transport recipient pools, we first need to know to
what transport(s) the message is assigned. But we will find that
out only after we first read in the recipients. So it is obvious
that we first have to read in some recipients, use them to find out
to what transports the message is to be assigned, and only after
that can we use the per-transport recipient pools. </p>

<p> Now how many recipients shall we read for the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what. It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even when there are only a few messages with few recipients in-core
would be wasteful. Therefore, if there are fewer than
qmgr_message_recipient_limit recipients in-core so far, the first
batch of recipients may be larger than qmgr_message_recipient_minimum -
as large as is required to reach the qmgr_message_recipient_limit
limit. </p>
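
<p> As a minimal sketch of the rule just described (illustrative C
again, not the actual qmgr code - the function and variable names are
made up), the size of the very first recipient batch of a message
could be computed like this: </p>

<blockquote>
<pre>
/* Illustrative sketch only. */
static int first_batch_size(int recipient_minimum,  /* qmgr_message_recipient_minimum */
                            int recipient_limit,    /* qmgr_message_recipient_limit */
                            int recipients_in_core) /* in-core recipients of all messages */
{
    int     batch = recipient_minimum;          /* always read at least this many */

    /*
     * While the global in-core recipient count is still below
     * qmgr_message_recipient_limit, the first batch may grow until
     * that limit would be reached.
     */
    if (recipient_limit - recipients_in_core > batch)
        batch = recipient_limit - recipients_in_core;
    return (batch);
}
</pre>
</blockquote>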

<p> Once the first batch of recipients was read in core and the message
jobs were created, the size of the subsequent recipient batches (if
any - of course it's best when all recipients are read in one batch)
is based solely on the position of the message jobs on their
corresponding transports' job lists. Each transport has a pool of
<i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
The subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies). </p>

<p> For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients in-core
and 5 recipient slots, and the third with 2 recipients in-core and 0
recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 jobs'
recipient slots in total. This means that we could immediately
read 2+qmgr_message_recipient_minimum more recipients of that message
in core. </p>

<p> The above example illustrates several things which might be worth
mentioning explicitly: first, note that although the per-transport
slots are assigned to particular jobs, we can't guarantee that once
the next batch of recipients is read in core, the corresponding
amounts of recipients will be assigned to those jobs. The jobs lend
their slots to the message as a whole, so it is possible that some
jobs end up sponsoring other jobs of their message. For example,
if in the example above the 2 newly read recipients were assigned
to the second job, the first job sponsored the second job with 2
slots. The second notable thing is the third job, which has more
recipients in-core than it has slots. Apart from the sponsoring by
another job that we just saw, this can also be the result of the first
recipient batch, which is sponsored from the global recipient pool of
qmgr_message_recipient_limit recipients. It can also be sponsored from
the message recipient pool of qmgr_message_recipient_minimum
recipients. </p>

<p> Now how does each transport distribute the recipient slots among
its jobs? The strategy is quite simple. As most scheduler activity
happens on the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
recipients read in-core. Therefore the transport distributes the
slots "along" the job list from start to end. In this case the job
list sorted by message enqueue time is used, because it doesn't
change over time as the scheduler's job list does. </p>

<p> More specifically, each time a job is created and appended to the
job list, it gets all unused recipient slots from its transport's
pool. It keeps them until all recipients of its message are read.
When this happens, all unused recipient slots are transferred to
the next job (which is now in fact the first such job) on the job
list which still has some recipients unread, or eventually back to
the transport pool if there is no such job. Such a transfer then also
happens whenever a recipient entry of that job is delivered. </p>
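
<p> The following sketch (illustrative C with made-up type and field
names, not the actual qmgr code) shows where a job's unused recipient
slots could go once its message has no unread recipients left, or
whenever one of its recipient entries is delivered: </p>

<blockquote>
<pre>
/* Illustrative sketch only - not the real qmgr code. */

struct sk_job {
    struct sk_job *time_next;           /* next job in message enqueue time order */
    int     unused_slots;               /* recipient slots held but not yet used */
    int     has_unread_recipients;      /* message still has recipients in its queue file */
};

struct sk_transport {
    int     pool_slots;                 /* unused slots in the transport's recipient pool */
};

void    pass_on_unused_slots(struct sk_transport *tp, struct sk_job *donor)
{
    struct sk_job *job;

    /* Find the first job after the donor whose message is not fully read yet. */
    for (job = donor->time_next; job != 0; job = job->time_next)
        if (job->has_unread_recipients)
            break;

    if (job != 0)
        job->unused_slots += donor->unused_slots;   /* hand the slots over */
    else
        tp->pool_slots += donor->unused_slots;      /* or return them to the pool */
    donor->unused_slots = 0;
}
</pre>
</blockquote>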

<p> There is also a scenario when a job is not appended to the end of
the job list (for example it was created as a result of a second or
later recipient batch). Then it works exactly as above, except that
if it was put in front of the first unread job (that is, the job
of a message which still has some unread recipients in the queue file),
that job is first forced to return all of its unused recipient slots
to the transport pool. </p>

<p> The algorithm just described leads to the following state: the first
unread job on the job list always gets all the remaining recipient
slots of that transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message were already read in core) and have at most as many slots
as they still have recipients in-core (the maximum is there because
of the sponsoring mentioned before). The jobs after this job get
nothing from the transport recipient pool, unless they got something
before and the first unread job was only later created and enqueued
in front of them - in such a case, they also get at most as many
slots as they have recipients in-core. </p>

<p> Things work fine in such a state for most of the time, because the
current job is either completely read in-core or has as many recipient
slots as there are, but there is one situation which we still have
to take care of specially. Imagine if the current job is preempted
by some unread job from the job list and there are no more recipient
slots available, so this new current job could read only batches
of qmgr_message_recipient_minimum recipients at a time. This would
really degrade performance. For this reason, each transport has an
extra pool of <i>transport</i>_extra_recipient_limit recipient
slots, dedicated exactly for this situation. Each time an unread
job preempts the current job, it gets half of the remaining recipient
slots from the normal pool and this extra pool. </p>
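
<p> As a final sketch (made-up names again, and assuming that "half of
the remaining recipient slots" means half of the combined normal and
extra pools, taken from the normal pool first - the exact accounting
in the real code may differ), the slot grant for an unread job that
preempts the current job could look like this: </p>

<blockquote>
<pre>
/* Illustrative sketch only. */

struct sk_pools {
    int     normal_slots;       /* unused slots from *_recipient_limit */
    int     extra_slots;        /* unused slots from *_extra_recipient_limit */
};

int     grant_slots_to_preempting_unread_job(struct sk_pools *tp)
{
    int     grant = (tp->normal_slots + tp->extra_slots) / 2;
    int     from_normal = grant;

    if (from_normal > tp->normal_slots)
        from_normal = tp->normal_slots;

    tp->normal_slots -= from_normal;
    tp->extra_slots -= grant - from_normal;
    return (grant);
}
</pre>
</blockquote>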

<p> And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care exactly how it works as long
as it works. Perhaps the only important things to know for most
people are the following upper bound formulas: </p>

<p> Each transport has at maximum </p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ *_recipient_limit + *_extra_recipient_limit,
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p> recipients in core. </p>

<p> The total number of recipients in core is </p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ sum( *_recipient_limit + *_extra_recipient_limit ),
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p> where the sum is over all used transports. </p>

<p> And this terribly complicated chapter concludes the documentation
of the <tt>nqmgr</tt> scheduler. </p>

<p> [By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last chapter or two completely, and
fortunately most people really won't. Understanding how the scheduler
works in ideal conditions is more than good enough for the vast majority
of users.] </p>

<h2> <a name="credits"> Credits </a> </h2>

<ul>

<li> Wietse Venema designed and implemented the initial queue manager
with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
feedback.

<li> Patrik Rak designed and implemented preemption where mail with
fewer recipients can slip past mail with more recipients in a
controlled manner, and wrote up its documentation.

<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
behavior. This is when K/N feedback was reviewed (N = concurrency).
The discussion ended without a good solution for both negative
feedback and dead site detection.

<li> Victor Duchovni resumed work on concurrency feedback in the
context of concurrency-limited servers.

<li> Wietse Venema then re-designed the concurrency scheduler in
terms of the simplest possible concepts: less-than-1 concurrency
feedback per delivery, forward and reverse concurrency feedback
hysteresis, and pseudo-cohort failure. At this same time, concurrency
feedback was separated from dead site detection.

<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
and negative concurrency feedback play, and helped to identify some
worst-case scenarios.

</ul>

</body>

</html>