summaryrefslogtreecommitdiffstats
path: root/proto/SCHEDULER_README.html
diff options
context:
space:
mode:
Diffstat (limited to 'proto/SCHEDULER_README.html')
-rw-r--r--proto/SCHEDULER_README.html1839
1 files changed, 1839 insertions, 0 deletions
diff --git a/proto/SCHEDULER_README.html b/proto/SCHEDULER_README.html
new file mode 100644
index 0000000..d80b8f4
--- /dev/null
+++ b/proto/SCHEDULER_README.html
@@ -0,0 +1,1839 @@
+<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
+ "http://www.w3.org/TR/html4/loose.dtd">
+
+<html>
+
+<head>
+
+<title>Postfix Queue Scheduler</title>
+
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+
+</head>
+
+<body>
+
+<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
+Queue Scheduler</h1>
+
+<hr>
+
+<h2> Disclaimer </h2>
+
+<p> Many of the <i>transport</i>-specific configuration parameters
+discussed in this document will not show up in "postconf" command
+output before Postfix version 2.9. This limitation applies to many
+parameters whose name is a combination of a master.cf service name
+such as "relay" and a built-in suffix such as
+"_destination_concurrency_limit". </p>
+
+<h2> Overview </h2>
+
+<p> The queue manager is by far the most complex part of the Postfix
+mail system. It schedules delivery of new mail, retries failed
+deliveries at specific times, and removes mail from the queue after
+the last delivery attempt. There are two major classes of mechanisms
+that control the operation of the queue manager. </p>
+
+<p> Topics covered by this document: </p>
+
+<ul>
+
+<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
+with the number of concurrent deliveries to a specific destination,
+including decisions on when to suspend deliveries after persistent
+failures.
+
+<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
+the selection of email messages and recipients for a given destination.
+
+<li> <a href="#credits"> Credits</a>, something this document would not be
+complete without.
+
+</ul>
+
+<!--
+
+<p> Once started, the qmgr(8) process runs until "postfix reload"
+or "postfix stop". As a persistent process, the queue manager has
+to meet strict requirements with respect to code correctness and
+robustness. Unlike non-persistent daemon processes, the queue manager
+cannot benefit from Postfix's process rejuvenation mechanism that
+limit the impact from resource leaks and other coding errors
+(translation: replacing a process after a short time covers up bugs
+before they can become a problem). </p>
+
+-->
+
+<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
+
+<p> The following sections document the Postfix 2.5 concurrency
+scheduler, after a discussion of the limitations of the earlier
+concurrency scheduler. This is followed by results of medium-concurrency
+experiments, and a discussion of trade-offs between performance and
+robustness. </p>
+
+<p> The material is organized as follows: </p>
+
+<ul>
+
+<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
+concurrency scheduler </a>
+
+<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
+concurrency feedback algorithm </a>
+
+<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
+destination" detection algorithm </a>
+
+<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
+concurrency scheduler </a>
+
+<li> <a href="#concurrency_results"> Results for delivery to
+concurrency limited servers </a>
+
+<li> <a href="#concurrency_discussion"> Discussion of concurrency
+limited server results </a>
+
+<li> <a href="#concurrency_limitations"> Limitations of less-than-1
+per delivery feedback </a>
+
+<li> <a href="#concurrency_config"> Concurrency configuration
+parameters </a>
+
+</ul>
+
+<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
+concurrency scheduler </a> </h3>
+
+<p> From the start, Postfix has used a simple but robust algorithm
+where the per-destination delivery concurrency is decremented by 1
+after delivery failed due to connection or handshake failure, and
+incremented by 1 otherwise. Of course the concurrency is never
+allowed to exceed the maximum per-destination concurrency limit.
+And when a destination's concurrency level drops to zero, the
+destination is declared "dead" and delivery is suspended. </p>
+
+<p> Drawbacks of +/-1 concurrency feedback per delivery are: <p>
+
+<ul>
+
+<li> <p> Overshoot due to exponential delivery concurrency growth
+with each pseudo-cohort(*). This can be an issue with high-concurrency
+channels. For example, with the default initial concurrency of 5,
+concurrency would proceed over time as (5-10-20). </p>
+
+<li> <p> Throttling down to zero concurrency after a single
+pseudo-cohort(*) failure. This was especially an issue with
+low-concurrency channels where a single failure could be sufficient
+to mark a destination as "dead", causing the suspension of further
+deliveries to the affected destination. </p>
+
+</ul>
+
+<p> (*) A pseudo-cohort is a number of delivery requests equal to
+a destination's delivery concurrency. </p>
+
+<p> The revised concurrency scheduler has a highly modular structure.
+It uses separate mechanisms for per-destination concurrency control
+and for "dead destination" detection. The concurrency control in
+turn is built from two separate mechanisms: it supports less-than-1
+feedback per delivery to allow for more gradual concurrency
+adjustments, and it uses feedback hysteresis to suppress concurrency
+oscillations. And instead of waiting for delivery concurrency to
+throttle down to zero, a destination is declared "dead" after a
+configurable number of pseudo-cohorts reports connection or handshake
+failure. </p>
+
+<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
+concurrency feedback algorithm </a> </h3>
+
+<p> We want to increment a destination's delivery concurrency when
+some (not necessarily consecutive) number of deliveries complete
+without connection or handshake failure. This is implemented with
+positive feedback g(N) where N is the destination's delivery
+concurrency. With g(N)=1 feedback per delivery, concurrency increases
+by 1 after each positive feedback event; this gives us the old
+scheduler's exponential growth in time. With g(N)=1/N feedback per
+delivery, concurrency increases by 1 after an entire pseudo-cohort
+N of positive feedback reports; this gives us linear growth in time.
+Less-than-1 feedback per delivery and integer truncation naturally
+give us hysteresis, so that transitions to larger concurrency happen
+every 1/g(N) positive feedback events. </p>
+
+<p> We want to decrement a destination's delivery concurrency when
+some (not necessarily consecutive) number of deliveries complete
+after connection or handshake failure. This is implemented with
+negative feedback f(N) where N is the destination's delivery
+concurrency. With f(N)=1 feedback per delivery, concurrency decreases
+by 1 after each negative feedback event; this gives us the old
+scheduler's behavior where concurrency is throttled down dramatically
+after a single pseudo-cohort failure. With f(N)=1/N feedback per
+delivery, concurrency backs off more gently. Again, less-than-1
+feedback per delivery and integer truncation naturally give us
+hysteresis, so that transitions to lower concurrency happen every
+1/f(N) negative feedback events. </p>
+
+<p> However, with negative feedback we introduce a subtle twist.
+We "reverse" the negative hysteresis cycle so that the transition
+to lower concurrency happens at the <b>beginning</b> of a sequence
+of 1/f(N) negative feedback events. Otherwise, a correction for
+overload would be made too late. This makes the choice of f(N)
+relatively unimportant, as borne out by measurements later in this
+document. </p>
+
+<p> In summary, the main ingredients for the Postfix 2.5 concurrency
+feedback algorithm are a) the option of less-than-1 positive feedback
+per delivery to avoid overwhelming servers, b) the option of
+less-than-1 negative feedback per delivery to avoid giving up too
+fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
+"reverse" hysteresis cycle for negative feedback, so that it can
+correct for overload quickly. </p>
+
+<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>
+
+<p> We want to suspend deliveries to a specific destination after
+some number of deliveries suffers connection or handshake failure.
+The old scheduler declares a destination "dead" when negative (-1)
+feedback throttles the delivery concurrency down to zero. With
+less-than-1 feedback per delivery, this throttling down would
+obviously take too long. We therefore have to separate "dead
+destination" detection from concurrency feedback. This is implemented
+by introducing the concept of pseudo-cohort failure. The Postfix
+2.5 concurrency scheduler declares a destination "dead" after a
+configurable number of pseudo-cohorts suffers from connection or
+handshake failures. The old scheduler corresponds to the special
+case where the pseudo-cohort failure limit is equal to 1. </p>
+
+<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>
+
+<p> The pseudo code shows how the ideas behind new concurrency
+scheduler are implemented as of November 2007. The actual code can
+be found in the module qmgr/qmgr_queue.c. </p>
+
+<pre>
+Types:
+ Each destination has one set of the following variables
+ int concurrency
+ double success
+ double failure
+ double fail_cohorts
+
+Feedback functions:
+ N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
+ positive feedback: g(N) = x/N | x/sqrt(N) | x
+ negative feedback: f(N) = y/N | y/sqrt(N) | y
+
+Initialization:
+ concurrency = initial_concurrency
+ success = 0
+ failure = 0
+ fail_cohorts = 0
+
+After success:
+ fail_cohorts = 0
+ Be prepared for feedback &gt; hysteresis, or rounding error
+ success += g(concurrency)
+ while (success >= 1) Hysteresis 1
+ concurrency += 1 Hysteresis 1
+ failure = 0
+ success -= 1 Hysteresis 1
+ Be prepared for overshoot
+ if (concurrency &gt; concurrency limit)
+ concurrency = concurrency limit
+
+Safety:
+ Don't apply positive feedback unless
+ concurrency &lt; busy_refcount + init_dest_concurrency
+ otherwise negative feedback effect could be delayed
+
+After failure:
+ if (concurrency &gt; 0)
+ fail_cohorts += 1.0 / concurrency
+ if (fail_cohorts &gt; cohort_failure_limit)
+ concurrency = 0
+ if (concurrency &gt; 0)
+ Be prepared for feedback &gt; hysteresis, rounding errors
+ failure -= f(concurrency)
+ while (failure &lt; 0)
+ concurrency -= 1 Hysteresis 1
+ failure += 1 Hysteresis 1
+ success = 0
+ Be prepared for overshoot
+ if (concurrency &lt; 1)
+ concurrency = 1
+</pre>
+
+<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>
+
+<p> Discussions about the concurrency scheduler redesign started
+early 2004, when the primary goal was to find alternatives that did
+not exhibit exponential growth or rapid concurrency throttling. No
+code was implemented until late 2007, when the primary concern had
+shifted towards better handling of server concurrency limits. For
+this reason we measure how well the new scheduler does this
+job. The table below compares mail delivery performance of the old
++/-1 feedback per delivery with several less-than-1 feedback
+functions, for different limited-concurrency server scenarios.
+Measurements were done with a FreeBSD 6.2 client and with FreeBSD
+6.2 and various Linux servers. </p>
+
+<p> Server configuration: </p>
+
+<ul> <li> The mail flow was slowed down with 1 second latency per
+recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
+to make results less dependent on hardware details, by avoiding
+slow-downs by queue file I/O, logging I/O, and network I/O.
+
+<li> Concurrency was limited by the server process limit
+("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
+= static:all"). Postfix was stopped and started after changing the
+process limit, because the same number is also used as the backlog
+argument to the listen(2) system call, and "postfix reload" does
+not re-issue this call.
+
+<li> Mail was discarded with "local_recipient_maps = static:all" and
+"local_transport = discard". The discard action in access maps or
+header/body checks
+could not be used as it fails to update the in_flow_delay counters.
+
+</ul>
+
+<p> Client configuration: </p>
+
+<ul>
+
+<li> Queue file overhead was minimized by sending one message to a
+virtual alias that expanded into 2000 different remote recipients.
+All recipients were accounted for according to the maillog file.
+The virtual_alias_expansion_limit setting was increased to avoid
+complaints from the cleanup(8) server.
+
+<li> The number of deliveries was maximized with
+"smtp_destination_recipient_limit = 2". A smaller limit would cause
+Postfix to schedule the concurrency per recipient instead of domain,
+which is not what we want.
+
+<li> Maximum concurrency was limited with
+"smtp_destination_concurrency_limit = 20", and
+initial_destination_concurrency was set to the same value.
+
+<li> The positive and negative concurrency feedback hysteresis was
+1. Concurrency was incremented by 1 at the END of 1/feedback steps
+of positive feedback, and was decremented by 1 at the START of
+1/feedback steps of negative feedback.
+
+<li> The SMTP client used the default 30s SMTP connect timeout and
+300s SMTP greeting timeout.
+
+</ul>
+
+<h4> Impact of the 30s SMTP connect timeout </h4>
+
+<p> The first results are for a FreeBSD 6.2 server, where our
+artificially low listen(2) backlog results in a very short kernel
+queue for established connections. The table shows that all deferred
+deliveries failed due to a 30s connection timeout, and none failed
+due to a server greeting timeout. This measurement simulates what
+happens when the server's connection queue is completely full under
+load, and the TCP engine drops new connections. </p>
+
+<blockquote>
+
+<table>
+
+<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
+style</th> <th>connection<br> caching</th> <th>percentage<br>
+deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
+<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
+
+<tr> <td align="center" colspan="9"> <hr> </td> </tr>
+
+<tr><td align="center">20</td> <td align="center">5</td> <td
+align="center">1/N</td> <td align="center">no</td> <td
+align="center">9.9</td> <td align="center">19.4</td> <td
+align="center">0.49</td> <td align="center">198</td> <td
+align="center">-</td> </tr>
+
+<tr><td align="center">20</td> <td align="center">5</td> <td
+align="center">1/N</td> <td align="center">yes</td> <td
+align="center">10.3</td> <td align="center">19.4</td> <td
+align="center">0.49</td> <td align="center">206</td> <td
+align="center">-</td> </tr>
+
+<tr><td align="center">20</td> <td align="center">5</td> <td
+align="center">1/sqrt(N)</td> <td align="center">no</td>
+<td align="center">10.4</td> <td align="center">19.6</td> <td
+align="center">0.59</td> <td align="center">208</td> <td
+align="center">-</td> </tr>
+
+<tr><td align="center">20</td> <td align="center">5</td> <td
+align="center">1/sqrt(N)</td> <td align="center">yes</td>
+<td align="center">10.6</td> <td align="center">19.6</td> <td
+align="center">0.61</td> <td align="center">212</td> <td
+align="center">-</td> </tr>
+
+<tr><td align="center">20</td> <td align="center">5</td> <td
+align="center">1</td> <td align="center">no</td> <td
+align="center">10.1</td> <td align="center">19.5</td> <td
+align="center">1.29</td> <td align="center">202</td> <td
+align="center">-</td> </tr>
+
+<tr><td align="center">20</td> <td align="center">5</td> <td
+align="center">1</td> <td align="center">yes</td> <td
+align="center">10.8</td> <td align="center">19.3</td> <td
+align="center">1.57</td> <td align="center">216</td> <td
+align="center">-</td> </tr>
+
+<tr> <td align="center" colspan="9"> <hr> </td> </tr>
+
+</table>
+
+<p> A busy server with a completely full connection queue. N is
+the client delivery concurrency. Failed deliveries time out after
+30s without completing the TCP handshake. See text for a discussion
+of results. </p>
+
+</blockquote>
+
+<h4> Impact of the 300s SMTP greeting timeout </h4>
+
+<p> The next table shows results for a Fedora Core 8 server (results
+for RedHat 7.3 are identical). In this case, the artificially small
+listen(2) backlog argument does not impact our measurement. The
+table shows that practically all deferred deliveries fail after the
+300s SMTP greeting timeout. As these timeouts were 10x longer than
+with the first measurement, we increased the recipient count (and
+thus the running time) by a factor of 10 to keep the results
+comparable. The deferred mail percentages are a factor 10 lower
+than with the first measurement, because the 1s per-recipient delay
+was 1/300th of the greeting timeout instead of 1/30th of the
+connection timeout. </p>
+
+<blockquote>
+
+<table>
+
+<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
+style</th> <th>connection<br> caching</th> <th>percentage<br>
+deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
+<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
+
+<tr> <td align="center" colspan="9"> <hr> </td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/N</td> <td align="center">no</td> <td
+align="center">1.16</td> <td align="center">19.8</td> <td
+align="center">0.37</td> <td align="center">-</td> <td
+align="center">230</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/N</td> <td align="center">yes</td> <td
+align="center">1.36</td> <td align="center">19.8</td> <td
+align="center">0.36</td> <td align="center">-</td> <td
+align="center">272</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/sqrt(N)</td> <td align="center">no</td>
+<td align="center">1.21</td> <td align="center">19.9</td> <td
+align="center">0.23</td> <td align="center">4</td> <td
+align="center">238</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/sqrt(N)</td> <td align="center">yes</td>
+<td align="center">1.36</td> <td align="center">20.0</td> <td
+align="center">0.23</td> <td align="center">-</td> <td
+align="center">272</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1</td> <td align="center">no</td> <td
+align="center">1.18</td> <td align="center">20.0</td> <td
+align="center">0.16</td> <td align="center">-</td> <td
+align="center">236</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1</td> <td align="center">yes</td> <td
+align="center">1.39</td> <td align="center">20.0</td> <td
+align="center">0.16</td> <td align="center">-</td> <td
+align="center">278</td> </tr>
+
+<tr> <td align="center" colspan="9"> <hr> </td> </tr>
+
+</table>
+
+<p> A busy server with a non-full connection queue. N is the client
+delivery concurrency. Failed deliveries complete at the TCP level,
+but time out after 300s while waiting for the SMTP greeting. See
+text for a discussion of results. </p>
+
+</blockquote>
+
+<h4> Impact of active server concurrency limiter </h4>
+
+<p> The final concurrency-limited result shows what happens when
+SMTP connections don't time out, but are rejected immediately with
+the Postfix server's smtpd_client_connection_count_limit feature
+(the server replies with a 421 status and disconnects immediately).
+Similar results can be expected with concurrency limiting features
+built into other MTAs or firewalls. For this measurement we specified
+a server concurrency limit and a client initial destination concurrency
+of 5, and a server process limit of 10; all other conditions were
+the same as with the first measurement. The same result would be
+obtained with a FreeBSD or Linux server, because the "pushing back"
+is done entirely by the receiving side. </p>
+
+<blockquote>
+
+<table>
+
+<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
+style</th> <th>connection<br> caching</th> <th>percentage<br>
+deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
+<th>theoretical<br>defer rate</th> </tr>
+
+<tr> <td align="center" colspan="9"> <hr> </td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/N</td> <td align="center">no</td> <td
+align="center">16.5</td> <td align="center">5.17</td> <td
+align="center">0.38</td> <td align="center">1/6</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/N</td> <td align="center">yes</td> <td
+align="center">16.5</td> <td align="center">5.17</td> <td
+align="center">0.38</td> <td align="center">1/6</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/sqrt(N)</td> <td align="center">no</td>
+<td align="center">24.5</td> <td align="center">5.28</td> <td
+align="center">0.45</td> <td align="center">1/4</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1/sqrt(N)</td> <td align="center">yes</td>
+<td align="center">24.3</td> <td align="center">5.28</td> <td
+align="center">0.46</td> <td align="center">1/4</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1</td> <td align="center">no</td> <td
+align="center">49.7</td> <td align="center">5.63</td> <td
+align="center">0.67</td> <td align="center">1/2</td> </tr>
+
+<tr> <td align="center">20</td> <td align="center">5</td> <td
+align="center">1</td> <td align="center">yes</td> <td
+align="center">49.7</td> <td align="center">5.68</td> <td
+align="center">0.70</td> <td align="center">1/2</td> </tr>
+
+<tr> <td align="center" colspan="9"> <hr> </td> </tr>
+
+</table>
+
+<p> A server with active per-client concurrency limiter that replies
+with 421 and disconnects. N is the client delivery concurrency.
+The theoretical defer rate is 1/(1+roundup(1/feedback)). This is
+always 1/2 with the fixed +/-1 feedback per delivery; with the
+concurrency-dependent feedback variants, the defer rate decreases
+with increasing concurrency. See text for a discussion of results.
+</p>
+
+</blockquote>
+
+<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>
+
+<p> All results in the previous sections are based on the first
+delivery runs only; they do not include any second etc. delivery
+attempts. It's also worth noting that the measurements look at
+steady-state behavior only. They don't show what happens when the
+client starts sending at a much higher or lower concurrency.
+</p>
+
+<p> The first two examples show that the effect of feedback
+is negligible when concurrency is limited due to congestion. This
+is because the initial concurrency is already at the client's
+concurrency maximum, and because there is 10-100 times more positive
+than negative feedback. Under these conditions, it is no surprise
+that the contribution from SMTP connection caching is also negligible.
+</p>
+
+<p> In the last example, the old +/-1 feedback per delivery will
+defer 50% of the mail when confronted with an active (anvil-style)
+server concurrency limit, where the server hangs up immediately
+with a 421 status (a TCP-level RST would have the same result).
+Less aggressive feedback mechanisms fare better than more aggressive
+ones. Concurrency-dependent feedback fares even better at higher
+concurrencies than shown here, but has limitations as discussed in
+the next section. </p>
+
+<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>
+
+<p> Less-than-1 feedback is of interest primarily when sending large
+amounts of mail to destinations with active concurrency limiters
+(servers that reply with 421, or firewalls that send RST). When
+sending small amounts of mail per destination, less-than-1 per-delivery
+feedback won't have a noticeable effect on the per-destination
+concurrency, because the number of deliveries to the same destination
+is too small. You might just as well use zero per-delivery feedback
+and stay with the initial per-destination concurrency. And when
+mail deliveries fail due to congestion instead of active concurrency
+limiters, the measurements above show that per-delivery feedback
+has no effect. With large amounts of mail you might just as well
+use zero per-delivery feedback and start with the maximal per-destination
+concurrency. </p>
+
+<p> The scheduler with less-than-1 concurrency
+feedback per delivery solves a problem with servers that have active
+concurrency limiters. This works only because feedback is handled
+in a peculiar manner: positive feedback will increment the concurrency
+by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
+while negative feedback will decrement concurrency by 1 at the
+<b>beginning</b> of such a sequence. This is how Postfix adjusts
+quickly for overshoot without causing lots of mail to be deferred.
+Without this difference in feedback treatment, less-than-1 feedback
+per delivery would defer 50% of the mail, and would be no better
+in this respect than the old +/-1 feedback per delivery. </p>
+
+<p> Unfortunately, the same feature that corrects quickly for
+concurrency overshoot also makes the scheduler more sensitive for
+noisy negative feedback. The reason is that one lonely negative
+feedback event has the same effect as a complete sequence of length
+1/feedback: in both cases delivery concurrency is dropped by 1
+immediately. As a worst-case scenario, consider multiple servers
+behind a load balancer on a single IP address, and no backup MX
+address. When 1 out of K servers fails to complete the SMTP handshake
+or drops the connection, a scheduler with 1/N (N = concurrency)
+feedback stops increasing its concurrency once it reaches a concurrency
+level of about K, even though the good servers behind the load
+balancer are perfectly capable of handling more traffic. </p>
+
+<p> This noise problem gets worse as the amount of positive feedback
+per delivery gets smaller. A compromise is to use fixed less-than-1
+positive feedback values instead of concurrency-dependent positive
+feedback. For example, to tolerate 1 of 4 bad servers in the above
+load balancer scenario, use positive feedback of 1/4 per "good"
+delivery (no connect or handshake error), and use an equal or smaller
+amount of negative feedback per "bad" delivery. The downside of
+using concurrency-independent feedback is that some of the old +/-1
+feedback problems will return at large concurrencies. Sites that
+must deliver mail at non-trivial per-destination concurrencies will
+require special configuration. </p>
+
+<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>
+
+<p> The Postfix 2.5 concurrency scheduler is controlled with the
+following configuration parameters, where "<i>transport</i>_foo"
+provides a transport-specific parameter override. All parameter
+default settings are compatible with earlier Postfix versions. </p>
+
+<blockquote>
+
+<table border="0">
+
+<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
+Description </th> </tr>
+
+<tr> <td colspan="3"> <hr> </td> </tr>
+
+<tr> <td> initial_destination_concurrency<br>
+<i>transport</i>_initial_destination_concurrency </td> <td
+align="center"> all<br> 2.5 </td> <td> Initial per-destination
+delivery concurrency </td> </tr>
+
+<tr> <td> default_destination_concurrency_limit<br>
+<i>transport</i>_destination_concurrency_limit </td> <td align="center">
+all<br> all </td> <td> Maximum per-destination delivery concurrency
+</td> </tr>
+
+<tr> <td> default_destination_concurrency_positive_feedback<br>
+<i>transport</i>_destination_concurrency_positive_feedback </td>
+<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
+feedback amount, per delivery that does not fail with connection
+or handshake failure </td> </tr>
+
+<tr> <td> default_destination_concurrency_negative_feedback<br>
+<i>transport</i>_destination_concurrency_negative_feedback </td>
+<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
+feedback amount, per delivery that fails with connection or handshake
+failure </td> </tr>
+
+<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
+<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
+<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
+pseudo-cohorts after which a destination is declared "dead" and
+delivery is suspended </td> </tr>
+
+<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
+2.5 </td> <td> Enable verbose logging of concurrency scheduler
+activity </td> </tr>
+
+<tr> <td colspan="3"> <hr> </td> </tr>
+
+</table>
+
+</blockquote>
+
+<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
+
+<p>
+
+The following sections describe the new queue manager and its
+preemptive scheduler algorithm. Note that the document was originally
+written to describe the changes between the new queue manager (in
+this text referred to as <tt>nqmgr</tt>, the name it was known by
+before it became the default queue manager) and the old queue manager
+(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
+every so often.
+
+</p>
+
+<p>
+
+This document is divided into sections as follows:
+
+</p>
+
+<ul>
+
+<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
+nqmgr </a>
+
+<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
+up the message </a> - how it is assigned to transports, jobs, peers,
+entries
+
+<li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection
+works </a>
+
+<li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption
+works </a> - what messages may be preempted and how and what messages
+are chosen to preempt them
+
+<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
+limits affect the scheduling algorithm </a>
+
+<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
+limits </a>
+
+</ul>
+
+<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
+nqmgr </a> </h3>
+
+<p>
+
+Let's start by recapitulating the structures and terms used when
+referring to the queue manager and how it operates. Many of these are
+partially described elsewhere, but it is nice to have a coherent
+overview in one place:
+
+</p>
+
+<ul>
+
+<li> <p> Each message structure represents one mail message which
+Postfix is to deliver. The message recipients specify to what
+destinations is the message to be delivered and what transports are
+going to be used for the delivery. </p>
+
+<li> <p> Each recipient entry groups a batch of recipients of one
+message which are all going to be delivered to the same destination
+(and over the same transport).
+</p>
+
+<li> <p> Each transport structure groups everything what is going
+to be delivered by delivery agents dedicated for that transport.
+Each transport maintains a set of queues (describing the destinations
+it shall talk to) and jobs (referencing the messages it shall
+deliver). </p>
+
+<li> <p> Each transport queue (not to be confused with the on-disk
+"active" queue or "incoming" queue) groups everything what is going be
+delivered to given destination (aka nexthop) by its transport. Each
+queue belongs to one transport, so each destination may be referred
+to by several queues, one for each transport. Each queue maintains
+a list of all recipient entries (batches of message recipients)
+which shall be delivered to given destination (the todo list), and
+a list of recipient entries already being delivered by the delivery
+agents (the busy list). </p>
+
+<li> <p> Each queue corresponds to multiple peer structures. Each
+peer structure is like the queue structure, belonging to one transport
+and referencing one destination. The difference is that it lists
+only the recipient entries which all originate from the same message,
+unlike the queue structure, whose entries may originate from various
+messages. For messages with few recipients, there is usually just
+one recipient entry for each destination, resulting in one recipient
+entry per peer. But for large mailing list messages the recipients
+may need to be split to multiple recipient entries, in which case
+the peer structure may list many entries for single destination.
+</p>
+
+<li> <p> Each transport job groups everything it takes to deliver
+one message via its transport. Each job represents one message
+within the context of the transport. The job belongs to one transport
+and message, so each message may have multiple jobs, one for each
+transport. The job groups all the peer structures, which describe
+the destinations the job's message has to be delivered to. </p>
+
+</ul>
+
+<p>
+
+The first four structures are common to both <tt>nqmgr</tt> and
+<tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>.
+
+</p>
+
+<p>
+
+These terms are used extensively in the text below, feel free to
+look up the description above anytime you'll feel you have lost a
+sense what is what.
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
+up the message </a> </h3>
+
+<p>
+
+Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
+the following happens: It reads all necessary information from the
+queue file as <tt>oqmgr</tt> does, and also reads as many recipients
+as possible - more on that later, for now let's just pretend it
+always reads all recipients.
+
+</p>
+
+<p>
+
+Then it resolves the recipients as <tt>oqmgr</tt> does, which
+means obtaining (address, nexthop, transport) triple for each
+recipient. For each triple, it finds the transport; if it does not
+exist yet, it instantiates it (unless it's dead). Within the
+transport, it finds the destination queue for the given nexthop; if it
+does not exist yet, it instantiates it (unless it's dead). The
+triple is then bound to given destination queue. This happens in
+qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
+
+</p>
+
+<p>
+
+Then for each triple which was bound to some queue (and thus
+transport), the program finds the job which represents the message
+within that transport's context; if it does not exist yet, it
+instantiates it. Within the job, it finds the peer which represents
+the bound destination queue within this jobs context; if it does
+not exist yet, it instantiates it. Finally, it stores the address
+from the resolved triple to the recipient entry which is appended
+to both the queue entry list and the peer entry list. The addresses
+for the same nexthop are batched in the entries up to the
+<i>transport</i>_destination_recipient_limit for that transport.
+This happens in qmgr_message_assign(), and apart
+from that it operates with job and peer structures, it is basically the
+same as in <tt>oqmgr</tt>.
+
+</p>
+
+<p>
+
+When the job is instantiated, it is enqueued on the transport's job
+list based on the time its message was picked up by <tt>nqmgr</tt>.
+For first batch of recipients this means it is appended to the end
+of the job list, but the ordering of the job list by the enqueue
+time is important as we will see shortly.
+
+</p>
+
+<p>
+
+[Now you should have a pretty good idea what the state of the
+<tt>nqmgr</tt> is after a couple of messages were picked up, and what the
+relation is between all those job, peer, queue and entry structures.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection
+works </a> </h3>
+
+<p>
+
+Having prepared all those above mentioned structures, the task of
+the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
+one at a time and pass them to the delivery agent for corresponding
+transport. Now how does this work?
+
+</p>
+
+<p>
+
+The first approximation of the new scheduling algorithm is like this:
+
+</p>
+
+<blockquote>
+<pre>
+foreach transport (round-robin-by-transport)
+do
+ if transport busy continue
+ if transport process limit reached continue
+ foreach transport's job (in the order of the transport's job list)
+ do
+ foreach job's peer (round-robin-by-destination)
+ if peer-&gt;queue-&gt;concurrency &lt; peer-&gt;queue-&gt;window
+ return next peer entry.
+ done
+ done
+done
+</pre>
+</blockquote>
+
+<p>
+
+Now what is the "order of the transport's job list"? As we know
+already, the job list is by default kept in the order the message
+was picked up by the <tt>nqmgr</tt>. So by default we get the
+top-level round-robin transport, and within each transport we get
+the FIFO message delivery. The round-robin of the peers by the
+destination is perhaps of little importance in most real-life cases
+(unless the <i>transport</i>_destination_recipient_limit is reached,
+in one job there
+is only one peer structure for each destination), but theoretically
+it makes sure that even within single jobs, destinations are treated
+fairly.
+
+</p>
+
+<p>
+
+[By now you should have a feeling you really know how the scheduler
+works, except for the preemption, under ideal conditions - that is,
+no recipient resource limits and no destination concurrency problems.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption
+works </a> </h3>
+
+<p>
+
+As you might perhaps expect by now, the transport's job list does
+not remain sorted by the job's message enqueue time all the time.
+The most cool thing about <tt>nqmgr</tt> is not the simple FIFO
+delivery, but that it is able to slip mail with little recipients
+past the mailing-list bulk mail. This is what the job preemption
+is about - shuffling the jobs on the transport's job list to get
+the best message delivery rates. Now how is it achieved?
+
+</p>
+
+<p>
+
+First I have to tell you that there are in fact two job lists in
+each transport. One is the scheduler's job list, which the scheduler
+is free to play with, while the other one keeps the jobs always
+listed in the order of the enqueue time and is used for recipient
+pool management we will discuss later. For now, we will deal with
+the scheduler's job list only.
+
+</p>
+
+<p>
+
+So, we have the job list, which is first ordered by the time the
+jobs' messages were enqueued, oldest messages first, the most recently
+picked one at the end. For now, let's assume that there are no
+destination concurrency problems. Without preemption, we pick some
+entry of the first (oldest) job on the queue, assign it to delivery
+agent, pick another one from the same job, assign it again, and so
+on, until all the entries are used and the job is delivered. We
+would then move onto the next job and so on and on. Now how do we
+manage to sneak in some entries from the recently added jobs when
+the first job on the job list belongs to a message going to the
+mailing-list and has thousands of recipient entries?
+
+</p>
+
+<p>
+
+The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
+the delivery time of that first job by some constant for free - it
+is basically the same trick you might remember as "accumulation of
+potential" from the amortized complexity lessons. For example,
+instead of delivering the entries of the first job on the job list
+every time a delivery agent becomes available, we can do it only
+every second time. If you view the moments the delivery agent becomes
+available on a timeline as "delivery slots", then instead of using
+every delivery slot for the first job, we can use only every other
+slot, and still the overall delivery efficiency of the first job
+remains the same. So the delivery <tt>11112222</tt> becomes
+<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
+denotes the free slot). Now what do we do with free slots?
+
+</p>
+
+<p>
+
+As you might have guessed, we will use them for sneaking the mail
+with little recipients in. For example, if we have one four-recipient
+mail followed by four one recipients mail, the delivery sequence
+(that is, the sequence in which the jobs are assigned to the
+delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
+for sneaking in the single recipient mail, but how do we sneak in
+the mail with more than one recipient? Say if we have one four-recipient
+mail followed by two two-recipient mails?
+
+</p>
+
+<p>
+
+The simple answer would be to use delivery sequence <tt>12121313</tt>.
+But the problem is that this does not scale well. Imagine you have
+mail with a thousand recipients followed by mail with a hundred recipients.
+It is tempting to suggest the delivery sequence like <tt>121212....</tt>,
+but alas! Imagine there arrives another mail with say ten recipients.
+But there are no free slots anymore, so it can't slip by, not even
+if it had only one recipient. It will be stuck until the
+hundred-recipient mail is delivered, which really sucks.
+
+</p>
+
+<p>
+
+So, it becomes obvious that while inflating the message to get
+free slots is a great idea, one has to be really careful of how the
+free slots are assigned, otherwise one might corner himself. So,
+how does <tt>nqmgr</tt> really use the free slots?
+
+</p>
+
+<p>
+
+The key idea is that one does not have to generate the free slots
+in a uniform way. The delivery sequence <tt>111...1</tt> is no
+worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
+entries are in the first case selected earlier than in the second
+case, and none is selected later! So it is possible to first
+"accumulate" the free delivery slots and then use them all at once.
+It is even possible to accumulate some, then use them, then accumulate
+some more and use them again, as in <tt>11..1.1</tt> .
+
+</p>
+
+<p>
+
+Let's get back to the one hundred recipient example. We now know
+that we could first accumulate one hundred free slots, and only
+after then to preempt the first job and sneak the one hundred
+recipient mail in. Applying the algorithm recursively, we see the
+hundred recipient job can accumulate ten free delivery slots, and
+then we could preempt it and sneak in the ten-recipient mail...
+Wait wait wait! Could we? Aren't we overinflating the original one
+thousand recipient mail?
+
+</p>
+
+<p>
+
+Well, despite the fact that it looks so at the first glance, another trick will
+allow us to answer "no, we are not!". If we had said that we will
+inflate the delivery time twice at maximum, and then we consider
+every other slot as a free slot, then we would overinflate in case
+of the recursive preemption. BUT! The trick is that if we use only
+every n-th slot as a free slot for n&gt;2, there is always some worst
+inflation factor which we can guarantee not to be breached, even
+if we apply the algorithm recursively. To be precise, if for every
+k&gt;1 normally used slots we accumulate one free delivery slot, than
+the inflation factor is not worse than k/(k-1) no matter how many
+recursive preemptions happen. And it's not worse than (k+1)/k if
+only non-recursive preemption happens. Now, having got through the
+theory and the related math, let's see how <tt>nqmgr</tt> implements
+this.
+
+</p>
+
+<p>
+
+Each job has so called "available delivery slot" counter. Each
+transport has a <i>transport</i>_delivery_slot_cost parameter, which
+defaults to default_delivery_slot_cost parameter which is set to 5
+by default. This is the k from the paragraph above. Each time k
+entries of the job are selected for delivery, this counter is
+incremented by one. Once there are some slots accumulated, a job which
+requires no more than that number of slots to be fully delivered
+can preempt this job.
+
+</p>
+
+<p>
+
+[Well, the truth is, the counter is incremented every time an entry
+is selected and it is divided by k when it is used.
+But to understand, it's good enough to use
+the above approximation of the truth.]
+
+</p>
+
+<p>
+
+OK, so now we know the conditions which must be satisfied so one
+job can preempt another one. But what job gets preempted, how do
+we choose what job preempts it if there are several valid candidates,
+and when does all this exactly happen?
+
+</p>
+
+<p>
+
+The answer for the first part is simple. The job whose entry was
+selected the last time is the so called current job. Normally, it is
+the first job on the scheduler's job list, but destination concurrency
+limits may change this as we will see later. It is always only the
+current job which may get preempted.
+
+</p>
+
+<p>
+
+Now for the second part. The current job has a certain amount of
+recipient entries, and as such may accumulate at maximum some amount
+of available delivery slots. It might have already accumulated some,
+and perhaps even already used some when it was preempted before
+(remember a job can be preempted several times). In either case,
+we know how many are accumulated and how many are left to deliver,
+so we know how many it may yet accumulate at maximum. Every other
+job which may be delivered by less than that number of slots is a
+valid candidate for preemption. How do we choose among them?
+
+</p>
+
+<p>
+
+The answer is - the one with maximum enqueue_time/recipient_entry_count.
+That is, the older the job is, the more we should try to deliver
+it in order to get best message delivery rates. These rates are of
+course subject to how many recipients the message has, therefore
+the division by the recipient (entry) count. No one shall be surprised
+that a message with n recipients takes n times longer to deliver than
+a message with one recipient.
+
+</p>
+
+<p>
+
+Now let's recap the previous two paragraphs. Isn't it too complicated?
+Why don't the candidates come only among the jobs which can be
+delivered within the number of slots the current job already
+accumulated? Why do we need to estimate how much it has yet to
+accumulate? If you found out the answer, congratulate yourself. If
+we did it this simple way, we would always choose the candidate
+with the fewest recipient entries. If there were enough single recipient
+mails coming in, they would always slip by the bulk mail as soon
+as possible, and the two or more recipients mail would never get
+a chance, no matter how long they have been sitting around in the
+job list.
+
+</p>
+
+<p>
+
+This candidate selection has an interesting implication - that when
+we choose the best candidate for preemption (this is done in
+qmgr_choose_candidate()), it may happen that we may not use it for
+preemption immediately. This leads to an answer to the last part
+of the original question - when does the preemption happen?
+
+</p>
+
+<p>
+
+The preemption attempt happens every time next transport's recipient
+entry is to be chosen for delivery. To avoid needless overhead, the
+preemption is not attempted if the current job could never accumulate
+more than <i>transport</i>_minimum_delivery_slots (defaults to
+default_minimum_delivery_slots which defaults to 3). If there are
+already enough accumulated slots to preempt the current job by the
+chosen best candidate, it is done immediately. This basically means
+that the candidate is moved in front of the current job on the
+scheduler's job list and decreasing the accumulated slot counter
+by the amount used by the candidate. If there are not enough slots...
+well, I could say that nothing happens and the another preemption
+is attempted the next time. But that's not the complete truth.
+
+</p>
+
+<p>
+
+The truth is that it turns out that it is not really necessary to
+wait until the jobs counter accumulates all the delivery slots in
+advance. Say we have ten-recipient mail followed by two two-recipient
+mails. If the preemption happened when enough delivery slots accumulate
+(assuming slot cost 2), the delivery sequence becomes
+<tt>11112211113311</tt>. Now what would we get if we would wait
+only for 50% of the necessary slots to accumulate and we promise
+we would wait for the remaining 50% later, after we get back
+to the preempted job? If we use such a slot loan, the delivery sequence
+becomes <tt>11221111331111</tt>. As we can see, it makes it not
+considerably worse for the delivery of the ten-recipient mail, but
+it allows the small messages to be delivered sooner.
+
+</p>
+
+<p>
+
+The concept of these slot loans is where the
+<i>transport</i>_delivery_slot_discount and
+<i>transport</i>_delivery_slot_loan come from (they default to
+default_delivery_slot_discount and default_delivery_slot_loan, whose
+values are by default 50 and 3, respectively). The discount (resp.
+loan) specifies how many percent (resp. how many slots) one "gets
+in advance", when the number of slots required to deliver the best
+candidate is compared with the number of slots the current slot had
+accumulated so far.
+
+</p>
+
+<p>
+
+And that pretty much concludes this chapter.
+
+</p>
+
+<p>
+
+[Now you should have a feeling that you pretty much understand the
+scheduler and the preemption, or at least that you will have
+after you read the last chapter a couple more times. You shall clearly
+see the job list and the preemption happening at its head, in ideal
+delivery conditions. The feeling of understanding shall last until
+you start wondering what happens if some of the jobs are blocked,
+which you might eventually figure out correctly from what had been
+said already. But I would be surprised if your mental image of the
+scheduler's functionality is not completely shattered once you
+start wondering how it works when not all recipients may be read
+in-core. More on that later.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
+limits affect the scheduling algorithm </a> </h3>
+
+<p>
+
+The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
+control as <tt>oqmgr</tt>. Now what happens when the destination
+limits are reached and no more entries for that destination may be
+selected by the scheduler?
+
+</p>
+
+<p>
+
+From the user's point of view it is all simple. If some of the peers
+of a job can't be selected, those peers are simply skipped by the
+entry selection algorithm (the pseudo-code described before) and
+only the selectable ones are used. If none of the peers may be
+selected, the job is declared a "blocker job". Blocker jobs are
+skipped by the entry selection algorithm and they are also excluded
+from the candidates for preemption of the current job. Thus the scheduler
+effectively behaves as if the blocker jobs didn't exist on the job
+list at all. As soon as at least one of the peers of a blocker job
+becomes unblocked (that is, the delivery agent handling the delivery
+of the recipient entry for the given destination successfully finishes),
+the job's blocker status is removed and the job again participates
+in all further scheduler actions normally.
+
+</p>
+
+<p>
+
+So the summary is that the users don't really have to be concerned
+about the interaction of the destination limits and scheduling
+algorithm. It works well on its own and there are no knobs they
+would need to control it.
+
+</p>
+
+<p>
+
+From a programmer's point of view, the blocker jobs complicate the
+scheduler quite a lot. Without them, the jobs on the job list would
+be normally delivered in strict FIFO order. If the current job is
+preempted, the job preempting it is completely delivered unless it
+is preempted itself. Without blockers, the current job is thus
+always either the first job on the job list, or the top of the stack
+of jobs preempting the first job on the job list.
+
+</p>
+
+<p>
+
+The visualization of the job list and the preemption stack without
+blockers would be like this:
+
+</p>
+
+<blockquote>
+<pre>
+first job-&gt; 1--2--3--5--6--8--... &lt;- job list
+on job list |
+ 4 &lt;- preemption stack
+ |
+current job-&gt; 7
+</pre>
+</blockquote>
+
+<p>
+
+In the example above we see that job 1 was preempted by job 4 and
+then job 4 was preempted by job 7. After job 7 is completed, remaining
+entries of job 4 are selected, and once they are all selected, job
+1 continues.
+
+</p>
+
+<p>
+
+As we see, it's all very clean and straightforward. Now how does
+this change because of blockers?
+
+</p>
+
+<p>
+
+The answer is: a lot. Any job may become a blocker job at any time,
+and also become a normal job again at any time. This has several
+important implications:
+
+</p>
+
+<ol>
+
+<li> <p>
+
+The jobs may be completed in arbitrary order. For example, in the
+example above, if the current job 7 becomes blocked, the next job
+4 may complete before the job 7 becomes unblocked again. Or if both
+7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
+and is completed, then 2 is completed and only after that 4 becomes
+unblocked and is completed... You get the idea.
+
+</p>
+
+<p>
+
+[Interesting side note: even when jobs are delivered out of order,
+from a single destination's point of view the jobs are still delivered
+in the expected order (that is, FIFO unless there was some preemption
+involved). This is because whenever a destination queue becomes
+unblocked (the destination limit allows selection of more recipient
+entries for that destination), all jobs which have peers for that
+destination are unblocked at once.]
+
+</p>
+
+<li> <p>
+
+The idea of the preemption stack at the head of the job list is
+gone. That is, it must be possible to preempt any job on the job
+list. For example, if the jobs 7, 4, 1 and 2 in the example above
+become all blocked, job 3 becomes the current job. And of course
+we do not want the preemption to be affected by the fact that there
+are some blocked jobs or not. Therefore, if it turns out that job
+3 might be preempted by job 6, the implementation shall make it
+possible.
+
+</p>
+
+<li> <p>
+
+The idea of the linear preemption stack itself is gone. It's no
+longer true that one job is always preempted by only one job at one
+time (that is directly preempted, not counting the recursively
+nested jobs). For example, in the example above, job 1 is directly
+preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
+blocked, and job 4 is being delivered. If it accumulates enough
+delivery slots, it is natural that it might be preempted for example
+by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
+same time.
+
+</p>
+
+</ol>
+
+<p>
+
+Now combine the points 2) and 3) with point 1) again and you realize
+that the relations on the once linear job list became pretty
+complicated. If we extend the point 3) example: jobs 7 and 8 preempt
+job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
+huh?
+
+</p>
+
+<p>
+
+If I illustrate the relations after the above mentioned examples
+(but those in point 1), the situation would look like this:
+
+</p>
+
+<blockquote>
+<pre>
+ v- parent
+
+adoptive parent -&gt; 1--2--3--5--... &lt;- "stack" level 0
+ | |
+parent gone -&gt; ? 6 &lt;- "stack" level 1
+ / \
+children -&gt; 7 8 ^- child &lt;- "stack" level 2
+
+ ^- siblings
+</pre>
+</blockquote>
+
+<p>
+
+Now how does <tt>nqmgr</tt> deal with all these complicated relations?
+
+</p>
+
+<p>
+
+Well, it maintains them all as described, but fortunately, all these
+relations are necessary only for the purpose of proper counting of
+available delivery slots. For the purpose of ordering the jobs for
+entry selection, the original rule still applies: "the job preempting
+the current job is moved in front of the current job on the job
+list". So for entry selection purposes, the job relations remain
+as simple as this:
+
+</p>
+
+<blockquote>
+<pre>
+7--8--1--2--6--3--5--.. &lt;- scheduler's job list order
+</pre>
+</blockquote>
+
+<p>
+
+The job list order and the preemption parent/child/siblings relations
+are maintained separately. And because the selection works only
+with the job list, you can happily forget about those complicated
+relations unless you want to study the <tt>nqmgr</tt> sources. In
+that case the text above might provide some helpful introduction
+to the problem domain. Otherwise I suggest you just forget about
+all this and stick with the user's point of view: the blocker jobs
+are simply ignored.
+
+</p>
+
+<p>
+
+[By now, you should have a feeling that there are more things going
+on under the hood than you ever wanted to know. You decide that
+forgetting about this chapter is the best you can do for the sake
+of your mind's health and you basically stick with the idea how the
+scheduler works in ideal conditions, when there are no blockers,
+which is good enough.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
+limits </a> </h3>
+
+<p>
+
+When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
+that all recipients of all messages in the "active" queue are completely
+read into memory. This is simply not true. There is an upper
+bound on the amount of memory the <tt>nqmgr</tt> may use, and
+therefore it must impose some limits on the information it may store
+in memory at any given time.
+
+</p>
+
+<p>
+
+First of all, not all messages may be read in-core at once. At any
+time, only qmgr_message_active_limit messages may be read in-core
+at maximum. When read into memory, the messages are picked from the
+"incoming" and "deferred" queues and moved to the "active" queue
+(incoming having priority), so if there are more than
+qmgr_message_active_limit messages queued in the "active" queue, the
+rest will have to wait until (some of) the messages in the "active" queue
+are completely delivered (or deferred).
+
+</p>
+
+<p>
+
+Even with the limited amount of in-core messages, there is another
+limit which must be imposed in order to avoid memory exhaustion.
+Each message may contain a huge number of recipients (tens or hundreds
+of thousands are not uncommon), so if <tt>nqmgr</tt> read all
+recipients of all messages in the "active" queue, it may easily run
+out of memory. Therefore there must be some upper bound on the
+amount of message recipients which are read into memory at the
+same time.
+
+</p>
+
+<p>
+
+Before discussing how exactly <tt>nqmgr</tt> implements the recipient
+limits, let's see how the sole existence of the limits themselves
+affects the <tt>nqmgr</tt> and its scheduler.
+
+</p>
+
+<p>
+
+The message limit is straightforward - it just limits the size of
+the
+lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
+message can preempt the current one. Messages not in the "active" queue
+are simply not considered at all.
+
+</p>
+
+<p>
+
+The recipient limit complicates more things. First of all, the
+message reading code must support reading the recipients in batches,
+which among other things means accessing the queue file several
+times and continuing where the last recipient batch ended. This is
+invoked by the scheduler whenever the current job has space for more
+recipients, subject to transport's refill_limit and refill_delay parameters.
+It is also done any time when all
+in-core recipients of the message are dealt with (which may also
+mean they were deferred) but there are still more in the queue file.
+
+</p>
+
+<p>
+
+The second complication is that with some recipients left unread
+in the queue file, the scheduler can't operate with exact counts
+of recipient entries. With unread recipients, it is not clear how
+many recipient entries there will be, as they are subject to
+per-destination grouping. It is not even clear to what transports
+(and thus jobs) the recipients will be assigned. And with messages
+coming from the "deferred" queue, it is not even clear how many unread
+recipients are still to be delivered. This all means that the
+scheduler must use only estimates of how many recipients entries
+there will be. Fortunately, it is possible to estimate the minimum
+and maximum correctly, so the scheduler can always err on the safe
+side. Obviously, the better the estimates, the better the results, so
+it is best when we are able to read all recipients in-core and turn
+the estimates into exact counts, or at least try to read as many
+as possible to make the estimates as accurate as possible.
+
+</p>
+
+<p>
+
+The third complication is that it is no longer true that the scheduler
+is done with a job once all of its in-core recipients are delivered.
+It is possible that the job will be revived later, when another
+batch of recipients is read in core. It is also possible that some
+jobs will be created for the first time long after the first batch
+of recipients was read in core. The <tt>nqmgr</tt> code must be
+ready to handle all such situations.
+
+</p>
+
+<p>
+
+And finally, the fourth complication is that the <tt>nqmgr</tt>
+code must somehow impose the recipient limit itself. Now how does
+it achieve it?
+
+</p>
+
+<p>
+
+Perhaps the easiest solution would be to say that each message may
+have at maximum X recipients stored in-core, but such a solution would
+be poor for several reasons. With reasonable qmgr_message_active_limit
+values, the X would have to be quite low to maintain a reasonable
+memory footprint. And with low X lots of things would not work well.
+The <tt>nqmgr</tt> would have problems to use the
+<i>transport</i>_destination_recipient_limit efficiently. The
+scheduler's preemption would be suboptimal as the recipient count
+estimates would be inaccurate. The message queue file would have
+to be accessed many times to read in more recipients again and
+again.
+
+</p>
+
+<p>
+
+Therefore it seems reasonable to have a solution which does not use
+a limit imposed on a per-message basis, but which maintains a pool
+of available recipient slots, which can be shared among all messages
+in the most efficient manner. And as we do not want separate
+transports to compete for resources whenever possible, it seems
+appropriate to maintain such a recipient pool for each transport
+separately. This is the general idea, now how does it work in
+practice?
+
+</p>
+
+<p>
+
+First we have to solve a little chicken-and-egg problem. If we want
+to use the per-transport recipient pools, we first need to know to
+what transport(s) the message is assigned. But we will find that
+out only after we first read in the recipients. So it is obvious
+that we first have to read in some recipients, use them to find out
+to what transports the message is to be assigned, and only after
+that can we use the per-transport recipient pools.
+
+</p>
+
+<p>
+
+Now how many recipients shall we read for the first time? This is
+what qmgr_message_recipient_minimum and qmgr_message_recipient_limit
+values control. The qmgr_message_recipient_minimum value specifies
+how many recipients of each message we will read the first time,
+no matter what. It is necessary to read at least one recipient
+before we can assign the message to a transport and create the first
+job. However, reading only qmgr_message_recipient_minimum recipients
+even if there are only few messages with few recipients in-core would
+be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit
+recipients in-core so far, the first batch of recipients may be
+larger than qmgr_message_recipient_minimum - as large as is required
+to reach the qmgr_message_recipient_limit limit.
+
+</p>
+
+<p>
+
+Once the first batch of recipients was read in core and the message
+jobs were created, the size of the subsequent recipient batches (if
+any - of course it's best when all recipients are read in one batch)
+is based solely on the position of the message jobs on their
+corresponding transports' job lists. Each transport has a pool of
+<i>transport</i>_recipient_limit recipient slots which it can
+distribute among its jobs (how this is done is described later).
+The subsequent recipient batch may be as large as the sum of all
+recipient slots of all jobs of the message permits (plus the
+qmgr_message_recipient_minimum amount which always applies).
+
+</p>
+
+<p>
+
+For example, if a message has three jobs, the first with 1 recipient
+still in-core and 4 recipient slots, the second with 5 recipients in-core
+and 5 recipient slots, and the third with 2 recipients in-core and 0
+recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 jobs'
+recipients slots in total. This means that we could immediately
+read 2+qmgr_message_recipient_minimum more recipients of that message
+in core.
+
+</p>
+
+<p>
+
+The above example illustrates several things which might be worth
+mentioning explicitly: first, note that although the per-transport
+slots are assigned to particular jobs, we can't guarantee that once
+the next batch of recipients is read in core, that the corresponding
+amounts of recipients will be assigned to those jobs. The jobs lend
+its slots to the message as a whole, so it is possible that some
+jobs end up sponsoring other jobs of their message. For example,
+if in the example above the 2 newly read recipients were assigned
+to the second job, the first job sponsored the second job with 2
+slots. The second notable thing is the third job, which has more
+recipients in-core than it has slots. Apart from sponsoring by other
+job we just saw it can be result of the first recipient batch, which
+is sponsored from global recipient pool of qmgr_message_recipient_limit
+recipients. It can be also sponsored from the message recipient
+pool of qmgr_message_recipient_minimum recipients.
+
+</p>
+
+<p>
+
+Now how does each transport distribute the recipient slots among
+its jobs? The strategy is quite simple. As most scheduler activity
+happens on the head of the job list, it is our intention to make
+sure that the scheduler has the best estimates of the recipient
+counts for those jobs. As we mentioned above, this means that we
+want to try to make sure that the messages of those jobs have all
+recipients read in-core. Therefore the transport distributes the
+slots "along" the job list from start to end. In this case the job
+list sorted by message enqueue time is used, because it doesn't
+change over time as the scheduler's job list does.
+
+</p>
+
+<p>
+
+More specifically, each time a job is created and appended to the
+job list, it gets all unused recipient slots from its transport's
+pool. It keeps them until all recipients of its message are read.
+When this happens, all unused recipient slots are transferred to
+the next job (which is now in fact the first such job) on the job
+list which still has some recipients unread, or eventually back to
+the transport pool if there is no such job. Such a transfer then also
+happens whenever a recipient entry of that job is delivered.
+
+</p>
+
+<p>
+
+There is also a scenario when a job is not appended to the end of
+the job list (for example it was created as a result of a second or
+later recipient batch). Then it works exactly as above, except that
+if it was put in front of the first unread job (that is, the job
+of a message which still has some unread recipients in the queue file),
+that job is first forced to return all of its unused recipient slots
+to the transport pool.
+
+</p>
+
+<p>
+
+The algorithm just described leads to the following state: The first
+unread job on the job list always gets all the remaining recipient
+slots of that transport (if there are any). The jobs queued before
+this job are completely read (that is, all recipients of their
+message were already read in core) and have at maximum as many slots
+as they still have recipients in-core (the maximum is there because
+of the sponsoring mentioned before) and the jobs after this job get
+nothing from the transport recipient pool (unless they got something
+before and then the first unread job was created and enqueued in
+front of them later - in such a case, they also get at maximum as many
+slots as they have recipients in-core).
+
+</p>
+
+<p>
+
+Things work fine in such a state for most of the time, because the
+current job is either completely read in-core or has as many recipient
+slots as there are, but there is one situation which we still have
+to take care of specially. Imagine if the current job is preempted
+by some unread job from the job list and there are no more recipient
+slots available, so this new current job could read only batches
+of qmgr_message_recipient_minimum recipients at a time. This would
+really degrade performance. For this reason, each transport has an
+extra pool of <i>transport</i>_extra_recipient_limit recipient
+slots, dedicated exactly for this situation. Each time an unread
+job preempts the current job, it gets half of the remaining recipient
+slots from the normal pool and this extra pool.
+
+</p>
+
+<p>
+
+And that's it. It sure does sound pretty complicated, but fortunately
+most people don't really have to care exactly how it works as long
+as it works. Perhaps the only important things to know for most
+people are the following upper bound formulas:
+
+</p>
+
+<p>
+
+Each transport has at maximum
+
+</p>
+
+<blockquote>
+<pre>
+max(
+qmgr_message_recipient_minimum * qmgr_message_active_limit
++ *_recipient_limit + *_extra_recipient_limit,
+qmgr_message_recipient_limit
+)
+</pre>
+</blockquote>
+
+<p>
+
+recipients in core.
+
+</p>
+
+<p>
+
+The total amount of recipients in core is
+
+</p>
+
+<blockquote>
+<pre>
+max(
+qmgr_message_recipient_minimum * qmgr_message_active_limit
++ sum( *_recipient_limit + *_extra_recipient_limit ),
+qmgr_message_recipient_limit
+)
+</pre>
+</blockquote>
+
+<p>
+
+where the sum is over all used transports.
+
+</p>
+
+<p>
+
+And this terribly complicated chapter concludes the documentation
+of the <tt>nqmgr</tt> scheduler.
+
+</p>
+
+<p>
+
+[By now you should theoretically know the <tt>nqmgr</tt> scheduler
+inside out. In practice, you still hope that you will never have
+to really understand the last or last two chapters completely, and
+fortunately most people really won't. Understanding how the scheduler
+works in ideal conditions is more than good enough for the vast majority
+of users.]
+
+</p>
+
+<h2> <a name="credits"> Credits </a> </h2>
+
+<ul>
+
+<li> Wietse Venema designed and implemented the initial queue manager
+with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
+feedback.
+
+<li> Patrik Rak designed and implemented preemption where mail with
+fewer recipients can slip past mail with more recipients in a
+controlled manner, and wrote up its documentation.
+
+<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
+Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
+behavior. This is when K/N feedback was reviewed (N = concurrency).
+The discussion ended without a good solution for both negative
+feedback and dead site detection.
+
+<li> Victor Duchovni resumed work on concurrency feedback in the
+context of concurrency-limited servers.
+
+<li> Wietse Venema then re-designed the concurrency scheduler in
+terms of the simplest possible concepts: less-than-1 concurrency
+feedback per delivery, forward and reverse concurrency feedback
+hysteresis, and pseudo-cohort failure. At this same time, concurrency
+feedback was separated from dead site detection.
+
+<li> These simplifications, and their modular implementation, helped
+to develop further insights into the different roles that positive
+and negative concurrency feedback play, and helped to identify some
+worst-case scenarios.
+
+</ul>
+
+</body>
+
+</html>