diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 16:18:56 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 16:18:56 +0000 |
commit | b7c15c31519dc44c1f691e0466badd556ffe9423 (patch) | |
tree | f944572f288bab482a615e09af627d9a2b6727d8 /proto/SCHEDULER_README.html | |
parent | Initial commit. (diff) | |
download | postfix-b7c15c31519dc44c1f691e0466badd556ffe9423.tar.xz postfix-b7c15c31519dc44c1f691e0466badd556ffe9423.zip |
Adding upstream version 3.7.10.upstream/3.7.10
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'proto/SCHEDULER_README.html')
-rw-r--r-- | proto/SCHEDULER_README.html | 1839 |
1 files changed, 1839 insertions, 0 deletions
diff --git a/proto/SCHEDULER_README.html b/proto/SCHEDULER_README.html new file mode 100644 index 0000000..d80b8f4 --- /dev/null +++ b/proto/SCHEDULER_README.html @@ -0,0 +1,1839 @@ +<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" + "http://www.w3.org/TR/html4/loose.dtd"> + +<html> + +<head> + +<title>Postfix Queue Scheduler</title> + +<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + +</head> + +<body> + +<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix +Queue Scheduler</h1> + +<hr> + +<h2> Disclaimer </h2> + +<p> Many of the <i>transport</i>-specific configuration parameters +discussed in this document will not show up in "postconf" command +output before Postfix version 2.9. This limitation applies to many +parameters whose name is a combination of a master.cf service name +such as "relay" and a built-in suffix such as +"_destination_concurrency_limit". </p> + +<h2> Overview </h2> + +<p> The queue manager is by far the most complex part of the Postfix +mail system. It schedules delivery of new mail, retries failed +deliveries at specific times, and removes mail from the queue after +the last delivery attempt. There are two major classes of mechanisms +that control the operation of the queue manager. </p> + +<p> Topics covered by this document: </p> + +<ul> + +<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned +with the number of concurrent deliveries to a specific destination, +including decisions on when to suspend deliveries after persistent +failures. + +<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with +the selection of email messages and recipients for a given destination. + +<li> <a href="#credits"> Credits</a>, something this document would not be +complete without. + +</ul> + +<!-- + +<p> Once started, the qmgr(8) process runs until "postfix reload" +or "postfix stop". As a persistent process, the queue manager has +to meet strict requirements with respect to code correctness and +robustness. Unlike non-persistent daemon processes, the queue manager +cannot benefit from Postfix's process rejuvenation mechanism that +limit the impact from resource leaks and other coding errors +(translation: replacing a process after a short time covers up bugs +before they can become a problem). </p> + +--> + +<h2> <a name="concurrency"> Concurrency scheduling </a> </h2> + +<p> The following sections document the Postfix 2.5 concurrency +scheduler, after a discussion of the limitations of the earlier +concurrency scheduler. This is followed by results of medium-concurrency +experiments, and a discussion of trade-offs between performance and +robustness. </p> + +<p> The material is organized as follows: </p> + +<ul> + +<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing +concurrency scheduler </a> + +<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5 +concurrency feedback algorithm </a> + +<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead +destination" detection algorithm </a> + +<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5 +concurrency scheduler </a> + +<li> <a href="#concurrency_results"> Results for delivery to +concurrency limited servers </a> + +<li> <a href="#concurrency_discussion"> Discussion of concurrency +limited server results </a> + +<li> <a href="#concurrency_limitations"> Limitations of less-than-1 +per delivery feedback </a> + +<li> <a href="#concurrency_config"> Concurrency configuration +parameters </a> + +</ul> + +<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing +concurrency scheduler </a> </h3> + +<p> From the start, Postfix has used a simple but robust algorithm +where the per-destination delivery concurrency is decremented by 1 +after delivery failed due to connection or handshake failure, and +incremented by 1 otherwise. Of course the concurrency is never +allowed to exceed the maximum per-destination concurrency limit. +And when a destination's concurrency level drops to zero, the +destination is declared "dead" and delivery is suspended. </p> + +<p> Drawbacks of +/-1 concurrency feedback per delivery are: <p> + +<ul> + +<li> <p> Overshoot due to exponential delivery concurrency growth +with each pseudo-cohort(*). This can be an issue with high-concurrency +channels. For example, with the default initial concurrency of 5, +concurrency would proceed over time as (5-10-20). </p> + +<li> <p> Throttling down to zero concurrency after a single +pseudo-cohort(*) failure. This was especially an issue with +low-concurrency channels where a single failure could be sufficient +to mark a destination as "dead", causing the suspension of further +deliveries to the affected destination. </p> + +</ul> + +<p> (*) A pseudo-cohort is a number of delivery requests equal to +a destination's delivery concurrency. </p> + +<p> The revised concurrency scheduler has a highly modular structure. +It uses separate mechanisms for per-destination concurrency control +and for "dead destination" detection. The concurrency control in +turn is built from two separate mechanisms: it supports less-than-1 +feedback per delivery to allow for more gradual concurrency +adjustments, and it uses feedback hysteresis to suppress concurrency +oscillations. And instead of waiting for delivery concurrency to +throttle down to zero, a destination is declared "dead" after a +configurable number of pseudo-cohorts reports connection or handshake +failure. </p> + +<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5 +concurrency feedback algorithm </a> </h3> + +<p> We want to increment a destination's delivery concurrency when +some (not necessarily consecutive) number of deliveries complete +without connection or handshake failure. This is implemented with +positive feedback g(N) where N is the destination's delivery +concurrency. With g(N)=1 feedback per delivery, concurrency increases +by 1 after each positive feedback event; this gives us the old +scheduler's exponential growth in time. With g(N)=1/N feedback per +delivery, concurrency increases by 1 after an entire pseudo-cohort +N of positive feedback reports; this gives us linear growth in time. +Less-than-1 feedback per delivery and integer truncation naturally +give us hysteresis, so that transitions to larger concurrency happen +every 1/g(N) positive feedback events. </p> + +<p> We want to decrement a destination's delivery concurrency when +some (not necessarily consecutive) number of deliveries complete +after connection or handshake failure. This is implemented with +negative feedback f(N) where N is the destination's delivery +concurrency. With f(N)=1 feedback per delivery, concurrency decreases +by 1 after each negative feedback event; this gives us the old +scheduler's behavior where concurrency is throttled down dramatically +after a single pseudo-cohort failure. With f(N)=1/N feedback per +delivery, concurrency backs off more gently. Again, less-than-1 +feedback per delivery and integer truncation naturally give us +hysteresis, so that transitions to lower concurrency happen every +1/f(N) negative feedback events. </p> + +<p> However, with negative feedback we introduce a subtle twist. +We "reverse" the negative hysteresis cycle so that the transition +to lower concurrency happens at the <b>beginning</b> of a sequence +of 1/f(N) negative feedback events. Otherwise, a correction for +overload would be made too late. This makes the choice of f(N) +relatively unimportant, as borne out by measurements later in this +document. </p> + +<p> In summary, the main ingredients for the Postfix 2.5 concurrency +feedback algorithm are a) the option of less-than-1 positive feedback +per delivery to avoid overwhelming servers, b) the option of +less-than-1 negative feedback per delivery to avoid giving up too +fast, c) feedback hysteresis to avoid rapid oscillation, and d) a +"reverse" hysteresis cycle for negative feedback, so that it can +correct for overload quickly. </p> + +<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3> + +<p> We want to suspend deliveries to a specific destination after +some number of deliveries suffers connection or handshake failure. +The old scheduler declares a destination "dead" when negative (-1) +feedback throttles the delivery concurrency down to zero. With +less-than-1 feedback per delivery, this throttling down would +obviously take too long. We therefore have to separate "dead +destination" detection from concurrency feedback. This is implemented +by introducing the concept of pseudo-cohort failure. The Postfix +2.5 concurrency scheduler declares a destination "dead" after a +configurable number of pseudo-cohorts suffers from connection or +handshake failures. The old scheduler corresponds to the special +case where the pseudo-cohort failure limit is equal to 1. </p> + +<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3> + +<p> The pseudo code shows how the ideas behind new concurrency +scheduler are implemented as of November 2007. The actual code can +be found in the module qmgr/qmgr_queue.c. </p> + +<pre> +Types: + Each destination has one set of the following variables + int concurrency + double success + double failure + double fail_cohorts + +Feedback functions: + N is concurrency; x, y are arbitrary numbers in [0..1] inclusive + positive feedback: g(N) = x/N | x/sqrt(N) | x + negative feedback: f(N) = y/N | y/sqrt(N) | y + +Initialization: + concurrency = initial_concurrency + success = 0 + failure = 0 + fail_cohorts = 0 + +After success: + fail_cohorts = 0 + Be prepared for feedback > hysteresis, or rounding error + success += g(concurrency) + while (success >= 1) Hysteresis 1 + concurrency += 1 Hysteresis 1 + failure = 0 + success -= 1 Hysteresis 1 + Be prepared for overshoot + if (concurrency > concurrency limit) + concurrency = concurrency limit + +Safety: + Don't apply positive feedback unless + concurrency < busy_refcount + init_dest_concurrency + otherwise negative feedback effect could be delayed + +After failure: + if (concurrency > 0) + fail_cohorts += 1.0 / concurrency + if (fail_cohorts > cohort_failure_limit) + concurrency = 0 + if (concurrency > 0) + Be prepared for feedback > hysteresis, rounding errors + failure -= f(concurrency) + while (failure < 0) + concurrency -= 1 Hysteresis 1 + failure += 1 Hysteresis 1 + success = 0 + Be prepared for overshoot + if (concurrency < 1) + concurrency = 1 +</pre> + +<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3> + +<p> Discussions about the concurrency scheduler redesign started +early 2004, when the primary goal was to find alternatives that did +not exhibit exponential growth or rapid concurrency throttling. No +code was implemented until late 2007, when the primary concern had +shifted towards better handling of server concurrency limits. For +this reason we measure how well the new scheduler does this +job. The table below compares mail delivery performance of the old ++/-1 feedback per delivery with several less-than-1 feedback +functions, for different limited-concurrency server scenarios. +Measurements were done with a FreeBSD 6.2 client and with FreeBSD +6.2 and various Linux servers. </p> + +<p> Server configuration: </p> + +<ul> <li> The mail flow was slowed down with 1 second latency per +recipient ("smtpd_client_restrictions = sleep 1"). The purpose was +to make results less dependent on hardware details, by avoiding +slow-downs by queue file I/O, logging I/O, and network I/O. + +<li> Concurrency was limited by the server process limit +("default_process_limit = 5" and "smtpd_client_event_limit_exceptions += static:all"). Postfix was stopped and started after changing the +process limit, because the same number is also used as the backlog +argument to the listen(2) system call, and "postfix reload" does +not re-issue this call. + +<li> Mail was discarded with "local_recipient_maps = static:all" and +"local_transport = discard". The discard action in access maps or +header/body checks +could not be used as it fails to update the in_flow_delay counters. + +</ul> + +<p> Client configuration: </p> + +<ul> + +<li> Queue file overhead was minimized by sending one message to a +virtual alias that expanded into 2000 different remote recipients. +All recipients were accounted for according to the maillog file. +The virtual_alias_expansion_limit setting was increased to avoid +complaints from the cleanup(8) server. + +<li> The number of deliveries was maximized with +"smtp_destination_recipient_limit = 2". A smaller limit would cause +Postfix to schedule the concurrency per recipient instead of domain, +which is not what we want. + +<li> Maximum concurrency was limited with +"smtp_destination_concurrency_limit = 20", and +initial_destination_concurrency was set to the same value. + +<li> The positive and negative concurrency feedback hysteresis was +1. Concurrency was incremented by 1 at the END of 1/feedback steps +of positive feedback, and was decremented by 1 at the START of +1/feedback steps of negative feedback. + +<li> The SMTP client used the default 30s SMTP connect timeout and +300s SMTP greeting timeout. + +</ul> + +<h4> Impact of the 30s SMTP connect timeout </h4> + +<p> The first results are for a FreeBSD 6.2 server, where our +artificially low listen(2) backlog results in a very short kernel +queue for established connections. The table shows that all deferred +deliveries failed due to a 30s connection timeout, and none failed +due to a server greeting timeout. This measurement simulates what +happens when the server's connection queue is completely full under +load, and the TCP engine drops new connections. </p> + +<blockquote> + +<table> + +<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> +style</th> <th>connection<br> caching</th> <th>percentage<br> +deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> +<th colspan=2>timed-out in<br> connect/greeting </th> </tr> + +<tr> <td align="center" colspan="9"> <hr> </td> </tr> + +<tr><td align="center">20</td> <td align="center">5</td> <td +align="center">1/N</td> <td align="center">no</td> <td +align="center">9.9</td> <td align="center">19.4</td> <td +align="center">0.49</td> <td align="center">198</td> <td +align="center">-</td> </tr> + +<tr><td align="center">20</td> <td align="center">5</td> <td +align="center">1/N</td> <td align="center">yes</td> <td +align="center">10.3</td> <td align="center">19.4</td> <td +align="center">0.49</td> <td align="center">206</td> <td +align="center">-</td> </tr> + +<tr><td align="center">20</td> <td align="center">5</td> <td +align="center">1/sqrt(N)</td> <td align="center">no</td> +<td align="center">10.4</td> <td align="center">19.6</td> <td +align="center">0.59</td> <td align="center">208</td> <td +align="center">-</td> </tr> + +<tr><td align="center">20</td> <td align="center">5</td> <td +align="center">1/sqrt(N)</td> <td align="center">yes</td> +<td align="center">10.6</td> <td align="center">19.6</td> <td +align="center">0.61</td> <td align="center">212</td> <td +align="center">-</td> </tr> + +<tr><td align="center">20</td> <td align="center">5</td> <td +align="center">1</td> <td align="center">no</td> <td +align="center">10.1</td> <td align="center">19.5</td> <td +align="center">1.29</td> <td align="center">202</td> <td +align="center">-</td> </tr> + +<tr><td align="center">20</td> <td align="center">5</td> <td +align="center">1</td> <td align="center">yes</td> <td +align="center">10.8</td> <td align="center">19.3</td> <td +align="center">1.57</td> <td align="center">216</td> <td +align="center">-</td> </tr> + +<tr> <td align="center" colspan="9"> <hr> </td> </tr> + +</table> + +<p> A busy server with a completely full connection queue. N is +the client delivery concurrency. Failed deliveries time out after +30s without completing the TCP handshake. See text for a discussion +of results. </p> + +</blockquote> + +<h4> Impact of the 300s SMTP greeting timeout </h4> + +<p> The next table shows results for a Fedora Core 8 server (results +for RedHat 7.3 are identical). In this case, the artificially small +listen(2) backlog argument does not impact our measurement. The +table shows that practically all deferred deliveries fail after the +300s SMTP greeting timeout. As these timeouts were 10x longer than +with the first measurement, we increased the recipient count (and +thus the running time) by a factor of 10 to keep the results +comparable. The deferred mail percentages are a factor 10 lower +than with the first measurement, because the 1s per-recipient delay +was 1/300th of the greeting timeout instead of 1/30th of the +connection timeout. </p> + +<blockquote> + +<table> + +<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> +style</th> <th>connection<br> caching</th> <th>percentage<br> +deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> +<th colspan=2>timed-out in<br> connect/greeting </th> </tr> + +<tr> <td align="center" colspan="9"> <hr> </td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/N</td> <td align="center">no</td> <td +align="center">1.16</td> <td align="center">19.8</td> <td +align="center">0.37</td> <td align="center">-</td> <td +align="center">230</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/N</td> <td align="center">yes</td> <td +align="center">1.36</td> <td align="center">19.8</td> <td +align="center">0.36</td> <td align="center">-</td> <td +align="center">272</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/sqrt(N)</td> <td align="center">no</td> +<td align="center">1.21</td> <td align="center">19.9</td> <td +align="center">0.23</td> <td align="center">4</td> <td +align="center">238</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/sqrt(N)</td> <td align="center">yes</td> +<td align="center">1.36</td> <td align="center">20.0</td> <td +align="center">0.23</td> <td align="center">-</td> <td +align="center">272</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1</td> <td align="center">no</td> <td +align="center">1.18</td> <td align="center">20.0</td> <td +align="center">0.16</td> <td align="center">-</td> <td +align="center">236</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1</td> <td align="center">yes</td> <td +align="center">1.39</td> <td align="center">20.0</td> <td +align="center">0.16</td> <td align="center">-</td> <td +align="center">278</td> </tr> + +<tr> <td align="center" colspan="9"> <hr> </td> </tr> + +</table> + +<p> A busy server with a non-full connection queue. N is the client +delivery concurrency. Failed deliveries complete at the TCP level, +but time out after 300s while waiting for the SMTP greeting. See +text for a discussion of results. </p> + +</blockquote> + +<h4> Impact of active server concurrency limiter </h4> + +<p> The final concurrency-limited result shows what happens when +SMTP connections don't time out, but are rejected immediately with +the Postfix server's smtpd_client_connection_count_limit feature +(the server replies with a 421 status and disconnects immediately). +Similar results can be expected with concurrency limiting features +built into other MTAs or firewalls. For this measurement we specified +a server concurrency limit and a client initial destination concurrency +of 5, and a server process limit of 10; all other conditions were +the same as with the first measurement. The same result would be +obtained with a FreeBSD or Linux server, because the "pushing back" +is done entirely by the receiving side. </p> + +<blockquote> + +<table> + +<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> +style</th> <th>connection<br> caching</th> <th>percentage<br> +deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> +<th>theoretical<br>defer rate</th> </tr> + +<tr> <td align="center" colspan="9"> <hr> </td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/N</td> <td align="center">no</td> <td +align="center">16.5</td> <td align="center">5.17</td> <td +align="center">0.38</td> <td align="center">1/6</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/N</td> <td align="center">yes</td> <td +align="center">16.5</td> <td align="center">5.17</td> <td +align="center">0.38</td> <td align="center">1/6</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/sqrt(N)</td> <td align="center">no</td> +<td align="center">24.5</td> <td align="center">5.28</td> <td +align="center">0.45</td> <td align="center">1/4</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1/sqrt(N)</td> <td align="center">yes</td> +<td align="center">24.3</td> <td align="center">5.28</td> <td +align="center">0.46</td> <td align="center">1/4</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1</td> <td align="center">no</td> <td +align="center">49.7</td> <td align="center">5.63</td> <td +align="center">0.67</td> <td align="center">1/2</td> </tr> + +<tr> <td align="center">20</td> <td align="center">5</td> <td +align="center">1</td> <td align="center">yes</td> <td +align="center">49.7</td> <td align="center">5.68</td> <td +align="center">0.70</td> <td align="center">1/2</td> </tr> + +<tr> <td align="center" colspan="9"> <hr> </td> </tr> + +</table> + +<p> A server with active per-client concurrency limiter that replies +with 421 and disconnects. N is the client delivery concurrency. +The theoretical defer rate is 1/(1+roundup(1/feedback)). This is +always 1/2 with the fixed +/-1 feedback per delivery; with the +concurrency-dependent feedback variants, the defer rate decreases +with increasing concurrency. See text for a discussion of results. +</p> + +</blockquote> + +<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3> + +<p> All results in the previous sections are based on the first +delivery runs only; they do not include any second etc. delivery +attempts. It's also worth noting that the measurements look at +steady-state behavior only. They don't show what happens when the +client starts sending at a much higher or lower concurrency. +</p> + +<p> The first two examples show that the effect of feedback +is negligible when concurrency is limited due to congestion. This +is because the initial concurrency is already at the client's +concurrency maximum, and because there is 10-100 times more positive +than negative feedback. Under these conditions, it is no surprise +that the contribution from SMTP connection caching is also negligible. +</p> + +<p> In the last example, the old +/-1 feedback per delivery will +defer 50% of the mail when confronted with an active (anvil-style) +server concurrency limit, where the server hangs up immediately +with a 421 status (a TCP-level RST would have the same result). +Less aggressive feedback mechanisms fare better than more aggressive +ones. Concurrency-dependent feedback fares even better at higher +concurrencies than shown here, but has limitations as discussed in +the next section. </p> + +<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3> + +<p> Less-than-1 feedback is of interest primarily when sending large +amounts of mail to destinations with active concurrency limiters +(servers that reply with 421, or firewalls that send RST). When +sending small amounts of mail per destination, less-than-1 per-delivery +feedback won't have a noticeable effect on the per-destination +concurrency, because the number of deliveries to the same destination +is too small. You might just as well use zero per-delivery feedback +and stay with the initial per-destination concurrency. And when +mail deliveries fail due to congestion instead of active concurrency +limiters, the measurements above show that per-delivery feedback +has no effect. With large amounts of mail you might just as well +use zero per-delivery feedback and start with the maximal per-destination +concurrency. </p> + +<p> The scheduler with less-than-1 concurrency +feedback per delivery solves a problem with servers that have active +concurrency limiters. This works only because feedback is handled +in a peculiar manner: positive feedback will increment the concurrency +by 1 at the <b>end</b> of a sequence of events of length 1/feedback, +while negative feedback will decrement concurrency by 1 at the +<b>beginning</b> of such a sequence. This is how Postfix adjusts +quickly for overshoot without causing lots of mail to be deferred. +Without this difference in feedback treatment, less-than-1 feedback +per delivery would defer 50% of the mail, and would be no better +in this respect than the old +/-1 feedback per delivery. </p> + +<p> Unfortunately, the same feature that corrects quickly for +concurrency overshoot also makes the scheduler more sensitive for +noisy negative feedback. The reason is that one lonely negative +feedback event has the same effect as a complete sequence of length +1/feedback: in both cases delivery concurrency is dropped by 1 +immediately. As a worst-case scenario, consider multiple servers +behind a load balancer on a single IP address, and no backup MX +address. When 1 out of K servers fails to complete the SMTP handshake +or drops the connection, a scheduler with 1/N (N = concurrency) +feedback stops increasing its concurrency once it reaches a concurrency +level of about K, even though the good servers behind the load +balancer are perfectly capable of handling more traffic. </p> + +<p> This noise problem gets worse as the amount of positive feedback +per delivery gets smaller. A compromise is to use fixed less-than-1 +positive feedback values instead of concurrency-dependent positive +feedback. For example, to tolerate 1 of 4 bad servers in the above +load balancer scenario, use positive feedback of 1/4 per "good" +delivery (no connect or handshake error), and use an equal or smaller +amount of negative feedback per "bad" delivery. The downside of +using concurrency-independent feedback is that some of the old +/-1 +feedback problems will return at large concurrencies. Sites that +must deliver mail at non-trivial per-destination concurrencies will +require special configuration. </p> + +<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3> + +<p> The Postfix 2.5 concurrency scheduler is controlled with the +following configuration parameters, where "<i>transport</i>_foo" +provides a transport-specific parameter override. All parameter +default settings are compatible with earlier Postfix versions. </p> + +<blockquote> + +<table border="0"> + +<tr> <th> Parameter name </th> <th> Postfix version </th> <th> +Description </th> </tr> + +<tr> <td colspan="3"> <hr> </td> </tr> + +<tr> <td> initial_destination_concurrency<br> +<i>transport</i>_initial_destination_concurrency </td> <td +align="center"> all<br> 2.5 </td> <td> Initial per-destination +delivery concurrency </td> </tr> + +<tr> <td> default_destination_concurrency_limit<br> +<i>transport</i>_destination_concurrency_limit </td> <td align="center"> +all<br> all </td> <td> Maximum per-destination delivery concurrency +</td> </tr> + +<tr> <td> default_destination_concurrency_positive_feedback<br> +<i>transport</i>_destination_concurrency_positive_feedback </td> +<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive +feedback amount, per delivery that does not fail with connection +or handshake failure </td> </tr> + +<tr> <td> default_destination_concurrency_negative_feedback<br> +<i>transport</i>_destination_concurrency_negative_feedback </td> +<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative +feedback amount, per delivery that fails with connection or handshake +failure </td> </tr> + +<tr> <td> default_destination_concurrency_failed_cohort_limit<br> +<i>transport</i>_destination_concurrency_failed_cohort_limit </td> +<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed +pseudo-cohorts after which a destination is declared "dead" and +delivery is suspended </td> </tr> + +<tr> <td> destination_concurrency_feedback_debug</td> <td align="center"> +2.5 </td> <td> Enable verbose logging of concurrency scheduler +activity </td> </tr> + +<tr> <td colspan="3"> <hr> </td> </tr> + +</table> + +</blockquote> + +<h2> <a name="jobs"> Preemptive scheduling </a> </h2> + +<p> + +The following sections describe the new queue manager and its +preemptive scheduler algorithm. Note that the document was originally +written to describe the changes between the new queue manager (in +this text referred to as <tt>nqmgr</tt>, the name it was known by +before it became the default queue manager) and the old queue manager +(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt> +every so often. + +</p> + +<p> + +This document is divided into sections as follows: + +</p> + +<ul> + +<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by +nqmgr </a> + +<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks +up the message </a> - how it is assigned to transports, jobs, peers, +entries + +<li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection +works </a> + +<li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption +works </a> - what messages may be preempted and how and what messages +are chosen to preempt them + +<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency +limits affect the scheduling algorithm </a> + +<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource +limits </a> + +</ul> + +<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by +nqmgr </a> </h3> + +<p> + +Let's start by recapitulating the structures and terms used when +referring to the queue manager and how it operates. Many of these are +partially described elsewhere, but it is nice to have a coherent +overview in one place: + +</p> + +<ul> + +<li> <p> Each message structure represents one mail message which +Postfix is to deliver. The message recipients specify to what +destinations is the message to be delivered and what transports are +going to be used for the delivery. </p> + +<li> <p> Each recipient entry groups a batch of recipients of one +message which are all going to be delivered to the same destination +(and over the same transport). +</p> + +<li> <p> Each transport structure groups everything what is going +to be delivered by delivery agents dedicated for that transport. +Each transport maintains a set of queues (describing the destinations +it shall talk to) and jobs (referencing the messages it shall +deliver). </p> + +<li> <p> Each transport queue (not to be confused with the on-disk +"active" queue or "incoming" queue) groups everything what is going be +delivered to given destination (aka nexthop) by its transport. Each +queue belongs to one transport, so each destination may be referred +to by several queues, one for each transport. Each queue maintains +a list of all recipient entries (batches of message recipients) +which shall be delivered to given destination (the todo list), and +a list of recipient entries already being delivered by the delivery +agents (the busy list). </p> + +<li> <p> Each queue corresponds to multiple peer structures. Each +peer structure is like the queue structure, belonging to one transport +and referencing one destination. The difference is that it lists +only the recipient entries which all originate from the same message, +unlike the queue structure, whose entries may originate from various +messages. For messages with few recipients, there is usually just +one recipient entry for each destination, resulting in one recipient +entry per peer. But for large mailing list messages the recipients +may need to be split to multiple recipient entries, in which case +the peer structure may list many entries for single destination. +</p> + +<li> <p> Each transport job groups everything it takes to deliver +one message via its transport. Each job represents one message +within the context of the transport. The job belongs to one transport +and message, so each message may have multiple jobs, one for each +transport. The job groups all the peer structures, which describe +the destinations the job's message has to be delivered to. </p> + +</ul> + +<p> + +The first four structures are common to both <tt>nqmgr</tt> and +<tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>. + +</p> + +<p> + +These terms are used extensively in the text below, feel free to +look up the description above anytime you'll feel you have lost a +sense what is what. + +</p> + +<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks +up the message </a> </h3> + +<p> + +Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue, +the following happens: It reads all necessary information from the +queue file as <tt>oqmgr</tt> does, and also reads as many recipients +as possible - more on that later, for now let's just pretend it +always reads all recipients. + +</p> + +<p> + +Then it resolves the recipients as <tt>oqmgr</tt> does, which +means obtaining (address, nexthop, transport) triple for each +recipient. For each triple, it finds the transport; if it does not +exist yet, it instantiates it (unless it's dead). Within the +transport, it finds the destination queue for the given nexthop; if it +does not exist yet, it instantiates it (unless it's dead). The +triple is then bound to given destination queue. This happens in +qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>. + +</p> + +<p> + +Then for each triple which was bound to some queue (and thus +transport), the program finds the job which represents the message +within that transport's context; if it does not exist yet, it +instantiates it. Within the job, it finds the peer which represents +the bound destination queue within this jobs context; if it does +not exist yet, it instantiates it. Finally, it stores the address +from the resolved triple to the recipient entry which is appended +to both the queue entry list and the peer entry list. The addresses +for the same nexthop are batched in the entries up to the +<i>transport</i>_destination_recipient_limit for that transport. +This happens in qmgr_message_assign(), and apart +from that it operates with job and peer structures, it is basically the +same as in <tt>oqmgr</tt>. + +</p> + +<p> + +When the job is instantiated, it is enqueued on the transport's job +list based on the time its message was picked up by <tt>nqmgr</tt>. +For first batch of recipients this means it is appended to the end +of the job list, but the ordering of the job list by the enqueue +time is important as we will see shortly. + +</p> + +<p> + +[Now you should have a pretty good idea what the state of the +<tt>nqmgr</tt> is after a couple of messages were picked up, and what the +relation is between all those job, peer, queue and entry structures.] + +</p> + +<h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection +works </a> </h3> + +<p> + +Having prepared all those above mentioned structures, the task of +the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries +one at a time and pass them to the delivery agent for corresponding +transport. Now how does this work? + +</p> + +<p> + +The first approximation of the new scheduling algorithm is like this: + +</p> + +<blockquote> +<pre> +foreach transport (round-robin-by-transport) +do + if transport busy continue + if transport process limit reached continue + foreach transport's job (in the order of the transport's job list) + do + foreach job's peer (round-robin-by-destination) + if peer->queue->concurrency < peer->queue->window + return next peer entry. + done + done +done +</pre> +</blockquote> + +<p> + +Now what is the "order of the transport's job list"? As we know +already, the job list is by default kept in the order the message +was picked up by the <tt>nqmgr</tt>. So by default we get the +top-level round-robin transport, and within each transport we get +the FIFO message delivery. The round-robin of the peers by the +destination is perhaps of little importance in most real-life cases +(unless the <i>transport</i>_destination_recipient_limit is reached, +in one job there +is only one peer structure for each destination), but theoretically +it makes sure that even within single jobs, destinations are treated +fairly. + +</p> + +<p> + +[By now you should have a feeling you really know how the scheduler +works, except for the preemption, under ideal conditions - that is, +no recipient resource limits and no destination concurrency problems.] + +</p> + +<h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption +works </a> </h3> + +<p> + +As you might perhaps expect by now, the transport's job list does +not remain sorted by the job's message enqueue time all the time. +The most cool thing about <tt>nqmgr</tt> is not the simple FIFO +delivery, but that it is able to slip mail with little recipients +past the mailing-list bulk mail. This is what the job preemption +is about - shuffling the jobs on the transport's job list to get +the best message delivery rates. Now how is it achieved? + +</p> + +<p> + +First I have to tell you that there are in fact two job lists in +each transport. One is the scheduler's job list, which the scheduler +is free to play with, while the other one keeps the jobs always +listed in the order of the enqueue time and is used for recipient +pool management we will discuss later. For now, we will deal with +the scheduler's job list only. + +</p> + +<p> + +So, we have the job list, which is first ordered by the time the +jobs' messages were enqueued, oldest messages first, the most recently +picked one at the end. For now, let's assume that there are no +destination concurrency problems. Without preemption, we pick some +entry of the first (oldest) job on the queue, assign it to delivery +agent, pick another one from the same job, assign it again, and so +on, until all the entries are used and the job is delivered. We +would then move onto the next job and so on and on. Now how do we +manage to sneak in some entries from the recently added jobs when +the first job on the job list belongs to a message going to the +mailing-list and has thousands of recipient entries? + +</p> + +<p> + +The <tt>nqmgr</tt>'s answer is that we can artificially "inflate" +the delivery time of that first job by some constant for free - it +is basically the same trick you might remember as "accumulation of +potential" from the amortized complexity lessons. For example, +instead of delivering the entries of the first job on the job list +every time a delivery agent becomes available, we can do it only +every second time. If you view the moments the delivery agent becomes +available on a timeline as "delivery slots", then instead of using +every delivery slot for the first job, we can use only every other +slot, and still the overall delivery efficiency of the first job +remains the same. So the delivery <tt>11112222</tt> becomes +<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, . +denotes the free slot). Now what do we do with free slots? + +</p> + +<p> + +As you might have guessed, we will use them for sneaking the mail +with little recipients in. For example, if we have one four-recipient +mail followed by four one recipients mail, the delivery sequence +(that is, the sequence in which the jobs are assigned to the +delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine +for sneaking in the single recipient mail, but how do we sneak in +the mail with more than one recipient? Say if we have one four-recipient +mail followed by two two-recipient mails? + +</p> + +<p> + +The simple answer would be to use delivery sequence <tt>12121313</tt>. +But the problem is that this does not scale well. Imagine you have +mail with a thousand recipients followed by mail with a hundred recipients. +It is tempting to suggest the delivery sequence like <tt>121212....</tt>, +but alas! Imagine there arrives another mail with say ten recipients. +But there are no free slots anymore, so it can't slip by, not even +if it had only one recipient. It will be stuck until the +hundred-recipient mail is delivered, which really sucks. + +</p> + +<p> + +So, it becomes obvious that while inflating the message to get +free slots is a great idea, one has to be really careful of how the +free slots are assigned, otherwise one might corner himself. So, +how does <tt>nqmgr</tt> really use the free slots? + +</p> + +<p> + +The key idea is that one does not have to generate the free slots +in a uniform way. The delivery sequence <tt>111...1</tt> is no +worse than <tt>1.1.1.1</tt>, in fact, it is even better as some +entries are in the first case selected earlier than in the second +case, and none is selected later! So it is possible to first +"accumulate" the free delivery slots and then use them all at once. +It is even possible to accumulate some, then use them, then accumulate +some more and use them again, as in <tt>11..1.1</tt> . + +</p> + +<p> + +Let's get back to the one hundred recipient example. We now know +that we could first accumulate one hundred free slots, and only +after then to preempt the first job and sneak the one hundred +recipient mail in. Applying the algorithm recursively, we see the +hundred recipient job can accumulate ten free delivery slots, and +then we could preempt it and sneak in the ten-recipient mail... +Wait wait wait! Could we? Aren't we overinflating the original one +thousand recipient mail? + +</p> + +<p> + +Well, despite the fact that it looks so at the first glance, another trick will +allow us to answer "no, we are not!". If we had said that we will +inflate the delivery time twice at maximum, and then we consider +every other slot as a free slot, then we would overinflate in case +of the recursive preemption. BUT! The trick is that if we use only +every n-th slot as a free slot for n>2, there is always some worst +inflation factor which we can guarantee not to be breached, even +if we apply the algorithm recursively. To be precise, if for every +k>1 normally used slots we accumulate one free delivery slot, than +the inflation factor is not worse than k/(k-1) no matter how many +recursive preemptions happen. And it's not worse than (k+1)/k if +only non-recursive preemption happens. Now, having got through the +theory and the related math, let's see how <tt>nqmgr</tt> implements +this. + +</p> + +<p> + +Each job has so called "available delivery slot" counter. Each +transport has a <i>transport</i>_delivery_slot_cost parameter, which +defaults to default_delivery_slot_cost parameter which is set to 5 +by default. This is the k from the paragraph above. Each time k +entries of the job are selected for delivery, this counter is +incremented by one. Once there are some slots accumulated, a job which +requires no more than that number of slots to be fully delivered +can preempt this job. + +</p> + +<p> + +[Well, the truth is, the counter is incremented every time an entry +is selected and it is divided by k when it is used. +But to understand, it's good enough to use +the above approximation of the truth.] + +</p> + +<p> + +OK, so now we know the conditions which must be satisfied so one +job can preempt another one. But what job gets preempted, how do +we choose what job preempts it if there are several valid candidates, +and when does all this exactly happen? + +</p> + +<p> + +The answer for the first part is simple. The job whose entry was +selected the last time is the so called current job. Normally, it is +the first job on the scheduler's job list, but destination concurrency +limits may change this as we will see later. It is always only the +current job which may get preempted. + +</p> + +<p> + +Now for the second part. The current job has a certain amount of +recipient entries, and as such may accumulate at maximum some amount +of available delivery slots. It might have already accumulated some, +and perhaps even already used some when it was preempted before +(remember a job can be preempted several times). In either case, +we know how many are accumulated and how many are left to deliver, +so we know how many it may yet accumulate at maximum. Every other +job which may be delivered by less than that number of slots is a +valid candidate for preemption. How do we choose among them? + +</p> + +<p> + +The answer is - the one with maximum enqueue_time/recipient_entry_count. +That is, the older the job is, the more we should try to deliver +it in order to get best message delivery rates. These rates are of +course subject to how many recipients the message has, therefore +the division by the recipient (entry) count. No one shall be surprised +that a message with n recipients takes n times longer to deliver than +a message with one recipient. + +</p> + +<p> + +Now let's recap the previous two paragraphs. Isn't it too complicated? +Why don't the candidates come only among the jobs which can be +delivered within the number of slots the current job already +accumulated? Why do we need to estimate how much it has yet to +accumulate? If you found out the answer, congratulate yourself. If +we did it this simple way, we would always choose the candidate +with the fewest recipient entries. If there were enough single recipient +mails coming in, they would always slip by the bulk mail as soon +as possible, and the two or more recipients mail would never get +a chance, no matter how long they have been sitting around in the +job list. + +</p> + +<p> + +This candidate selection has an interesting implication - that when +we choose the best candidate for preemption (this is done in +qmgr_choose_candidate()), it may happen that we may not use it for +preemption immediately. This leads to an answer to the last part +of the original question - when does the preemption happen? + +</p> + +<p> + +The preemption attempt happens every time next transport's recipient +entry is to be chosen for delivery. To avoid needless overhead, the +preemption is not attempted if the current job could never accumulate +more than <i>transport</i>_minimum_delivery_slots (defaults to +default_minimum_delivery_slots which defaults to 3). If there are +already enough accumulated slots to preempt the current job by the +chosen best candidate, it is done immediately. This basically means +that the candidate is moved in front of the current job on the +scheduler's job list and decreasing the accumulated slot counter +by the amount used by the candidate. If there are not enough slots... +well, I could say that nothing happens and the another preemption +is attempted the next time. But that's not the complete truth. + +</p> + +<p> + +The truth is that it turns out that it is not really necessary to +wait until the jobs counter accumulates all the delivery slots in +advance. Say we have ten-recipient mail followed by two two-recipient +mails. If the preemption happened when enough delivery slots accumulate +(assuming slot cost 2), the delivery sequence becomes +<tt>11112211113311</tt>. Now what would we get if we would wait +only for 50% of the necessary slots to accumulate and we promise +we would wait for the remaining 50% later, after we get back +to the preempted job? If we use such a slot loan, the delivery sequence +becomes <tt>11221111331111</tt>. As we can see, it makes it not +considerably worse for the delivery of the ten-recipient mail, but +it allows the small messages to be delivered sooner. + +</p> + +<p> + +The concept of these slot loans is where the +<i>transport</i>_delivery_slot_discount and +<i>transport</i>_delivery_slot_loan come from (they default to +default_delivery_slot_discount and default_delivery_slot_loan, whose +values are by default 50 and 3, respectively). The discount (resp. +loan) specifies how many percent (resp. how many slots) one "gets +in advance", when the number of slots required to deliver the best +candidate is compared with the number of slots the current slot had +accumulated so far. + +</p> + +<p> + +And that pretty much concludes this chapter. + +</p> + +<p> + +[Now you should have a feeling that you pretty much understand the +scheduler and the preemption, or at least that you will have +after you read the last chapter a couple more times. You shall clearly +see the job list and the preemption happening at its head, in ideal +delivery conditions. The feeling of understanding shall last until +you start wondering what happens if some of the jobs are blocked, +which you might eventually figure out correctly from what had been +said already. But I would be surprised if your mental image of the +scheduler's functionality is not completely shattered once you +start wondering how it works when not all recipients may be read +in-core. More on that later.] + +</p> + +<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency +limits affect the scheduling algorithm </a> </h3> + +<p> + +The <tt>nqmgr</tt> uses the same algorithm for destination concurrency +control as <tt>oqmgr</tt>. Now what happens when the destination +limits are reached and no more entries for that destination may be +selected by the scheduler? + +</p> + +<p> + +From the user's point of view it is all simple. If some of the peers +of a job can't be selected, those peers are simply skipped by the +entry selection algorithm (the pseudo-code described before) and +only the selectable ones are used. If none of the peers may be +selected, the job is declared a "blocker job". Blocker jobs are +skipped by the entry selection algorithm and they are also excluded +from the candidates for preemption of the current job. Thus the scheduler +effectively behaves as if the blocker jobs didn't exist on the job +list at all. As soon as at least one of the peers of a blocker job +becomes unblocked (that is, the delivery agent handling the delivery +of the recipient entry for the given destination successfully finishes), +the job's blocker status is removed and the job again participates +in all further scheduler actions normally. + +</p> + +<p> + +So the summary is that the users don't really have to be concerned +about the interaction of the destination limits and scheduling +algorithm. It works well on its own and there are no knobs they +would need to control it. + +</p> + +<p> + +From a programmer's point of view, the blocker jobs complicate the +scheduler quite a lot. Without them, the jobs on the job list would +be normally delivered in strict FIFO order. If the current job is +preempted, the job preempting it is completely delivered unless it +is preempted itself. Without blockers, the current job is thus +always either the first job on the job list, or the top of the stack +of jobs preempting the first job on the job list. + +</p> + +<p> + +The visualization of the job list and the preemption stack without +blockers would be like this: + +</p> + +<blockquote> +<pre> +first job-> 1--2--3--5--6--8--... <- job list +on job list | + 4 <- preemption stack + | +current job-> 7 +</pre> +</blockquote> + +<p> + +In the example above we see that job 1 was preempted by job 4 and +then job 4 was preempted by job 7. After job 7 is completed, remaining +entries of job 4 are selected, and once they are all selected, job +1 continues. + +</p> + +<p> + +As we see, it's all very clean and straightforward. Now how does +this change because of blockers? + +</p> + +<p> + +The answer is: a lot. Any job may become a blocker job at any time, +and also become a normal job again at any time. This has several +important implications: + +</p> + +<ol> + +<li> <p> + +The jobs may be completed in arbitrary order. For example, in the +example above, if the current job 7 becomes blocked, the next job +4 may complete before the job 7 becomes unblocked again. Or if both +7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked +and is completed, then 2 is completed and only after that 4 becomes +unblocked and is completed... You get the idea. + +</p> + +<p> + +[Interesting side note: even when jobs are delivered out of order, +from a single destination's point of view the jobs are still delivered +in the expected order (that is, FIFO unless there was some preemption +involved). This is because whenever a destination queue becomes +unblocked (the destination limit allows selection of more recipient +entries for that destination), all jobs which have peers for that +destination are unblocked at once.] + +</p> + +<li> <p> + +The idea of the preemption stack at the head of the job list is +gone. That is, it must be possible to preempt any job on the job +list. For example, if the jobs 7, 4, 1 and 2 in the example above +become all blocked, job 3 becomes the current job. And of course +we do not want the preemption to be affected by the fact that there +are some blocked jobs or not. Therefore, if it turns out that job +3 might be preempted by job 6, the implementation shall make it +possible. + +</p> + +<li> <p> + +The idea of the linear preemption stack itself is gone. It's no +longer true that one job is always preempted by only one job at one +time (that is directly preempted, not counting the recursively +nested jobs). For example, in the example above, job 1 is directly +preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes +blocked, and job 4 is being delivered. If it accumulates enough +delivery slots, it is natural that it might be preempted for example +by job 8. Now job 4 is preempted by both job 7 AND job 8 at the +same time. + +</p> + +</ol> + +<p> + +Now combine the points 2) and 3) with point 1) again and you realize +that the relations on the once linear job list became pretty +complicated. If we extend the point 3) example: jobs 7 and 8 preempt +job 4, now job 8 becomes blocked too, then job 4 completes. Tricky, +huh? + +</p> + +<p> + +If I illustrate the relations after the above mentioned examples +(but those in point 1), the situation would look like this: + +</p> + +<blockquote> +<pre> + v- parent + +adoptive parent -> 1--2--3--5--... <- "stack" level 0 + | | +parent gone -> ? 6 <- "stack" level 1 + / \ +children -> 7 8 ^- child <- "stack" level 2 + + ^- siblings +</pre> +</blockquote> + +<p> + +Now how does <tt>nqmgr</tt> deal with all these complicated relations? + +</p> + +<p> + +Well, it maintains them all as described, but fortunately, all these +relations are necessary only for the purpose of proper counting of +available delivery slots. For the purpose of ordering the jobs for +entry selection, the original rule still applies: "the job preempting +the current job is moved in front of the current job on the job +list". So for entry selection purposes, the job relations remain +as simple as this: + +</p> + +<blockquote> +<pre> +7--8--1--2--6--3--5--.. <- scheduler's job list order +</pre> +</blockquote> + +<p> + +The job list order and the preemption parent/child/siblings relations +are maintained separately. And because the selection works only +with the job list, you can happily forget about those complicated +relations unless you want to study the <tt>nqmgr</tt> sources. In +that case the text above might provide some helpful introduction +to the problem domain. Otherwise I suggest you just forget about +all this and stick with the user's point of view: the blocker jobs +are simply ignored. + +</p> + +<p> + +[By now, you should have a feeling that there are more things going +on under the hood than you ever wanted to know. You decide that +forgetting about this chapter is the best you can do for the sake +of your mind's health and you basically stick with the idea how the +scheduler works in ideal conditions, when there are no blockers, +which is good enough.] + +</p> + +<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource +limits </a> </h3> + +<p> + +When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed +that all recipients of all messages in the "active" queue are completely +read into memory. This is simply not true. There is an upper +bound on the amount of memory the <tt>nqmgr</tt> may use, and +therefore it must impose some limits on the information it may store +in memory at any given time. + +</p> + +<p> + +First of all, not all messages may be read in-core at once. At any +time, only qmgr_message_active_limit messages may be read in-core +at maximum. When read into memory, the messages are picked from the +"incoming" and "deferred" queues and moved to the "active" queue +(incoming having priority), so if there are more than +qmgr_message_active_limit messages queued in the "active" queue, the +rest will have to wait until (some of) the messages in the "active" queue +are completely delivered (or deferred). + +</p> + +<p> + +Even with the limited amount of in-core messages, there is another +limit which must be imposed in order to avoid memory exhaustion. +Each message may contain a huge number of recipients (tens or hundreds +of thousands are not uncommon), so if <tt>nqmgr</tt> read all +recipients of all messages in the "active" queue, it may easily run +out of memory. Therefore there must be some upper bound on the +amount of message recipients which are read into memory at the +same time. + +</p> + +<p> + +Before discussing how exactly <tt>nqmgr</tt> implements the recipient +limits, let's see how the sole existence of the limits themselves +affects the <tt>nqmgr</tt> and its scheduler. + +</p> + +<p> + +The message limit is straightforward - it just limits the size of +the +lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which +message can preempt the current one. Messages not in the "active" queue +are simply not considered at all. + +</p> + +<p> + +The recipient limit complicates more things. First of all, the +message reading code must support reading the recipients in batches, +which among other things means accessing the queue file several +times and continuing where the last recipient batch ended. This is +invoked by the scheduler whenever the current job has space for more +recipients, subject to transport's refill_limit and refill_delay parameters. +It is also done any time when all +in-core recipients of the message are dealt with (which may also +mean they were deferred) but there are still more in the queue file. + +</p> + +<p> + +The second complication is that with some recipients left unread +in the queue file, the scheduler can't operate with exact counts +of recipient entries. With unread recipients, it is not clear how +many recipient entries there will be, as they are subject to +per-destination grouping. It is not even clear to what transports +(and thus jobs) the recipients will be assigned. And with messages +coming from the "deferred" queue, it is not even clear how many unread +recipients are still to be delivered. This all means that the +scheduler must use only estimates of how many recipients entries +there will be. Fortunately, it is possible to estimate the minimum +and maximum correctly, so the scheduler can always err on the safe +side. Obviously, the better the estimates, the better the results, so +it is best when we are able to read all recipients in-core and turn +the estimates into exact counts, or at least try to read as many +as possible to make the estimates as accurate as possible. + +</p> + +<p> + +The third complication is that it is no longer true that the scheduler +is done with a job once all of its in-core recipients are delivered. +It is possible that the job will be revived later, when another +batch of recipients is read in core. It is also possible that some +jobs will be created for the first time long after the first batch +of recipients was read in core. The <tt>nqmgr</tt> code must be +ready to handle all such situations. + +</p> + +<p> + +And finally, the fourth complication is that the <tt>nqmgr</tt> +code must somehow impose the recipient limit itself. Now how does +it achieve it? + +</p> + +<p> + +Perhaps the easiest solution would be to say that each message may +have at maximum X recipients stored in-core, but such a solution would +be poor for several reasons. With reasonable qmgr_message_active_limit +values, the X would have to be quite low to maintain a reasonable +memory footprint. And with low X lots of things would not work well. +The <tt>nqmgr</tt> would have problems to use the +<i>transport</i>_destination_recipient_limit efficiently. The +scheduler's preemption would be suboptimal as the recipient count +estimates would be inaccurate. The message queue file would have +to be accessed many times to read in more recipients again and +again. + +</p> + +<p> + +Therefore it seems reasonable to have a solution which does not use +a limit imposed on a per-message basis, but which maintains a pool +of available recipient slots, which can be shared among all messages +in the most efficient manner. And as we do not want separate +transports to compete for resources whenever possible, it seems +appropriate to maintain such a recipient pool for each transport +separately. This is the general idea, now how does it work in +practice? + +</p> + +<p> + +First we have to solve a little chicken-and-egg problem. If we want +to use the per-transport recipient pools, we first need to know to +what transport(s) the message is assigned. But we will find that +out only after we first read in the recipients. So it is obvious +that we first have to read in some recipients, use them to find out +to what transports the message is to be assigned, and only after +that can we use the per-transport recipient pools. + +</p> + +<p> + +Now how many recipients shall we read for the first time? This is +what qmgr_message_recipient_minimum and qmgr_message_recipient_limit +values control. The qmgr_message_recipient_minimum value specifies +how many recipients of each message we will read the first time, +no matter what. It is necessary to read at least one recipient +before we can assign the message to a transport and create the first +job. However, reading only qmgr_message_recipient_minimum recipients +even if there are only few messages with few recipients in-core would +be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit +recipients in-core so far, the first batch of recipients may be +larger than qmgr_message_recipient_minimum - as large as is required +to reach the qmgr_message_recipient_limit limit. + +</p> + +<p> + +Once the first batch of recipients was read in core and the message +jobs were created, the size of the subsequent recipient batches (if +any - of course it's best when all recipients are read in one batch) +is based solely on the position of the message jobs on their +corresponding transports' job lists. Each transport has a pool of +<i>transport</i>_recipient_limit recipient slots which it can +distribute among its jobs (how this is done is described later). +The subsequent recipient batch may be as large as the sum of all +recipient slots of all jobs of the message permits (plus the +qmgr_message_recipient_minimum amount which always applies). + +</p> + +<p> + +For example, if a message has three jobs, the first with 1 recipient +still in-core and 4 recipient slots, the second with 5 recipients in-core +and 5 recipient slots, and the third with 2 recipients in-core and 0 +recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 jobs' +recipients slots in total. This means that we could immediately +read 2+qmgr_message_recipient_minimum more recipients of that message +in core. + +</p> + +<p> + +The above example illustrates several things which might be worth +mentioning explicitly: first, note that although the per-transport +slots are assigned to particular jobs, we can't guarantee that once +the next batch of recipients is read in core, that the corresponding +amounts of recipients will be assigned to those jobs. The jobs lend +its slots to the message as a whole, so it is possible that some +jobs end up sponsoring other jobs of their message. For example, +if in the example above the 2 newly read recipients were assigned +to the second job, the first job sponsored the second job with 2 +slots. The second notable thing is the third job, which has more +recipients in-core than it has slots. Apart from sponsoring by other +job we just saw it can be result of the first recipient batch, which +is sponsored from global recipient pool of qmgr_message_recipient_limit +recipients. It can be also sponsored from the message recipient +pool of qmgr_message_recipient_minimum recipients. + +</p> + +<p> + +Now how does each transport distribute the recipient slots among +its jobs? The strategy is quite simple. As most scheduler activity +happens on the head of the job list, it is our intention to make +sure that the scheduler has the best estimates of the recipient +counts for those jobs. As we mentioned above, this means that we +want to try to make sure that the messages of those jobs have all +recipients read in-core. Therefore the transport distributes the +slots "along" the job list from start to end. In this case the job +list sorted by message enqueue time is used, because it doesn't +change over time as the scheduler's job list does. + +</p> + +<p> + +More specifically, each time a job is created and appended to the +job list, it gets all unused recipient slots from its transport's +pool. It keeps them until all recipients of its message are read. +When this happens, all unused recipient slots are transferred to +the next job (which is now in fact the first such job) on the job +list which still has some recipients unread, or eventually back to +the transport pool if there is no such job. Such a transfer then also +happens whenever a recipient entry of that job is delivered. + +</p> + +<p> + +There is also a scenario when a job is not appended to the end of +the job list (for example it was created as a result of a second or +later recipient batch). Then it works exactly as above, except that +if it was put in front of the first unread job (that is, the job +of a message which still has some unread recipients in the queue file), +that job is first forced to return all of its unused recipient slots +to the transport pool. + +</p> + +<p> + +The algorithm just described leads to the following state: The first +unread job on the job list always gets all the remaining recipient +slots of that transport (if there are any). The jobs queued before +this job are completely read (that is, all recipients of their +message were already read in core) and have at maximum as many slots +as they still have recipients in-core (the maximum is there because +of the sponsoring mentioned before) and the jobs after this job get +nothing from the transport recipient pool (unless they got something +before and then the first unread job was created and enqueued in +front of them later - in such a case, they also get at maximum as many +slots as they have recipients in-core). + +</p> + +<p> + +Things work fine in such a state for most of the time, because the +current job is either completely read in-core or has as many recipient +slots as there are, but there is one situation which we still have +to take care of specially. Imagine if the current job is preempted +by some unread job from the job list and there are no more recipient +slots available, so this new current job could read only batches +of qmgr_message_recipient_minimum recipients at a time. This would +really degrade performance. For this reason, each transport has an +extra pool of <i>transport</i>_extra_recipient_limit recipient +slots, dedicated exactly for this situation. Each time an unread +job preempts the current job, it gets half of the remaining recipient +slots from the normal pool and this extra pool. + +</p> + +<p> + +And that's it. It sure does sound pretty complicated, but fortunately +most people don't really have to care exactly how it works as long +as it works. Perhaps the only important things to know for most +people are the following upper bound formulas: + +</p> + +<p> + +Each transport has at maximum + +</p> + +<blockquote> +<pre> +max( +qmgr_message_recipient_minimum * qmgr_message_active_limit ++ *_recipient_limit + *_extra_recipient_limit, +qmgr_message_recipient_limit +) +</pre> +</blockquote> + +<p> + +recipients in core. + +</p> + +<p> + +The total amount of recipients in core is + +</p> + +<blockquote> +<pre> +max( +qmgr_message_recipient_minimum * qmgr_message_active_limit ++ sum( *_recipient_limit + *_extra_recipient_limit ), +qmgr_message_recipient_limit +) +</pre> +</blockquote> + +<p> + +where the sum is over all used transports. + +</p> + +<p> + +And this terribly complicated chapter concludes the documentation +of the <tt>nqmgr</tt> scheduler. + +</p> + +<p> + +[By now you should theoretically know the <tt>nqmgr</tt> scheduler +inside out. In practice, you still hope that you will never have +to really understand the last or last two chapters completely, and +fortunately most people really won't. Understanding how the scheduler +works in ideal conditions is more than good enough for the vast majority +of users.] + +</p> + +<h2> <a name="credits"> Credits </a> </h2> + +<ul> + +<li> Wietse Venema designed and implemented the initial queue manager +with per-domain FIFO scheduling, and per-delivery +/-1 concurrency +feedback. + +<li> Patrik Rak designed and implemented preemption where mail with +fewer recipients can slip past mail with more recipients in a +controlled manner, and wrote up its documentation. + +<li> Wietse Venema initiated a discussion with Patrik Rak and Victor +Duchovni on alternatives for the +/-1 feedback scheduler's aggressive +behavior. This is when K/N feedback was reviewed (N = concurrency). +The discussion ended without a good solution for both negative +feedback and dead site detection. + +<li> Victor Duchovni resumed work on concurrency feedback in the +context of concurrency-limited servers. + +<li> Wietse Venema then re-designed the concurrency scheduler in +terms of the simplest possible concepts: less-than-1 concurrency +feedback per delivery, forward and reverse concurrency feedback +hysteresis, and pseudo-cohort failure. At this same time, concurrency +feedback was separated from dead site detection. + +<li> These simplifications, and their modular implementation, helped +to develop further insights into the different roles that positive +and negative concurrency feedback play, and helped to identify some +worst-case scenarios. + +</ul> + +</body> + +</html> |