Postfix Queue Scheduler

-------------------------------------------------------------------------------

Disclaimer

Many of the transport-specific configuration parameters discussed in this document will not show up in "postconf" command output before Postfix version 2.9. This limitation applies to many parameters whose name is a combination of a master.cf service name such as "relay" and a built-in suffix such as "_destination_concurrency_limit".

Overview

The queue manager is by far the most complex part of the Postfix mail system. It schedules delivery of new mail, retries failed deliveries at specific times, and removes mail from the queue after the last delivery attempt. There are two major classes of mechanisms that control the operation of the queue manager.

Topics covered by this document:

  * Concurrency scheduling, concerned with the number of concurrent deliveries to a specific destination, including decisions on when to suspend deliveries after persistent failures.
  * Preemptive scheduling, concerned with the selection of email messages and recipients for a given destination.
  * Credits, something this document would not be complete without.

Concurrency scheduling

The following sections document the Postfix 2.5 concurrency scheduler, after a discussion of the limitations of the earlier concurrency scheduler. This is followed by results of medium-concurrency experiments, and a discussion of trade-offs between performance and robustness.

The material is organized as follows:

  * Drawbacks of the existing concurrency scheduler
  * Summary of the Postfix 2.5 concurrency feedback algorithm
  * Summary of the Postfix 2.5 "dead destination" detection algorithm
  * Pseudocode for the Postfix 2.5 concurrency scheduler
  * Results for delivery to concurrency-limited servers
  * Discussion of concurrency-limited server results
  * Limitations of less-than-1 per delivery feedback
  * Concurrency configuration parameters

Drawbacks of the existing concurrency scheduler

From the start, Postfix has used a simple but robust algorithm where the per-destination delivery concurrency is decremented by 1 after delivery failed due to connection or handshake failure, and incremented by 1 otherwise. Of course the concurrency is never allowed to exceed the maximum per-destination concurrency limit. And when a destination's concurrency level drops to zero, the destination is declared "dead" and delivery is suspended.

Drawbacks of +/-1 concurrency feedback per delivery are:

  * Overshoot due to exponential delivery concurrency growth with each pseudo-cohort(*). This can be an issue with high-concurrency channels. For example, with the default initial concurrency of 5, concurrency would proceed over time as (5-10-20).
  * Throttling down to zero concurrency after a single pseudo-cohort(*) failure. This was especially an issue with low-concurrency channels where a single failure could be sufficient to mark a destination as "dead", causing the suspension of further deliveries to the affected destination.

(*) A pseudo-cohort is a number of delivery requests equal to a destination's delivery concurrency.

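To make the first drawback concrete, here is a minimal sketch (illustrative only, not actual Postfix code) of how +1 feedback per successful delivery doubles the delivery concurrency with every successful pseudo-cohort:

    /*
     * Illustrative only: with +1 feedback per delivery, one successful
     * pseudo-cohort (N deliveries at concurrency N) raises the concurrency
     * by N, i.e. the concurrency doubles with every pseudo-cohort until it
     * reaches the per-destination concurrency limit.
     */
    #include <stdio.h>

    int main(void)
    {
        int concurrency = 5;            /* default initial concurrency */
        int limit = 20;                 /* per-destination concurrency limit */
        int cohort;

        for (cohort = 1; concurrency < limit; cohort++) {
            concurrency += concurrency; /* +1 for each of N successful deliveries */
            if (concurrency > limit)
                concurrency = limit;
            printf("after pseudo-cohort %d: concurrency = %d\n",
                   cohort, concurrency);
        }
        return 0;
    }
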
The revised concurrency scheduler has a highly modular structure. It uses separate mechanisms for per-destination concurrency control and for "dead destination" detection. The concurrency control in turn is built from two separate mechanisms: it supports less-than-1 feedback per delivery to allow for more gradual concurrency adjustments, and it uses feedback hysteresis to suppress concurrency oscillations. And instead of waiting for delivery concurrency to throttle down to zero, a destination is declared "dead" after a configurable number of pseudo-cohorts reports connection or handshake failure.

Summary of the Postfix 2.5 concurrency feedback algorithm

We want to increment a destination's delivery concurrency when some (not necessarily consecutive) number of deliveries complete without connection or handshake failure. This is implemented with positive feedback g(N) where N is the destination's delivery concurrency. With g(N)=1 feedback per delivery, concurrency increases by 1 after each positive feedback event; this gives us the old scheduler's exponential growth in time. With g(N)=1/N feedback per delivery, concurrency increases by 1 after an entire pseudo-cohort N of positive feedback reports; this gives us linear growth in time. Less-than-1 feedback per delivery and integer truncation naturally give us hysteresis, so that transitions to larger concurrency happen every 1/g(N) positive feedback events.

We want to decrement a destination's delivery concurrency when some (not necessarily consecutive) number of deliveries complete after connection or handshake failure. This is implemented with negative feedback f(N) where N is the destination's delivery concurrency. With f(N)=1 feedback per delivery, concurrency decreases by 1 after each negative feedback event; this gives us the old scheduler's behavior where concurrency is throttled down dramatically after a single pseudo-cohort failure. With f(N)=1/N feedback per delivery, concurrency backs off more gently. Again, less-than-1 feedback per delivery and integer truncation naturally give us hysteresis, so that transitions to lower concurrency happen every 1/f(N) negative feedback events.

However, with negative feedback we introduce a subtle twist. We "reverse" the negative hysteresis cycle so that the transition to lower concurrency happens at the beginning of a sequence of 1/f(N) negative feedback events. Otherwise, a correction for overload would be made too late. This makes the choice of f(N) relatively unimportant, as borne out by measurements later in this document.

In summary, the main ingredients for the Postfix 2.5 concurrency feedback algorithm are a) the option of less-than-1 positive feedback per delivery to avoid overwhelming servers, b) the option of less-than-1 negative feedback per delivery to avoid giving up too fast, c) feedback hysteresis to avoid rapid oscillation, and d) a "reverse" hysteresis cycle for negative feedback, so that it can correct for overload quickly.

Summary of the Postfix 2.5 "dead destination" detection algorithm

We want to suspend deliveries to a specific destination after some number of deliveries suffers connection or handshake failure. The old scheduler declares a destination "dead" when negative (-1) feedback throttles the delivery concurrency down to zero.
With less-than-1 feedback per delivery, this throttling down would obviously take too long. We therefore have to separate "dead destination" detection from concurrency feedback. This is implemented by introducing the concept of pseudo-cohort failure. The Postfix 2.5 concurrency scheduler declares a destination "dead" after a configurable number of pseudo-cohorts suffers from connection or handshake failures. The old scheduler corresponds to the special case where the pseudo-cohort failure limit is equal to 1.

Pseudocode for the Postfix 2.5 concurrency scheduler

The pseudocode below shows how the ideas behind the new concurrency scheduler are implemented as of November 2007. The actual code can be found in the module qmgr/qmgr_queue.c.

    Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

    Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

    Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

    After success:
        fail_cohorts = 0
        Be prepared for feedback > hysteresis, or rounding error
        success += g(concurrency)
        while (success >= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency > concurrency limit)
            concurrency = concurrency limit

    Safety:
        Don't apply positive feedback unless
            concurrency < busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

    After failure:
        if (concurrency > 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts > cohort_failure_limit)
                concurrency = 0
        if (concurrency > 0)
            Be prepared for feedback > hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure < 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency < 1)
                concurrency = 1

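For readers who prefer real code over pseudocode, here is a small, self-contained C sketch of the same algorithm with g(N) = f(N) = 1/N. This is illustrative only and is NOT the actual qmgr/qmgr_queue.c implementation; the names, the limits and the demo in main() are invented for this example, and the "Safety" check above is omitted:

    #include <stdio.h>

    struct dest {
        int    concurrency;             /* current delivery concurrency */
        double success;                 /* accumulated positive feedback */
        double failure;                 /* accumulated negative feedback */
        double fail_cohorts;            /* accumulated pseudo-cohort failures */
    };

    /* Feedback functions: this sketch uses g(N) = 1/N and f(N) = 1/N. */
    static double pos_feedback(int n) { return 1.0 / n; }
    static double neg_feedback(int n) { return 1.0 / n; }

    #define INIT_CONCURRENCY     5
    #define CONCURRENCY_LIMIT    20
    #define COHORT_FAILURE_LIMIT 1.0

    static void dest_init(struct dest *d)
    {
        d->concurrency = INIT_CONCURRENCY;
        d->success = d->failure = d->fail_cohorts = 0;
    }

    /* A delivery completed without connection or handshake failure. */
    static void dest_success(struct dest *d)
    {
        d->fail_cohorts = 0;
        d->success += pos_feedback(d->concurrency);
        while (d->success >= 1) {               /* hysteresis 1 */
            d->concurrency += 1;
            d->failure = 0;
            d->success -= 1;
        }
        if (d->concurrency > CONCURRENCY_LIMIT) /* overshoot */
            d->concurrency = CONCURRENCY_LIMIT;
    }

    /* A delivery failed with connection or handshake failure. */
    static void dest_failure(struct dest *d)
    {
        if (d->concurrency > 0) {
            d->fail_cohorts += 1.0 / d->concurrency;
            if (d->fail_cohorts > COHORT_FAILURE_LIMIT)
                d->concurrency = 0;             /* destination is "dead" */
        }
        if (d->concurrency > 0) {
            d->failure -= neg_feedback(d->concurrency);
            while (d->failure < 0) {            /* "reverse" hysteresis 1 */
                d->concurrency -= 1;
                d->failure += 1;
                d->success = 0;
            }
            if (d->concurrency < 1)             /* overshoot */
                d->concurrency = 1;
        }
    }

    int main(void)
    {
        struct dest d;
        int i;

        dest_init(&d);
        for (i = 0; i < 30; i++)                /* 30 good deliveries */
            dest_success(&d);
        printf("after 30 successes: concurrency = %d\n", d.concurrency);
        dest_failure(&d);                       /* a single bad delivery */
        printf("after 1 failure:    concurrency = %d\n", d.concurrency);
        return 0;
    }

Note that a single failure lowers the concurrency by one immediately, while it takes a whole pseudo-cohort of successes to raise it by one; this is the "reverse" hysteresis cycle described earlier.
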
Results for delivery to concurrency-limited servers

Discussions about the concurrency scheduler redesign started in early 2004, when the primary goal was to find alternatives that did not exhibit exponential growth or rapid concurrency throttling. No code was implemented until late 2007, when the primary concern had shifted towards better handling of server concurrency limits. For this reason we measure how well the new scheduler does this job. The table below compares mail delivery performance of the old +/-1 feedback per delivery with several less-than-1 feedback functions, for different limited-concurrency server scenarios. Measurements were done with a FreeBSD 6.2 client and with FreeBSD 6.2 and various Linux servers.

Server configuration:

  * The mail flow was slowed down with 1 second latency per recipient ("smtpd_client_restrictions = sleep 1"). The purpose was to make results less dependent on hardware details, by avoiding slow-downs by queue file I/O, logging I/O, and network I/O.
  * Concurrency was limited by the server process limit ("default_process_limit = 5" and "smtpd_client_event_limit_exceptions = static:all"). Postfix was stopped and started after changing the process limit, because the same number is also used as the backlog argument to the listen(2) system call, and "postfix reload" does not re-issue this call.
  * Mail was discarded with "local_recipient_maps = static:all" and "local_transport = discard". The discard action in access maps or header/body checks could not be used as it fails to update the in_flow_delay counters.

Client configuration:

  * Queue file overhead was minimized by sending one message to a virtual alias that expanded into 2000 different remote recipients. All recipients were accounted for according to the maillog file. The virtual_alias_expansion_limit setting was increased to avoid complaints from the cleanup(8) server.
  * The number of deliveries was maximized with "smtp_destination_recipient_limit = 2". A smaller limit would cause Postfix to schedule the concurrency per recipient instead of per domain, which is not what we want.
  * Maximum concurrency was limited with "smtp_destination_concurrency_limit = 20", and initial_destination_concurrency was set to the same value.
  * The positive and negative concurrency feedback hysteresis was 1. Concurrency was incremented by 1 at the END of 1/feedback steps of positive feedback, and was decremented by 1 at the START of 1/feedback steps of negative feedback.
  * The SMTP client used the default 30s SMTP connect timeout and 300s SMTP greeting timeout.

Impact of the 30s SMTP connect timeout

The first results are for a FreeBSD 6.2 server, where our artificially low listen(2) backlog results in a very short kernel queue for established connections. The table shows that all deferred deliveries failed due to a 30s connection timeout, and none failed due to a server greeting timeout. This measurement simulates what happens when the server's connection queue is completely full under load, and the TCP engine drops new connections.

    client server feedback   connection percentage client        timed-out in
    limit  limit  style      caching    deferred   concurrency   connect/greeting
                                                   avg / stddev
    --------------------------------------------------------------------------
    20     5      1/N        no          9.9       19.4 / 0.49   198 / -
    20     5      1/N        yes        10.3       19.4 / 0.49   206 / -
    20     5      1/sqrt(N)  no         10.4       19.6 / 0.59   208 / -
    20     5      1/sqrt(N)  yes        10.6       19.6 / 0.61   212 / -
    20     5      1          no         10.1       19.5 / 1.29   202 / -
    20     5      1          yes        10.8       19.3 / 1.57   216 / -
    --------------------------------------------------------------------------

    A busy server with a completely full connection queue. N is the client delivery concurrency. Failed deliveries time out after 30s without completing the TCP handshake. See text for a discussion of results.

Impact of the 300s SMTP greeting timeout

The next table shows results for a Fedora Core 8 server (results for RedHat 7.3 are identical). In this case, the artificially small listen(2) backlog argument does not impact our measurement. The table shows that practically all deferred deliveries fail after the 300s SMTP greeting timeout. As these timeouts were 10x longer than with the first measurement, we increased the recipient count (and thus the running time) by a factor of 10 to keep the results comparable.
The deferred mail percentages are a factor 10 lower than with the first measurement, because the 1s per-recipient delay was 1/300th of the greeting timeout instead of 1/30th of the connection timeout.

    client server feedback   connection percentage client        timed-out in
    limit  limit  style      caching    deferred   concurrency   connect/greeting
                                                   avg / stddev
    --------------------------------------------------------------------------
    20     5      1/N        no         1.16       19.8 / 0.37   -   / 230
    20     5      1/N        yes        1.36       19.8 / 0.36   -   / 272
    20     5      1/sqrt(N)  no         1.21       19.9 / 0.23   4   / 238
    20     5      1/sqrt(N)  yes        1.36       20.0 / 0.23   -   / 272
    20     5      1          no         1.18       20.0 / 0.16   -   / 236
    20     5      1          yes        1.39       20.0 / 0.16   -   / 278
    --------------------------------------------------------------------------

    A busy server with a non-full connection queue. N is the client delivery concurrency. Failed deliveries complete at the TCP level, but time out after 300s while waiting for the SMTP greeting. See text for a discussion of results.

Impact of active server concurrency limiter

The final concurrency-limited result shows what happens when SMTP connections don't time out, but are rejected immediately with the Postfix server's smtpd_client_connection_count_limit feature (the server replies with a 421 status and disconnects immediately). Similar results can be expected with concurrency limiting features built into other MTAs or firewalls. For this measurement we specified a server concurrency limit and a client initial destination concurrency of 5, and a server process limit of 10; all other conditions were the same as with the first measurement. The same result would be obtained with a FreeBSD or Linux server, because the "pushing back" is done entirely by the receiving side.

    client server feedback   connection percentage client        theoretical
    limit  limit  style      caching    deferred   concurrency   defer rate
                                                   avg / stddev
    --------------------------------------------------------------------------
    20     5      1/N        no         16.5       5.17 / 0.38   1/6
    20     5      1/N        yes        16.5       5.17 / 0.38   1/6
    20     5      1/sqrt(N)  no         24.5       5.28 / 0.45   1/4
    20     5      1/sqrt(N)  yes        24.3       5.28 / 0.46   1/4
    20     5      1          no         49.7       5.63 / 0.67   1/2
    20     5      1          yes        49.7       5.68 / 0.70   1/2
    --------------------------------------------------------------------------

    A server with an active per-client concurrency limiter that replies with 421 and disconnects. N is the client delivery concurrency. The theoretical defer rate is 1/(1+roundup(1/feedback)). This is always 1/2 with the fixed +/-1 feedback per delivery; with the concurrency-dependent feedback variants, the defer rate decreases with increasing concurrency. See text for a discussion of results.

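As a sanity check of the "theoretical defer rate" in the last table, the short C sketch below (illustrative only, not Postfix code) evaluates 1/(1+roundup(1/feedback)) at the client delivery concurrency N = 5 imposed by the server in this measurement; it reproduces the 1/6, 1/4 and 1/2 values above, and the 16.5% measured for 1/N feedback is indeed close to 1/6:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double N = 5;             /* client delivery concurrency */
        const char  *style[] = { "1/N", "1/sqrt(N)", "1" };
        const double feedback[] = { 1.0 / N, 1.0 / sqrt(N), 1.0 };
        int i;

        /* theoretical defer rate = 1 / (1 + roundup(1 / feedback)) */
        for (i = 0; i < 3; i++)
            printf("feedback %-10s theoretical defer rate = 1/%.0f\n",
                   style[i], 1.0 + ceil(1.0 / feedback[i]));
        return 0;
    }
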
Discussion of concurrency-limited server results

All results in the previous sections are based on the first delivery runs only; they do not include any second etc. delivery attempts. It's also worth noting that the measurements look at steady-state behavior only. They don't show what happens when the client starts sending at a much higher or lower concurrency.

The first two examples show that the effect of feedback is negligible when concurrency is limited due to congestion. This is because the initial concurrency is already at the client's concurrency maximum, and because there is 10-100 times more positive than negative feedback. Under these conditions, it is no surprise that the contribution from SMTP connection caching is also negligible.

In the last example, the old +/-1 feedback per delivery will defer 50% of the mail when confronted with an active (anvil-style) server concurrency limit, where the server hangs up immediately with a 421 status (a TCP-level RST would have the same result). Less aggressive feedback mechanisms fare better than more aggressive ones. Concurrency-dependent feedback fares even better at higher concurrencies than shown here, but has limitations as discussed in the next section.

Limitations of less-than-1 per delivery feedback

Less-than-1 feedback is of interest primarily when sending large amounts of mail to destinations with active concurrency limiters (servers that reply with 421, or firewalls that send RST). When sending small amounts of mail per destination, less-than-1 per-delivery feedback won't have a noticeable effect on the per-destination concurrency, because the number of deliveries to the same destination is too small. You might just as well use zero per-delivery feedback and stay with the initial per-destination concurrency. And when mail deliveries fail due to congestion instead of active concurrency limiters, the measurements above show that per-delivery feedback has no effect. With large amounts of mail you might just as well use zero per-delivery feedback and start with the maximal per-destination concurrency.

The scheduler with less-than-1 concurrency feedback per delivery solves a problem with servers that have active concurrency limiters. This works only because feedback is handled in a peculiar manner: positive feedback will increment the concurrency by 1 at the end of a sequence of events of length 1/feedback, while negative feedback will decrement concurrency by 1 at the beginning of such a sequence. This is how Postfix adjusts quickly for overshoot without causing lots of mail to be deferred. Without this difference in feedback treatment, less-than-1 feedback per delivery would defer 50% of the mail, and would be no better in this respect than the old +/-1 feedback per delivery.

Unfortunately, the same feature that corrects quickly for concurrency overshoot also makes the scheduler more sensitive to noisy negative feedback. The reason is that one lonely negative feedback event has the same effect as a complete sequence of length 1/feedback: in both cases delivery concurrency is dropped by 1 immediately. As a worst-case scenario, consider multiple servers behind a load balancer on a single IP address, and no backup MX address. When 1 out of K servers fails to complete the SMTP handshake or drops the connection, a scheduler with 1/N (N = concurrency) feedback stops increasing its concurrency once it reaches a concurrency level of about K, even though the good servers behind the load balancer are perfectly capable of handling more traffic.

This noise problem gets worse as the amount of positive feedback per delivery gets smaller. A compromise is to use fixed less-than-1 positive feedback values instead of concurrency-dependent positive feedback.
For example, to tolerate 1 of 4 bad servers in the above load balancer scenario, use positive feedback of 1/4 per "good" delivery (no connect or handshake error), and use an equal or smaller amount of negative feedback per "bad" delivery. The downside of using concurrency-independent feedback is that some of the old +/-1 feedback problems will return at large concurrencies. Sites that must deliver mail at non-trivial per-destination concurrencies will require special configuration.

Concurrency configuration parameters

The Postfix 2.5 concurrency scheduler is controlled with the following configuration parameters, where "transport_foo" provides a transport-specific parameter override, and where the Postfix version in which a parameter became available is shown in parentheses. All parameter default settings are compatible with earlier Postfix versions.

    initial_destination_concurrency (all)
    transport_initial_destination_concurrency (2.5)
        Initial per-destination delivery concurrency

    default_destination_concurrency_limit (all)
    transport_destination_concurrency_limit (all)
        Maximum per-destination delivery concurrency

    default_destination_concurrency_positive_feedback (2.5)
    transport_destination_concurrency_positive_feedback (2.5)
        Per-destination positive feedback amount, per delivery that does not
        fail with connection or handshake failure

    default_destination_concurrency_negative_feedback (2.5)
    transport_destination_concurrency_negative_feedback (2.5)
        Per-destination negative feedback amount, per delivery that fails
        with connection or handshake failure

    default_destination_concurrency_failed_cohort_limit (2.5)
    transport_destination_concurrency_failed_cohort_limit (2.5)
        Number of failed pseudo-cohorts after which a destination is declared
        "dead" and delivery is suspended

    destination_concurrency_feedback_debug (2.5)
        Enable verbose logging of concurrency scheduler activity

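For illustration only, a main.cf fragment along the following lines would apply the "1 of 4 bad servers" compromise discussed above to the "relay" transport. The values are examples rather than recommendations, and postconf(5) documents the exact value syntax accepted by the feedback parameters:

    # Example settings only; tune for your own environment.
    relay_initial_destination_concurrency = 5
    relay_destination_concurrency_limit = 20
    # Fixed less-than-1 feedback per delivery, instead of 1/N:
    relay_destination_concurrency_positive_feedback = 1/4
    relay_destination_concurrency_negative_feedback = 1/8
    # Declare a destination "dead" only after several failed pseudo-cohorts:
    relay_destination_concurrency_failed_cohort_limit = 10
    # Log what the concurrency scheduler is doing:
    destination_concurrency_feedback_debug = yes
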
Preemptive scheduling

The following sections describe the new queue manager and its preemptive scheduler algorithm. Note that the document was originally written to describe the changes between the new queue manager (in this text referred to as nqmgr, the name it was known by before it became the default queue manager) and the old queue manager (referred to as oqmgr). This is why it refers to oqmgr every so often.

This document is divided into sections as follows:

  * The structures used by nqmgr
  * What happens when nqmgr picks up the message - how it is assigned to transports, jobs, peers, entries
  * How the entry selection works
  * How the preemption works - what messages may be preempted and how and what messages are chosen to preempt them
  * How destination concurrency limits affect the scheduling algorithm
  * Dealing with memory resource limits

The structures used by nqmgr

Let's start by recapitulating the structures and terms used when referring to the queue manager and how it operates. Many of these are partially described elsewhere, but it is nice to have a coherent overview in one place:

  * Each message structure represents one mail message which Postfix is to deliver. The message recipients specify to what destinations the message is to be delivered and what transports are going to be used for the delivery.

  * Each recipient entry groups a batch of recipients of one message which are all going to be delivered to the same destination (and over the same transport).

  * Each transport structure groups everything that is going to be delivered by the delivery agents dedicated to that transport. Each transport maintains a set of queues (describing the destinations it shall talk to) and jobs (referencing the messages it shall deliver).

  * Each transport queue (not to be confused with the on-disk "active" queue or "incoming" queue) groups everything that is going to be delivered to a given destination (aka nexthop) by its transport. Each queue belongs to one transport, so each destination may be referred to by several queues, one for each transport. Each queue maintains a list of all recipient entries (batches of message recipients) which shall be delivered to the given destination (the todo list), and a list of recipient entries already being delivered by the delivery agents (the busy list).

  * Each queue corresponds to multiple peer structures. Each peer structure is like the queue structure, belonging to one transport and referencing one destination. The difference is that it lists only the recipient entries which all originate from the same message, unlike the queue structure, whose entries may originate from various messages. For messages with few recipients, there is usually just one recipient entry for each destination, resulting in one recipient entry per peer. But for large mailing list messages the recipients may need to be split into multiple recipient entries, in which case the peer structure may list many entries for a single destination.

  * Each transport job groups everything it takes to deliver one message via its transport. Each job represents one message within the context of the transport. The job belongs to one transport and one message, so each message may have multiple jobs, one for each transport. The job groups all the peer structures, which describe the destinations the job's message has to be delivered to.

The first four structures are common to both nqmgr and oqmgr; the latter two were introduced by nqmgr.

These terms are used extensively in the text below; feel free to look up the descriptions above any time you feel you have lost track of what is what.

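For readers who think in code, the C fragment below sketches the ownership relations just described. These are NOT the actual nqmgr declarations; the type and field names are invented for this illustration:

    struct transport {              /* everything delivered by one transport */
        struct queue *queues;       /* per-destination queues */
        struct job   *job_list;     /* jobs, ordered by message enqueue time */
    };

    struct queue {                  /* one (transport, nexthop) destination */
        struct transport *transport;
        struct entry *todo;         /* recipient entries awaiting delivery */
        struct entry *busy;         /* recipient entries being delivered */
    };

    struct message {                /* one queued mail message */
        struct job *jobs;           /* one job per transport used */
    };

    struct job {                    /* one message within one transport */
        struct transport *transport;
        struct message   *message;
        struct peer      *peers;    /* this message's per-destination peers */
    };

    struct peer {                   /* one destination within one job */
        struct job   *job;
        struct queue *queue;        /* the shared per-destination queue */
        struct entry *entries;      /* entries originating from this message */
    };

    struct entry {                  /* a batch of recipients of one message */
        struct queue *queue;        /* all going to the same destination... */
        struct peer  *peer;         /* ...over the same transport */
    };
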
What happens when nqmgr picks up the message

Whenever nqmgr moves a queue file into the "active" queue, the following happens: It reads all necessary information from the queue file as oqmgr does, and also reads as many recipients as possible - more on that later; for now let's just pretend it always reads all recipients.

Then it resolves the recipients as oqmgr does, which means obtaining an (address, nexthop, transport) triple for each recipient. For each triple, it finds the transport; if it does not exist yet, it instantiates it (unless it's dead). Within the transport, it finds the destination queue for the given nexthop; if it does not exist yet, it instantiates it (unless it's dead). The triple is then bound to the given destination queue. This happens in qmgr_resolve() and is basically the same as in oqmgr.

Then for each triple which was bound to some queue (and thus transport), the program finds the job which represents the message within that transport's context; if it does not exist yet, it instantiates it. Within the job, it finds the peer which represents the bound destination queue within this job's context; if it does not exist yet, it instantiates it. Finally, it stores the address from the resolved triple into the recipient entry which is appended to both the queue entry list and the peer entry list. The addresses for the same nexthop are batched in the entries up to the transport_destination_recipient_limit for that transport. This happens in qmgr_message_assign(), and apart from the fact that it operates with job and peer structures, it is basically the same as in oqmgr.

When the job is instantiated, it is enqueued on the transport's job list based on the time its message was picked up by nqmgr. For the first batch of recipients this means it is appended to the end of the job list, but the ordering of the job list by the enqueue time is important as we will see shortly.

[Now you should have a pretty good idea what the state of the nqmgr is after a couple of messages have been picked up, and what the relation is between all those job, peer, queue and entry structures.]

How the entry selection works

Having prepared all those above mentioned structures, the task of the nqmgr's scheduler is to choose the recipient entries one at a time and pass them to the delivery agent for the corresponding transport. Now how does this work?

The first approximation of the new scheduling algorithm is like this:

    foreach transport (round-robin-by-transport)
    do
        if transport busy continue
        if transport process limit reached continue
        foreach transport's job (in the order of the transport's job list)
        do
            foreach job's peer (round-robin-by-destination)
                if peer->queue->concurrency < peer->queue->window
                    return next peer entry.
            done
        done
    done

Now what is the "order of the transport's job list"? As we know already, the job list is by default kept in the order the message was picked up by the nqmgr. So by default we get round-robin selection of the transports at the top level, and within each transport we get FIFO message delivery. The round-robin selection of the peers by destination is perhaps of little importance in most real-life cases (unless the transport_destination_recipient_limit is reached, there is only one peer structure for each destination in a job), but theoretically it makes sure that even within single jobs, destinations are treated fairly.

[By now you should have a feeling you really know how the scheduler works, except for the preemption, under ideal conditions - that is, no recipient resource limits and no destination concurrency problems.]

How the preemption works

As you might perhaps expect by now, the transport's job list does not remain sorted by the job's message enqueue time all the time. The coolest thing about nqmgr is not the simple FIFO delivery, but that it is able to slip mail with few recipients past mailing-list bulk mail. This is what the job preemption is about - shuffling the jobs on the transport's job list to get the best message delivery rates. Now how is it achieved?

First I have to tell you that there are in fact two job lists in each transport.
One is the scheduler's job list, which the scheduler is free to play with, while the other one keeps the jobs always listed in the order of the enqueue time and is used for the recipient pool management we will discuss later. For now, we will deal with the scheduler's job list only.

So, we have the job list, which is first ordered by the time the jobs' messages were enqueued, oldest messages first, the most recently picked one at the end. For now, let's assume that there are no destination concurrency problems. Without preemption, we pick some entry of the first (oldest) job on the queue, assign it to a delivery agent, pick another one from the same job, assign it again, and so on, until all the entries are used and the job is delivered. We would then move on to the next job, and so on and on. Now how do we manage to sneak in some entries from the recently added jobs when the first job on the job list belongs to a mailing-list message and has thousands of recipient entries?

The nqmgr's answer is that we can artificially "inflate" the delivery time of that first job by some constant for free - it is basically the same trick you might remember as "accumulation of potential" from amortized complexity lessons. For example, instead of delivering the entries of the first job on the job list every time a delivery agent becomes available, we can do it only every second time. If you view the moments the delivery agent becomes available on a timeline as "delivery slots", then instead of using every delivery slot for the first job, we can use only every other slot, and still the overall delivery efficiency of the first job remains the same. So the delivery 11112222 becomes 1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . denotes the free slot). Now what do we do with the free slots?

As you might have guessed, we will use them for sneaking in the mail with few recipients. For example, if we have one four-recipient mail followed by four one-recipient mails, the delivery sequence (that is, the sequence in which the jobs are assigned to the delivery slots) might look like this: 12131415. Hmm, fine for sneaking in the single-recipient mail, but how do we sneak in mail with more than one recipient? Say we have one four-recipient mail followed by two two-recipient mails?

The simple answer would be to use the delivery sequence 12121313. But the problem is that this does not scale well. Imagine you have mail with a thousand recipients followed by mail with a hundred recipients. It is tempting to suggest a delivery sequence like 121212...., but alas! Imagine another mail arrives with, say, ten recipients. But there are no free slots anymore, so it can't slip by, not even if it had only one recipient. It will be stuck until the hundred-recipient mail is delivered, which really sucks.

So, it becomes obvious that while inflating the message to get free slots is a great idea, one has to be really careful of how the free slots are assigned, otherwise one might get cornered. So, how does nqmgr really use the free slots?

The key idea is that one does not have to generate the free slots in a uniform way. The delivery sequence 111...1 is no worse than 1.1.1.1, in fact, it is even better as some entries are in the first case selected earlier than in the second case, and none is selected later! So it is possible to first "accumulate" the free delivery slots and then use them all at once.
It is even possible to accumulate some, then use them, then accumulate some more and use them again, as in 11..1.1 .

Let's get back to the one-hundred-recipient example. We now know that we could first accumulate one hundred free slots, and only then preempt the first job and sneak the one-hundred-recipient mail in. Applying the algorithm recursively, we see the hundred-recipient job can accumulate ten free delivery slots, and then we could preempt it and sneak in the ten-recipient mail... Wait wait wait! Could we? Aren't we overinflating the original one-thousand-recipient mail?

Well, despite the fact that it looks so at first glance, another trick will allow us to answer "no, we are not!". If we had said that we will inflate the delivery time at most twice, and then we consider every other slot as a free slot, then we would overinflate in case of the recursive preemption. BUT! The trick is that if we use only every n-th slot as a free slot for n>2, there is always some worst inflation factor which we can guarantee not to be breached, even if we apply the algorithm recursively. To be precise, if for every k>1 normally used slots we accumulate one free delivery slot, then the inflation factor is not worse than k/(k-1) no matter how many recursive preemptions happen. And it's not worse than (k+1)/k if only non-recursive preemption happens. Now, having got through the theory and the related math, let's see how nqmgr implements this.

Each job has a so-called "available delivery slot" counter. Each transport has a transport_delivery_slot_cost parameter, which defaults to the default_delivery_slot_cost parameter, which is set to 5 by default. This is the k from the paragraph above. Each time k entries of the job are selected for delivery, this counter is incremented by one. Once there are some slots accumulated, a job which requires no more than that number of slots to be fully delivered can preempt this job.

[Well, the truth is, the counter is incremented every time an entry is selected and it is divided by k when it is used. But for understanding it's good enough to use the above approximation of the truth.]

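As a worked example (using the default slot cost purely for illustration): with k = 5, a job whose message has 1000 recipient entries can accumulate at most

    1000 entries / 5 = 200 delivery slots,

so any job that can be fully delivered within 200 entry selections may eventually preempt it, and by the bounds above the 1000-entry job's own delivery time is inflated by at most

    k/(k-1) = 5/4 = 1.25   with recursive preemption,
    (k+1)/k = 6/5 = 1.20   with non-recursive preemption only.
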
OK, so now we know the conditions which must be satisfied so that one job can preempt another one. But what job gets preempted, how do we choose what job preempts it if there are several valid candidates, and when does all this exactly happen?

The answer to the first part is simple. The job whose entry was selected the last time is the so-called current job. Normally, it is the first job on the scheduler's job list, but destination concurrency limits may change this as we will see later. It is always only the current job which may get preempted.

Now for the second part. The current job has a certain amount of recipient entries, and as such may accumulate at maximum some amount of available delivery slots. It might have already accumulated some, and perhaps even already used some when it was preempted before (remember a job can be preempted several times). In either case, we know how many are accumulated and how many are left to deliver, so we know how many it may yet accumulate at maximum. Every other job which may be delivered by less than that number of slots is a valid candidate for preemption. How do we choose among them?

The answer is - the one with the maximum enqueue_time/recipient_entry_count. That is, the older the job is, the more we should try to deliver it in order to get the best message delivery rates. These rates are of course subject to how many recipients the message has, therefore the division by the recipient (entry) count. No one shall be surprised that a message with n recipients takes n times longer to deliver than a message with one recipient.

Now let's recap the previous two paragraphs. Isn't it too complicated? Why aren't the candidates chosen only from among the jobs which can be delivered within the number of slots the current job has already accumulated? Why do we need to estimate how much it has yet to accumulate? If you found out the answer, congratulate yourself. If we did it this simple way, we would always choose the candidate with the fewest recipient entries. If there were enough single-recipient mails coming in, they would always slip by the bulk mail as soon as possible, and mail with two or more recipients would never get a chance, no matter how long it has been sitting around in the job list.

This candidate selection has an interesting implication - that when we choose the best candidate for preemption (this is done in qmgr_choose_candidate()), it may happen that we cannot use it for preemption immediately. This leads to an answer to the last part of the original question - when does the preemption happen?

The preemption attempt happens every time the transport's next recipient entry is to be chosen for delivery. To avoid needless overhead, the preemption is not attempted if the current job could never accumulate more than transport_minimum_delivery_slots (defaults to default_minimum_delivery_slots, which defaults to 3). If there are already enough accumulated slots to preempt the current job by the chosen best candidate, it is done immediately. This basically means that the candidate is moved in front of the current job on the scheduler's job list and the accumulated slot counter is decreased by the amount used by the candidate. If there are not enough slots... well, I could say that nothing happens and another preemption is attempted the next time. But that's not the complete truth.

The truth is that it turns out that it is not really necessary to wait until the job's counter accumulates all the delivery slots in advance. Say we have a ten-recipient mail followed by two two-recipient mails. If the preemption happened only when enough delivery slots accumulate (assuming slot cost 2), the delivery sequence becomes 11112211113311. Now what would we get if we waited only for 50% of the necessary slots to accumulate and promised to wait for the remaining 50% later, after we get back to the preempted job? If we use such a slot loan, the delivery sequence becomes 11221111331111. As we can see, it is not considerably worse for the delivery of the ten-recipient mail, but it allows the small messages to be delivered sooner.

The concept of these slot loans is where the transport_delivery_slot_discount and transport_delivery_slot_loan come from (they default to default_delivery_slot_discount and default_delivery_slot_loan, whose values are by default 50 and 3, respectively). The discount (resp. loan) specifies how many percent (resp. how many slots) one "gets in advance", when the number of slots required to deliver the best candidate is compared with the number of slots the current job has accumulated so far.

And that pretty much concludes this chapter.

[Now you should have a feeling that you pretty much understand the scheduler and the preemption, or at least that you will have after you read the last chapter a couple more times. You shall clearly see the job list and the preemption happening at its head, in ideal delivery conditions. The feeling of understanding shall last until you start wondering what happens if some of the jobs are blocked, which you might eventually figure out correctly from what has been said already. But I would be surprised if your mental image of the scheduler's functionality is not completely shattered once you start wondering how it works when not all recipients may be read in-core. More on that later.]

How destination concurrency limits affect the scheduling algorithm

The nqmgr uses the same algorithm for destination concurrency control as oqmgr. Now what happens when the destination limits are reached and no more entries for that destination may be selected by the scheduler?

From the user's point of view it is all simple. If some of the peers of a job can't be selected, those peers are simply skipped by the entry selection algorithm (the pseudo-code described before) and only the selectable ones are used. If none of the peers may be selected, the job is declared a "blocker job". Blocker jobs are skipped by the entry selection algorithm and they are also excluded from the candidates for preemption of the current job. Thus the scheduler effectively behaves as if the blocker jobs didn't exist on the job list at all. As soon as at least one of the peers of a blocker job becomes unblocked (that is, the delivery agent handling the delivery of the recipient entry for the given destination successfully finishes), the job's blocker status is removed and the job again participates in all further scheduler actions normally.

So the summary is that the users don't really have to be concerned about the interaction of the destination limits and the scheduling algorithm. It works well on its own and there are no knobs they would need to control it.

From a programmer's point of view, the blocker jobs complicate the scheduler quite a lot. Without them, the jobs on the job list would normally be delivered in strict FIFO order. If the current job is preempted, the job preempting it is completely delivered unless it is preempted itself. Without blockers, the current job is thus always either the first job on the job list, or the top of the stack of jobs preempting the first job on the job list.

The visualization of the job list and the preemption stack without blockers would be like this:

    first job->    1--2--3--5--6--8--...    <- job list
    on job list    |
                   4        <- preemption stack
                   |
    current job->  7

In the example above we see that job 1 was preempted by job 4 and then job 4 was preempted by job 7. After job 7 is completed, the remaining entries of job 4 are selected, and once they are all selected, job 1 continues.

As we see, it's all very clean and straightforward. Now how does this change because of blockers?

The answer is: a lot. Any job may become a blocker job at any time, and also become a normal job again at any time. This has several important implications:

 1. The jobs may be completed in arbitrary order. For example, in the example above, if the current job 7 becomes blocked, the next job 4 may complete before job 7 becomes unblocked again.
    Or if both 7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked and is completed, then 2 is completed and only after that 4 becomes unblocked and is completed... You get the idea.

    [Interesting side note: even when jobs are delivered out of order, from a single destination's point of view the jobs are still delivered in the expected order (that is, FIFO unless there was some preemption involved). This is because whenever a destination queue becomes unblocked (the destination limit allows selection of more recipient entries for that destination), all jobs which have peers for that destination are unblocked at once.]

 2. The idea of the preemption stack at the head of the job list is gone. That is, it must be possible to preempt any job on the job list. For example, if jobs 7, 4, 1 and 2 in the example above all become blocked, job 3 becomes the current job. And of course we do not want the preemption to be affected by whether there are some blocked jobs or not. Therefore, if it turns out that job 3 might be preempted by job 6, the implementation shall make it possible.

 3. The idea of the linear preemption stack itself is gone. It's no longer true that one job is always preempted by only one job at one time (that is, directly preempted, not counting the recursively nested jobs). For example, in the example above, job 1 is directly preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes blocked, and job 4 is being delivered. If it accumulates enough delivery slots, it is natural that it might be preempted for example by job 8. Now job 4 is preempted by both job 7 AND job 8 at the same time.

Now combine points 2) and 3) with point 1) again and you realize that the relations on the once linear job list have become pretty complicated. If we extend the point 3) example: jobs 7 and 8 preempt job 4, now job 8 becomes blocked too, then job 4 completes. Tricky, huh?

If I illustrate the relations after the above mentioned examples (excluding those in point 1), the situation would look like this:

                              v- parent

    adoptive parent ->  1--2--3--5--...      <- "stack" level 0
                        |     |
    parent gone ->      ?     6              <- "stack" level 1
                       / \
    children ->       7   8   ^- child       <- "stack" level 2

                      ^- siblings

Now how does nqmgr deal with all these complicated relations?

Well, it maintains them all as described, but fortunately, all these relations are necessary only for the purpose of proper counting of available delivery slots. For the purpose of ordering the jobs for entry selection, the original rule still applies: "the job preempting the current job is moved in front of the current job on the job list". So for entry selection purposes, the job relations remain as simple as this:

    7--8--1--2--6--3--5--..   <- scheduler's job list order

The job list order and the preemption parent/child/sibling relations are maintained separately. And because the selection works only with the job list, you can happily forget about those complicated relations unless you want to study the nqmgr sources. In that case the text above might provide some helpful introduction to the problem domain. Otherwise I suggest you just forget about all this and stick with the user's point of view: the blocker jobs are simply ignored.

[By now, you should have a feeling that there are more things going on under the hood than you ever wanted to know.
You decide that forgetting about this chapter is the best you can do for the sake of your mind's health and you basically stick with the idea of how the scheduler works in ideal conditions, when there are no blockers, which is good enough.]

Dealing with memory resource limits

When discussing the nqmgr scheduler, we have so far assumed that all recipients of all messages in the "active" queue are completely read into memory. This is simply not true. There is an upper bound on the amount of memory the nqmgr may use, and therefore it must impose some limits on the information it may store in memory at any given time.

First of all, not all messages may be read in-core at once. At any time, at most qmgr_message_active_limit messages may be read in-core. When read into memory, the messages are picked from the "incoming" and "deferred" queues and moved to the "active" queue (incoming having priority), so if there are more than qmgr_message_active_limit messages queued in the "active" queue, the rest will have to wait until (some of) the messages in the "active" queue are completely delivered (or deferred).

Even with the limited amount of in-core messages, there is another limit which must be imposed in order to avoid memory exhaustion. Each message may contain a huge number of recipients (tens or hundreds of thousands are not uncommon), so if nqmgr were to read all recipients of all messages in the "active" queue, it might easily run out of memory. Therefore there must be some upper bound on the number of message recipients which are read into memory at the same time.

Before discussing how exactly nqmgr implements the recipient limits, let's see how the sole existence of the limits themselves affects the nqmgr and its scheduler.

The message limit is straightforward - it just limits the size of the lookahead the nqmgr's scheduler has when choosing which message can preempt the current one. Messages not in the "active" queue are simply not considered at all.

The recipient limit complicates things more. First of all, the message reading code must support reading the recipients in batches, which among other things means accessing the queue file several times and continuing where the last recipient batch ended. This is invoked by the scheduler whenever the current job has space for more recipients, subject to the transport's refill_limit and refill_delay parameters. It is also done whenever all in-core recipients of the message have been dealt with (which may also mean they were deferred) but there are still more in the queue file.

The second complication is that with some recipients left unread in the queue file, the scheduler can't operate with exact counts of recipient entries. With unread recipients, it is not clear how many recipient entries there will be, as they are subject to per-destination grouping. It is not even clear to what transports (and thus jobs) the recipients will be assigned. And with messages coming from the "deferred" queue, it is not even clear how many unread recipients are still to be delivered. This all means that the scheduler must use only estimates of how many recipient entries there will be. Fortunately, it is possible to estimate the minimum and maximum correctly, so the scheduler can always err on the safe side.
Obviously, the better the estimates, the better the results, so it is best when we are able to read all recipients in-core and turn the estimates into exact counts, or at least try to read as many as possible to make the estimates as accurate as possible.

The third complication is that it is no longer true that the scheduler is done with a job once all of its in-core recipients are delivered. It is possible that the job will be revived later, when another batch of recipients is read in core. It is also possible that some jobs will be created for the first time long after the first batch of recipients was read in core. The nqmgr code must be ready to handle all such situations.

And finally, the fourth complication is that the nqmgr code must somehow impose the recipient limit itself. Now how does it achieve that?

Perhaps the easiest solution would be to say that each message may have at most X recipients stored in-core, but such a solution would be poor for several reasons. With reasonable qmgr_message_active_limit values, the X would have to be quite low to maintain a reasonable memory footprint. And with a low X lots of things would not work well. The nqmgr would have problems using the transport_destination_recipient_limit efficiently. The scheduler's preemption would be suboptimal as the recipient count estimates would be inaccurate. The message queue file would have to be accessed many times to read in more recipients again and again.

Therefore it seems reasonable to have a solution which does not use a limit imposed on a per-message basis, but which maintains a pool of available recipient slots that can be shared among all messages in the most efficient manner. And as we do not want separate transports to compete for resources whenever possible, it seems appropriate to maintain such a recipient pool for each transport separately. This is the general idea; now how does it work in practice?

First we have to solve a little chicken-and-egg problem. If we want to use the per-transport recipient pools, we first need to know to what transport(s) the message is assigned. But we will find that out only after we first read in the recipients. So it is obvious that we first have to read in some recipients, use them to find out to what transports the message is to be assigned, and only after that can we use the per-transport recipient pools.

Now how many recipients shall we read for the first time? This is what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit values control. The qmgr_message_recipient_minimum value specifies how many recipients of each message we will read the first time, no matter what. It is necessary to read at least one recipient before we can assign the message to a transport and create the first job. However, reading only qmgr_message_recipient_minimum recipients even if there are only a few messages with few recipients in-core would be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit recipients in-core so far, the first batch of recipients may be larger than qmgr_message_recipient_minimum - as large as is required to reach the qmgr_message_recipient_limit limit.

Once the first batch of recipients was read in core and the message jobs were created, the size of the subsequent recipient batches (if any - of course it's best when all recipients are read in one batch) is based solely on the position of the message jobs on their corresponding transports' job lists.
Each transport has a pool of transport_recipient_limit recipient slots which it can distribute among its jobs (how this is done is described later). The subsequent recipient batch may be as large as the sum of all recipient slots of all jobs of the message permits (plus the qmgr_message_recipient_minimum amount which always applies).

For example, if a message has three jobs, the first with 1 recipient still in-core and 4 recipient slots, the second with 5 recipients in-core and 5 recipient slots, and the third with 2 recipients in-core and 0 recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 recipient slots in total. This means that we could immediately read 2+qmgr_message_recipient_minimum more recipients of that message in core.

The above example illustrates several things which might be worth mentioning explicitly: first, note that although the per-transport slots are assigned to particular jobs, we can't guarantee that once the next batch of recipients is read in core, the corresponding amounts of recipients will be assigned to those jobs. The jobs lend their slots to the message as a whole, so it is possible that some jobs end up sponsoring other jobs of their message. For example, if in the example above the 2 newly read recipients were assigned to the second job, the first job sponsored the second job with 2 slots. The second notable thing is the third job, which has more recipients in-core than it has slots. Apart from the sponsoring by another job which we just saw, this can be a result of the first recipient batch, which is sponsored from the global recipient pool of qmgr_message_recipient_limit recipients. It can also be sponsored from the message recipient pool of qmgr_message_recipient_minimum recipients.

Now how does each transport distribute the recipient slots among its jobs? The strategy is quite simple. As most scheduler activity happens at the head of the job list, it is our intention to make sure that the scheduler has the best estimates of the recipient counts for those jobs. As we mentioned above, this means that we want to try to make sure that the messages of those jobs have all recipients read in-core. Therefore the transport distributes the slots "along" the job list from start to end. In this case the job list sorted by message enqueue time is used, because it doesn't change over time as the scheduler's job list does.

More specifically, each time a job is created and appended to the job list, it gets all unused recipient slots from its transport's pool. It keeps them until all recipients of its message are read. When this happens, all unused recipient slots are transferred to the next job (which is now in fact the first such job) on the job list which still has some recipients unread, or eventually back to the transport pool if there is no such job. Such a transfer then also happens whenever a recipient entry of that job is delivered.

There is also a scenario when a job is not appended to the end of the job list (for example it was created as a result of a second or later recipient batch). Then it works exactly as above, except that if it was put in front of the first unread job (that is, the job of a message which still has some unread recipients in the queue file), that job is first forced to return all of its unused recipient slots to the transport pool.

The algorithm just described leads to the following state: The first unread job on the job list always gets all the remaining recipient slots of that transport (if there are any). The jobs queued before this job are completely read (that is, all recipients of their message were already read in core) and have at most as many slots as they still have recipients in-core (the maximum is there because of the sponsoring mentioned before). The jobs after this job get nothing from the transport recipient pool (unless they got something before and the first unread job was created and enqueued in front of them later - in such a case, they also get at most as many slots as they have recipients in-core).

Things work fine in such a state most of the time, because the current job is either completely read in-core or holds all the recipient slots there are, but there is one situation which we still have to take care of specially. Imagine that the current job is preempted by some unread job from the job list and there are no more recipient slots available, so this new current job could read only batches of qmgr_message_recipient_minimum recipients at a time. This would really degrade performance. For this reason, each transport has an extra pool of transport_extra_recipient_limit recipient slots, dedicated exactly for this situation. Each time an unread job preempts the current job, it gets half of the remaining recipient slots from the normal pool and this extra pool.

And that's it. It sure does sound pretty complicated, but fortunately most people don't really have to care exactly how it works as long as it works. Perhaps the only important things to know for most people are the following upper bound formulas:

Each transport has at maximum

    max(
        qmgr_message_recipient_minimum * qmgr_message_active_limit
            + *_recipient_limit + *_extra_recipient_limit,
        qmgr_message_recipient_limit
    )

recipients in core.

The total amount of recipients in core is

    max(
        qmgr_message_recipient_minimum * qmgr_message_active_limit
            + sum( *_recipient_limit + *_extra_recipient_limit ),
        qmgr_message_recipient_limit
    )

where the sum is over all used transports.

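As a purely illustrative plug-in of numbers (example values, not defaults or recommendations): a single transport with transport_recipient_limit = 20000 and transport_extra_recipient_limit = 1000, combined with qmgr_message_recipient_minimum = 10, qmgr_message_active_limit = 1000 and qmgr_message_recipient_limit = 20000, would be bounded by

    max(10 * 1000 + 20000 + 1000, 20000) = max(31000, 20000) = 31000

recipients in core.
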
And this terribly complicated chapter concludes the documentation of the nqmgr scheduler.

[By now you should theoretically know the nqmgr scheduler inside out. In practice, you still hope that you will never have to really understand the last or last two chapters completely, and fortunately most people really won't. Understanding how the scheduler works in ideal conditions is more than good enough for the vast majority of users.]

Credits

  * Wietse Venema designed and implemented the initial queue manager with per-domain FIFO scheduling, and per-delivery +/-1 concurrency feedback.
  * Patrik Rak designed and implemented preemption where mail with fewer recipients can slip past mail with more recipients in a controlled manner, and wrote up its documentation.
  * Wietse Venema initiated a discussion with Patrik Rak and Victor Duchovni on alternatives for the +/-1 feedback scheduler's aggressive behavior. This is when K/N feedback was reviewed (N = concurrency). The discussion ended without a good solution for both negative feedback and dead site detection.
  * Victor Duchovni resumed work on concurrency feedback in the context of concurrency-limited servers.
  * Wietse Venema then re-designed the concurrency scheduler in terms of the simplest possible concepts: less-than-1 concurrency feedback per delivery, forward and reverse concurrency feedback hysteresis, and pseudo-cohort failure. At this same time, concurrency feedback was separated from dead site detection.
  * These simplifications, and their modular implementation, helped to develop further insights into the different roles that positive and negative concurrency feedback play, and helped to identify some worst-case scenarios.