summaryrefslogtreecommitdiffstats
path: root/README_FILES/SCHEDULER_README
diff options
context:
space:
mode:
Diffstat (limited to 'README_FILES/SCHEDULER_README')
-rw-r--r--README_FILES/SCHEDULER_README1160
1 files changed, 1160 insertions, 0 deletions
diff --git a/README_FILES/SCHEDULER_README b/README_FILES/SCHEDULER_README
new file mode 100644
index 0000000..a6f7702
--- /dev/null
+++ b/README_FILES/SCHEDULER_README
@@ -0,0 +1,1160 @@
+PPoossttffiixx QQuueeuuee SScchheedduulleerr
+
+-------------------------------------------------------------------------------
+
+DDiissccllaaiimmeerr
+
+Many of the transport-specific configuration parameters discussed in this
+document will not show up in "postconf" command output before Postfix version
+2.9. This limitation applies to many parameters whose name is a combination of
+a master.cf service name such as "relay" and a built-in suffix such as
+"_destination_concurrency_limit".
+
+OOvveerrvviieeww
+
+The queue manager is by far the most complex part of the Postfix mail system.
+It schedules delivery of new mail, retries failed deliveries at specific times,
+and removes mail from the queue after the last delivery attempt. There are two
+major classes of mechanisms that control the operation of the queue manager.
+
+Topics covered by this document:
+
+ * Concurrency scheduling, concerned with the number of concurrent deliveries
+ to a specific destination, including decisions on when to suspend
+ deliveries after persistent failures.
+ * Preemptive scheduling, concerned with the selection of email messages and
+ recipients for a given destination.
+ * Credits, something this document would not be complete without.
+
+CCoonnccuurrrreennccyy sscchheedduulliinngg
+
+The following sections document the Postfix 2.5 concurrency scheduler, after a
+discussion of the limitations of the earlier concurrency scheduler. This is
+followed by results of medium-concurrency experiments, and a discussion of
+trade-offs between performance and robustness.
+
+The material is organized as follows:
+
+ * Drawbacks of the existing concurrency scheduler
+ * Summary of the Postfix 2.5 concurrency feedback algorithm
+ * Summary of the Postfix 2.5 "dead destination" detection algorithm
+ * Pseudocode for the Postfix 2.5 concurrency scheduler
+ * Results for delivery to concurrency limited servers
+ * Discussion of concurrency limited server results
+ * Limitations of less-than-1 per delivery feedback
+ * Concurrency configuration parameters
+
+DDrraawwbbaacckkss ooff tthhee eexxiissttiinngg ccoonnccuurrrreennccyy sscchheedduulleerr
+
+From the start, Postfix has used a simple but robust algorithm where the per-
+destination delivery concurrency is decremented by 1 after delivery failed due
+to connection or handshake failure, and incremented by 1 otherwise. Of course
+the concurrency is never allowed to exceed the maximum per-destination
+concurrency limit. And when a destination's concurrency level drops to zero,
+the destination is declared "dead" and delivery is suspended.
+
+Drawbacks of +/-1 concurrency feedback per delivery are:
+
+ * Overshoot due to exponential delivery concurrency growth with each pseudo-
+ cohort(*). This can be an issue with high-concurrency channels. For
+ example, with the default initial concurrency of 5, concurrency would
+ proceed over time as (5-10-20).
+
+ * Throttling down to zero concurrency after a single pseudo-cohort(*)
+ failure. This was especially an issue with low-concurrency channels where a
+ single failure could be sufficient to mark a destination as "dead", causing
+ the suspension of further deliveries to the affected destination.
+
+(*) A pseudo-cohort is a number of delivery requests equal to a destination's
+delivery concurrency.
+
+The revised concurrency scheduler has a highly modular structure. It uses
+separate mechanisms for per-destination concurrency control and for "dead
+destination" detection. The concurrency control in turn is built from two
+separate mechanisms: it supports less-than-1 feedback per delivery to allow for
+more gradual concurrency adjustments, and it uses feedback hysteresis to
+suppress concurrency oscillations. And instead of waiting for delivery
+concurrency to throttle down to zero, a destination is declared "dead" after a
+configurable number of pseudo-cohorts reports connection or handshake failure.
+
+SSuummmmaarryy ooff tthhee PPoossttffiixx 22..55 ccoonnccuurrrreennccyy ffeeeeddbbaacckk aallggoorriitthhmm
+
+We want to increment a destination's delivery concurrency when some (not
+necessarily consecutive) number of deliveries complete without connection or
+handshake failure. This is implemented with positive feedback g(N) where N is
+the destination's delivery concurrency. With g(N)=1 feedback per delivery,
+concurrency increases by 1 after each positive feedback event; this gives us
+the old scheduler's exponential growth in time. With g(N)=1/N feedback per
+delivery, concurrency increases by 1 after an entire pseudo-cohort N of
+positive feedback reports; this gives us linear growth in time. Less-than-
+1 feedback per delivery and integer truncation naturally give us hysteresis, so
+that transitions to larger concurrency happen every 1/g(N) positive feedback
+events.
+
+We want to decrement a destination's delivery concurrency when some (not
+necessarily consecutive) number of deliveries complete after connection or
+handshake failure. This is implemented with negative feedback f(N) where N is
+the destination's delivery concurrency. With f(N)=1 feedback per delivery,
+concurrency decreases by 1 after each negative feedback event; this gives us
+the old scheduler's behavior where concurrency is throttled down dramatically
+after a single pseudo-cohort failure. With f(N)=1/N feedback per delivery,
+concurrency backs off more gently. Again, less-than-1 feedback per delivery and
+integer truncation naturally give us hysteresis, so that transitions to lower
+concurrency happen every 1/f(N) negative feedback events.
+
+However, with negative feedback we introduce a subtle twist. We "reverse" the
+negative hysteresis cycle so that the transition to lower concurrency happens
+at the bbeeggiinnnniinngg of a sequence of 1/f(N) negative feedback events. Otherwise, a
+correction for overload would be made too late. This makes the choice of f(N)
+relatively unimportant, as borne out by measurements later in this document.
+
+In summary, the main ingredients for the Postfix 2.5 concurrency feedback
+algorithm are a) the option of less-than-1 positive feedback per delivery to
+avoid overwhelming servers, b) the option of less-than-1 negative feedback per
+delivery to avoid giving up too fast, c) feedback hysteresis to avoid rapid
+oscillation, and d) a "reverse" hysteresis cycle for negative feedback, so that
+it can correct for overload quickly.
+
+SSuummmmaarryy ooff tthhee PPoossttffiixx 22..55 ""ddeeaadd ddeessttiinnaattiioonn"" ddeetteeccttiioonn aallggoorriitthhmm
+
+We want to suspend deliveries to a specific destination after some number of
+deliveries suffers connection or handshake failure. The old scheduler declares
+a destination "dead" when negative (-1) feedback throttles the delivery
+concurrency down to zero. With less-than-1 feedback per delivery, this
+throttling down would obviously take too long. We therefore have to separate
+"dead destination" detection from concurrency feedback. This is implemented by
+introducing the concept of pseudo-cohort failure. The Postfix 2.5 concurrency
+scheduler declares a destination "dead" after a configurable number of pseudo-
+cohorts suffers from connection or handshake failures. The old scheduler
+corresponds to the special case where the pseudo-cohort failure limit is equal
+to 1.
+
+PPsseeuuddooccooddee ffoorr tthhee PPoossttffiixx 22..55 ccoonnccuurrrreennccyy sscchheedduulleerr
+
+The pseudo code shows how the ideas behind new concurrency scheduler are
+implemented as of November 2007. The actual code can be found in the module
+qmgr/qmgr_queue.c.
+
+Types:
+ Each destination has one set of the following variables
+ int concurrency
+ double success
+ double failure
+ double fail_cohorts
+
+Feedback functions:
+ N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
+ positive feedback: g(N) = x/N | x/sqrt(N) | x
+ negative feedback: f(N) = y/N | y/sqrt(N) | y
+
+Initialization:
+ concurrency = initial_concurrency
+ success = 0
+ failure = 0
+ fail_cohorts = 0
+
+After success:
+ fail_cohorts = 0
+ Be prepared for feedback > hysteresis, or rounding error
+ success += g(concurrency)
+ while (success >= 1) Hysteresis 1
+ concurrency += 1 Hysteresis 1
+ failure = 0
+ success -= 1 Hysteresis 1
+ Be prepared for overshoot
+ if (concurrency > concurrency limit)
+ concurrency = concurrency limit
+
+Safety:
+ Don't apply positive feedback unless
+ concurrency < busy_refcount + init_dest_concurrency
+ otherwise negative feedback effect could be delayed
+
+After failure:
+ if (concurrency > 0)
+ fail_cohorts += 1.0 / concurrency
+ if (fail_cohorts > cohort_failure_limit)
+ concurrency = 0
+ if (concurrency > 0)
+ Be prepared for feedback > hysteresis, rounding errors
+ failure -= f(concurrency)
+ while (failure < 0)
+ concurrency -= 1 Hysteresis 1
+ failure += 1 Hysteresis 1
+ success = 0
+ Be prepared for overshoot
+ if (concurrency < 1)
+ concurrency = 1
+
+RReessuullttss ffoorr ddeelliivveerryy ttoo ccoonnccuurrrreennccyy--lliimmiitteedd sseerrvveerrss
+
+Discussions about the concurrency scheduler redesign started early 2004, when
+the primary goal was to find alternatives that did not exhibit exponential
+growth or rapid concurrency throttling. No code was implemented until late
+2007, when the primary concern had shifted towards better handling of server
+concurrency limits. For this reason we measure how well the new scheduler does
+this job. The table below compares mail delivery performance of the old +/-
+1 feedback per delivery with several less-than-1 feedback functions, for
+different limited-concurrency server scenarios. Measurements were done with a
+FreeBSD 6.2 client and with FreeBSD 6.2 and various Linux servers.
+
+Server configuration:
+
+ * The mail flow was slowed down with 1 second latency per recipient
+ ("smtpd_client_restrictions = sleep 1"). The purpose was to make results
+ less dependent on hardware details, by avoiding slow-downs by queue file I/
+ O, logging I/O, and network I/O.
+ * Concurrency was limited by the server process limit ("default_process_limit
+ = 5" and "smtpd_client_event_limit_exceptions = static:all"). Postfix was
+ stopped and started after changing the process limit, because the same
+ number is also used as the backlog argument to the listen(2) system call,
+ and "postfix reload" does not re-issue this call.
+ * Mail was discarded with "local_recipient_maps = static:all" and
+ "local_transport = discard". The discard action in access maps or header/
+ body checks could not be used as it fails to update the in_flow_delay
+ counters.
+
+Client configuration:
+
+ * Queue file overhead was minimized by sending one message to a virtual alias
+ that expanded into 2000 different remote recipients. All recipients were
+ accounted for according to the maillog file. The
+ virtual_alias_expansion_limit setting was increased to avoid complaints
+ from the cleanup(8) server.
+ * The number of deliveries was maximized with
+ "smtp_destination_recipient_limit = 2". A smaller limit would cause Postfix
+ to schedule the concurrency per recipient instead of domain, which is not
+ what we want.
+ * Maximum concurrency was limited with "smtp_destination_concurrency_limit =
+ 20", and initial_destination_concurrency was set to the same value.
+ * The positive and negative concurrency feedback hysteresis was 1.
+ Concurrency was incremented by 1 at the END of 1/feedback steps of positive
+ feedback, and was decremented by 1 at the START of 1/feedback steps of
+ negative feedback.
+ * The SMTP client used the default 30s SMTP connect timeout and 300s SMTP
+ greeting timeout.
+
+IImmppaacctt ooff tthhee 3300ss SSMMTTPP ccoonnnneecctt ttiimmeeoouutt
+
+The first results are for a FreeBSD 6.2 server, where our artificially low
+listen(2) backlog results in a very short kernel queue for established
+connections. The table shows that all deferred deliveries failed due to a 30s
+connection timeout, and none failed due to a server greeting timeout. This
+measurement simulates what happens when the server's connection queue is
+completely full under load, and the TCP engine drops new connections.
+
+ cclliieenntt sseerrvveerr ffeeeeddbbaacckk ccoonnnneeccttiioonn ppeerrcceennttaaggee cclliieenntt ttiimmeedd--oouutt iinn
+ lliimmiitt lliimmiitt ssttyyllee ccaacchhiinngg ddeeffeerrrreedd ccoonnccuurrrreennccyy ccoonnnneecctt//
+ aavveerraaggee//ssttddddeevv ggrreeeettiinngg
+
+ -------------------------------------------------------------------------
+ 20 5 1/N no 9.9 19.4 0.49 198 -
+
+ 20 5 1/N yes 10.3 19.4 0.49 206 -
+
+ 20 5 1/sqrt(N) no 10.4 19.6 0.59 208 -
+
+ 20 5 1/sqrt(N) yes 10.6 19.6 0.61 212 -
+
+ 20 5 1 no 10.1 19.5 1.29 202 -
+
+ 20 5 1 yes 10.8 19.3 1.57 216 -
+
+ -------------------------------------------------------------------------
+
+ A busy server with a completely full connection queue. N is the client
+ delivery concurrency. Failed deliveries time out after 30s without
+ completing the TCP handshake. See text for a discussion of results.
+
+IImmppaacctt ooff tthhee 330000ss SSMMTTPP ggrreeeettiinngg ttiimmeeoouutt
+
+The next table shows results for a Fedora Core 8 server (results for RedHat 7.3
+are identical). In this case, the artificially small listen(2) backlog argument
+does not impact our measurement. The table shows that practically all deferred
+deliveries fail after the 300s SMTP greeting timeout. As these timeouts were
+10x longer than with the first measurement, we increased the recipient count
+(and thus the running time) by a factor of 10 to keep the results comparable.
+The deferred mail percentages are a factor 10 lower than with the first
+measurement, because the 1s per-recipient delay was 1/300th of the greeting
+timeout instead of 1/30th of the connection timeout.
+
+ cclliieenntt sseerrvveerr ffeeeeddbbaacckk ccoonnnneeccttiioonn ppeerrcceennttaaggee cclliieenntt ttiimmeedd--oouutt iinn
+ lliimmiitt lliimmiitt ssttyyllee ccaacchhiinngg ddeeffeerrrreedd ccoonnccuurrrreennccyy ccoonnnneecctt//
+ aavveerraaggee//ssttddddeevv ggrreeeettiinngg
+
+ -------------------------------------------------------------------------
+ 20 5 1/N no 1.16 19.8 0.37 - 230
+
+ 20 5 1/N yes 1.36 19.8 0.36 - 272
+
+ 20 5 1/sqrt(N) no 1.21 19.9 0.23 4 238
+
+ 20 5 1/sqrt(N) yes 1.36 20.0 0.23 - 272
+
+ 20 5 1 no 1.18 20.0 0.16 - 236
+
+ 20 5 1 yes 1.39 20.0 0.16 - 278
+
+ -------------------------------------------------------------------------
+
+ A busy server with a non-full connection queue. N is the client delivery
+ concurrency. Failed deliveries complete at the TCP level, but time out
+ after 300s while waiting for the SMTP greeting. See text for a discussion
+ of results.
+
+IImmppaacctt ooff aaccttiivvee sseerrvveerr ccoonnccuurrrreennccyy lliimmiitteerr
+
+The final concurrency-limited result shows what happens when SMTP connections
+don't time out, but are rejected immediately with the Postfix server's
+smtpd_client_connection_count_limit feature (the server replies with a 421
+status and disconnects immediately). Similar results can be expected with
+concurrency limiting features built into other MTAs or firewalls. For this
+measurement we specified a server concurrency limit and a client initial
+destination concurrency of 5, and a server process limit of 10; all other
+conditions were the same as with the first measurement. The same result would
+be obtained with a FreeBSD or Linux server, because the "pushing back" is done
+entirely by the receiving side.
+
+ cclliieenntt sseerrvveerr ffeeeeddbbaacckk ccoonnnneeccttiioonn ppeerrcceennttaaggee cclliieenntt tthheeoorreettiiccaall
+ lliimmiitt lliimmiitt ssttyyllee ccaacchhiinngg ddeeffeerrrreedd ccoonnccuurrrreennccyy ddeeffeerr rraattee
+ aavveerraaggee//ssttddddeevv
+
+ -------------------------------------------------------------------------
+ 20 5 1/N no 16.5 5.17 0.38 1/6
+
+ 20 5 1/N yes 16.5 5.17 0.38 1/6
+
+ 20 5 1/sqrt(N) no 24.5 5.28 0.45 1/4
+
+ 20 5 1/sqrt(N) yes 24.3 5.28 0.46 1/4
+
+ 20 5 1 no 49.7 5.63 0.67 1/2
+
+ 20 5 1 yes 49.7 5.68 0.70 1/2
+
+ -------------------------------------------------------------------------
+
+ A server with active per-client concurrency limiter that replies with 421
+ and disconnects. N is the client delivery concurrency. The theoretical
+ defer rate is 1/(1+roundup(1/feedback)). This is always 1/2 with the fixed
+ +/-1 feedback per delivery; with the concurrency-dependent feedback
+ variants, the defer rate decreases with increasing concurrency. See text
+ for a discussion of results.
+
+DDiissccuussssiioonn ooff ccoonnccuurrrreennccyy--lliimmiitteedd sseerrvveerr rreessuullttss
+
+All results in the previous sections are based on the first delivery runs only;
+they do not include any second etc. delivery attempts. It's also worth noting
+that the measurements look at steady-state behavior only. They don't show what
+happens when the client starts sending at a much higher or lower concurrency.
+
+The first two examples show that the effect of feedback is negligible when
+concurrency is limited due to congestion. This is because the initial
+concurrency is already at the client's concurrency maximum, and because there
+is 10-100 times more positive than negative feedback. Under these conditions,
+it is no surprise that the contribution from SMTP connection caching is also
+negligible.
+
+In the last example, the old +/-1 feedback per delivery will defer 50% of the
+mail when confronted with an active (anvil-style) server concurrency limit,
+where the server hangs up immediately with a 421 status (a TCP-level RST would
+have the same result). Less aggressive feedback mechanisms fare better than
+more aggressive ones. Concurrency-dependent feedback fares even better at
+higher concurrencies than shown here, but has limitations as discussed in the
+next section.
+
+LLiimmiittaattiioonnss ooff lleessss--tthhaann--11 ppeerr ddeelliivveerryy ffeeeeddbbaacckk
+
+Less-than-1 feedback is of interest primarily when sending large amounts of
+mail to destinations with active concurrency limiters (servers that reply with
+421, or firewalls that send RST). When sending small amounts of mail per
+destination, less-than-1 per-delivery feedback won't have a noticeable effect
+on the per-destination concurrency, because the number of deliveries to the
+same destination is too small. You might just as well use zero per-delivery
+feedback and stay with the initial per-destination concurrency. And when mail
+deliveries fail due to congestion instead of active concurrency limiters, the
+measurements above show that per-delivery feedback has no effect. With large
+amounts of mail you might just as well use zero per-delivery feedback and start
+with the maximal per-destination concurrency.
+
+The scheduler with less-than-1 concurrency feedback per delivery solves a
+problem with servers that have active concurrency limiters. This works only
+because feedback is handled in a peculiar manner: positive feedback will
+increment the concurrency by 1 at the eenndd of a sequence of events of length 1/
+feedback, while negative feedback will decrement concurrency by 1 at the
+bbeeggiinnnniinngg of such a sequence. This is how Postfix adjusts quickly for overshoot
+without causing lots of mail to be deferred. Without this difference in
+feedback treatment, less-than-1 feedback per delivery would defer 50% of the
+mail, and would be no better in this respect than the old +/-1 feedback per
+delivery.
+
+Unfortunately, the same feature that corrects quickly for concurrency overshoot
+also makes the scheduler more sensitive for noisy negative feedback. The reason
+is that one lonely negative feedback event has the same effect as a complete
+sequence of length 1/feedback: in both cases delivery concurrency is dropped by
+1 immediately. As a worst-case scenario, consider multiple servers behind a
+load balancer on a single IP address, and no backup MX address. When 1 out of K
+servers fails to complete the SMTP handshake or drops the connection, a
+scheduler with 1/N (N = concurrency) feedback stops increasing its concurrency
+once it reaches a concurrency level of about K, even though the good servers
+behind the load balancer are perfectly capable of handling more traffic.
+
+This noise problem gets worse as the amount of positive feedback per delivery
+gets smaller. A compromise is to use fixed less-than-1 positive feedback values
+instead of concurrency-dependent positive feedback. For example, to tolerate 1
+of 4 bad servers in the above load balancer scenario, use positive feedback of
+1/4 per "good" delivery (no connect or handshake error), and use an equal or
+smaller amount of negative feedback per "bad" delivery. The downside of using
+concurrency-independent feedback is that some of the old +/-1 feedback problems
+will return at large concurrencies. Sites that must deliver mail at non-trivial
+per-destination concurrencies will require special configuration.
+
+CCoonnccuurrrreennccyy ccoonnffiigguurraattiioonn ppaarraammeetteerrss
+
+The Postfix 2.5 concurrency scheduler is controlled with the following
+configuration parameters, where "transport_foo" provides a transport-specific
+parameter override. All parameter default settings are compatible with earlier
+Postfix versions.
+
+ PPaarraammeetteerr nnaammee PPoossttffiixx DDeessccrriippttiioonn
+ vveerrssiioonn
+
+ ---------------------------------------------------------------------------
+ Initial per-
+ initial_destination_concurrency all destination
+ transport_initial_destination_concurrency 2.5 delivery
+ concurrency
+
+ Maximum per-
+ default_destination_concurrency_limit all destination
+ transport_destination_concurrency_limit all delivery
+ concurrency
+
+ Per-
+ destination
+ positive
+ feedback
+ default_destination_concurrency_positive_feedback 2.5 amount, per
+ transport_destination_concurrency_positive_feedback 2.5 delivery that
+ does not fail
+ with
+ connection or
+ handshake
+ failure
+
+ Per-
+ destination
+ negative
+ feedback
+ default_destination_concurrency_negative_feedback 2.5 amount, per
+ transport_destination_concurrency_negative_feedback 2.5 delivery that
+ fails with
+ connection or
+ handshake
+ failure
+
+ Number of
+ failed
+ pseudo-
+ cohorts after
+ default_destination_concurrency_failed_cohort_limit 2.5 which a
+ transport_destination_concurrency_failed_cohort_limit 2.5 destination
+ is declared
+ "dead" and
+ delivery is
+ suspended
+
+ Enable
+ verbose
+ destination_concurrency_feedback_debug 2.5 logging of
+ concurrency
+ scheduler
+ activity
+
+ ---------------------------------------------------------------------------
+
+PPrreeeemmppttiivvee sscchheedduulliinngg
+
+The following sections describe the new queue manager and its preemptive
+scheduler algorithm. Note that the document was originally written to describe
+the changes between the new queue manager (in this text referred to as nqmgr,
+the name it was known by before it became the default queue manager) and the
+old queue manager (referred to as oqmgr). This is why it refers to oqmgr every
+so often.
+
+This document is divided into sections as follows:
+
+ * The structures used by nqmgr
+ * What happens when nqmgr picks up the message - how it is assigned to
+ transports, jobs, peers, entries
+ * How the entry selection works
+ * How the preemption works - what messages may be preempted and how and what
+ messages are chosen to preempt them
+ * How destination concurrency limits affect the scheduling algorithm
+ * Dealing with memory resource limits
+
+TThhee ssttrruuccttuurreess uusseedd bbyy nnqqmmggrr
+
+Let's start by recapitulating the structures and terms used when referring to
+queue manager and how it operates. Many of these are partially described
+elsewhere, but it is nice to have a coherent overview in one place:
+
+ * Each message structure represents one mail message which Postfix is to
+ deliver. The message recipients specify to what destinations is the message
+ to be delivered and what transports are going to be used for the delivery.
+
+ * Each recipient entry groups a batch of recipients of one message which are
+ all going to be delivered to the same destination (and over the same
+ transport).
+
+ * Each transport structure groups everything what is going to be delivered by
+ delivery agents dedicated for that transport. Each transport maintains a
+ set of queues (describing the destinations it shall talk to) and jobs
+ (referencing the messages it shall deliver).
+
+ * Each transport queue (not to be confused with the on-disk active queue or
+ incoming queue) groups everything what is going be delivered to given
+ destination (aka nexthop) by its transport. Each queue belongs to one
+ transport, so each destination may be referred to by several queues, one
+ for each transport. Each queue maintains a list of all recipient entries
+ (batches of message recipients) which shall be delivered to given
+ destination (the todo list), and a list of recipient entries already being
+ delivered by the delivery agents (the busy list).
+
+ * Each queue corresponds to multiple peer structures. Each peer structure is
+ like the queue structure, belonging to one transport and referencing one
+ destination. The difference is that it lists only the recipient entries
+ which all originate from the same message, unlike the queue structure,
+ whose entries may originate from various messages. For messages with few
+ recipients, there is usually just one recipient entry for each destination,
+ resulting in one recipient entry per peer. But for large mailing list
+ messages the recipients may need to be split to multiple recipient entries,
+ in which case the peer structure may list many entries for single
+ destination.
+
+ * Each transport job groups everything it takes to deliver one message via
+ its transport. Each job represents one message within the context of the
+ transport. The job belongs to one transport and message, so each message
+ may have multiple jobs, one for each transport. The job groups all the peer
+ structures, which describe the destinations the job's message has to be
+ delivered to.
+
+The first four structures are common to both nqmgr and oqmgr, the latter two
+were introduced by nqmgr.
+
+These terms are used extensively in the text below, feel free to look up the
+description above anytime you'll feel you have lost a sense what is what.
+
+WWhhaatt hhaappppeennss wwhheenn nnqqmmggrr ppiicckkss uupp tthhee mmeessssaaggee
+
+Whenever nqmgr moves a queue file into the active queue, the following happens:
+It reads all necessary information from the queue file as oqmgr does, and also
+reads as many recipients as possible - more on that later, for now let's just
+pretend it always reads all recipients.
+
+Then it resolves the recipients as oqmgr does, which means obtaining (address,
+nexthop, transport) triple for each recipient. For each triple, it finds the
+transport; if it does not exist yet, it instantiates it (unless it's dead).
+Within the transport, it finds the destination queue for given nexthop; if it
+does not exist yet, it instantiates it (unless it's dead). The triple is then
+bound to given destination queue. This happens in qmgr_resolve() and is
+basically the same as in oqmgr.
+
+Then for each triple which was bound to some queue (and thus transport), the
+program finds the job which represents the message within that transport's
+context; if it does not exist yet, it instantiates it. Within the job, it finds
+the peer which represents the bound destination queue within this jobs context;
+if it does not exist yet, it instantiates it. Finally, it stores the address
+from the resolved triple to the recipient entry which is appended to both the
+queue entry list and the peer entry list. The addresses for same nexthop are
+batched in the entries up to recipient_concurrency limit for that transport.
+This happens in qmgr_assign() and apart from that it operates with job and peer
+structures it is basically the same as in oqmgr.
+
+When the job is instantiated, it is enqueued on the transport's job list based
+on the time its message was picked up by nqmgr. For first batch of recipients
+this means it is appended to the end of the job list, but the ordering of the
+job list by the enqueue time is important as we will see shortly.
+
+[Now you should have pretty good idea what is the state of the nqmgr after
+couple of messages was picked up, what is the relation between all those job,
+peer, queue and entry structures.]
+
+HHooww tthhee eennttrryy sseelleeccttiioonn wwoorrkkss
+
+Having prepared all those above mentioned structures, the task of the nqmgr's
+scheduler is to choose the recipient entries one at a time and pass them to the
+delivery agent for corresponding transport. Now how does this work?
+
+The first approximation of the new scheduling algorithm is like this:
+
+ foreach transport (round-robin-by-transport)
+ do
+ if transport busy continue
+ if transport process limit reached continue
+ foreach transport's job (in the order of the transport's job list)
+ do
+ foreach job's peer (round-robin-by-destination)
+ if peer->queue->concurrency < peer->queue->window
+ return next peer entry.
+ done
+ done
+ done
+
+Now what is the "order of the transport's job list"? As we know already, the
+job list is by default kept in the order the message was picked up by the
+nqmgr. So by default we get the top-level round-robin transport, and within
+each transport we get the FIFO message delivery. The round-robin of the peers
+by the destination is perhaps of little importance in most real-life cases
+(unless the recipient_concurrency limit is reached, in one job there is only
+one peer structure for each destination), but theoretically it makes sure that
+even within single jobs, destinations are treated fairly.
+
+[By now you should have a feeling you really know how the scheduler works,
+except for the preemption, under ideal conditions - that is, no recipient
+resource limits and no destination concurrency problems.]
+
+HHooww tthhee pprreeeemmppttiioonn wwoorrkkss
+
+As you might perhaps expect by now, the transport's job list does not remain
+sorted by the job's message enqueue time all the time. The most cool thing
+about nqmgr is not the simple FIFO delivery, but that it is able to slip mail
+with little recipients past the mailing-list bulk mail. This is what the job
+preemption is about - shuffling the jobs on the transport's job list to get the
+best message delivery rates. Now how is it achieved?
+
+First I have to tell you that there are in fact two job lists in each
+transport. One is the scheduler's job list, which the scheduler is free to play
+with, while the other one keeps the jobs always listed in the order of the
+enqueue time and is used for recipient pool management we will discuss later.
+For now, we will deal with the scheduler's job list only.
+
+So, we have the job list, which is first ordered by the time the jobs' messages
+were enqueued, oldest messages first, the most recently picked one at the end.
+For now, let's assume that there are no destination concurrency problems.
+Without preemption, we pick some entry of the first (oldest) job on the queue,
+assign it to delivery agent, pick another one from the same job, assign it
+again, and so on, until all the entries are used and the job is delivered. We
+would then move onto the next job and so on and on. Now how do we manage to
+sneak in some entries from the recently added jobs when the first job on the
+job list belongs to a message going to the mailing-list and has thousands of
+recipient entries?
+
+The nqmgr's answer is that we can artificially "inflate" the delivery time of
+that first job by some constant for free - it is basically the same trick you
+might remember as "accumulation of potential" from the amortized complexity
+lessons. For example, instead of delivering the entries of the first job on the
+job list every time a delivery agent becomes available, we can do it only every
+second time. If you view the moments the delivery agent becomes available on a
+timeline as "delivery slots", then instead of using every delivery slot for the
+first job, we can use only every other slot, and still the overall delivery
+efficiency of the first job remains the same. So the delivery 11112222 becomes
+1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . denotes the free
+slot). Now what do we do with free slots?
+
+As you might have guessed, we will use them for sneaking the mail with little
+recipients in. For example, if we have one four-recipient mail followed by four
+one recipients mail, the delivery sequence (that is, the sequence in which the
+jobs are assigned to the delivery slots) might look like this: 12131415. Hmm,
+fine for sneaking in the single recipient mail, but how do we sneak in the mail
+with more than one recipient? Say if we have one four-recipient mail followed
+by two two-recipient mails?
+
+The simple answer would be to use delivery sequence 12121313. But the problem
+is that this does not scale well. Imagine you have mail with thousand
+recipients followed by mail with hundred recipients. It is tempting to suggest
+the delivery sequence like 121212...., but alas! Imagine there arrives another
+mail with say ten recipients. But there are no free slots anymore, so it can't
+slip by, not even if it had just only one recipients. It will be stuck until
+the hundred-recipient mail is delivered, which really sucks.
+
+So, it becomes obvious that while inflating the message to get free slots is
+great idea, one has to be really careful of how the free slots are assigned,
+otherwise one might corner himself. So, how does nqmgr really use the free
+slots?
+
+The key idea is that one does not have to generate the free slots in a uniform
+way. The delivery sequence 111...1 is no worse than 1.1.1.1, in fact, it is
+even better as some entries are in the first case selected earlier than in the
+second case, and none is selected later! So it is possible to first
+"accumulate" the free delivery slots and then use them all at once. It is even
+possible to accumulate some, then use them, then accumulate some more and use
+them again, as in 11..1.1 .
+
+Let's get back to the one hundred recipient example. We now know that we could
+first accumulate one hundred free slots, and only after then to preempt the
+first job and sneak the one hundred recipient mail in. Applying the algorithm
+recursively, we see the hundred recipient job can accumulate ten free delivery
+slots, and then we could preempt it and sneak in the ten-recipient mail... Wait
+wait wait! Could we? Aren't we overinflating the original one thousand
+recipient mail?
+
+Well, despite it looks so at the first glance, another trick will allow us to
+answer "no, we are not!". If we had said that we will inflate the delivery time
+twice at maximum, and then we consider every other slot as a free slot, then we
+would overinflate in case of the recursive preemption. BUT! The trick is that
+if we use only every n-th slot as a free slot for n>2, there is always some
+worst inflation factor which we can guarantee not to be breached, even if we
+apply the algorithm recursively. To be precise, if for every k>1 normally used
+slots we accumulate one free delivery slot, than the inflation factor is not
+worse than k/(k-1) no matter how many recursive preemptions happen. And it's
+not worse than (k+1)/k if only non-recursive preemption happens. Now, having
+got through the theory and the related math, let's see how nqmgr implements
+this.
+
+Each job has so called "available delivery slot" counter. Each transport has a
+transport_delivery_slot_cost parameter, which defaults to
+default_delivery_slot_cost parameter which is set to 5 by default. This is the
+k from the paragraph above. Each time k entries of the job are selected for
+delivery, this counter is incremented by one. Once there are some slots
+accumulated, job which requires no more than that number of slots to be fully
+delivered can preempt this job.
+
+[Well, the truth is, the counter is incremented every time an entry is selected
+and it is divided by k when it is used. But for the understanding it's good
+enough to use the above approximation of the truth.]
+
+OK, so now we know the conditions which must be satisfied so one job can
+preempt another one. But what job gets preempted, how do we choose what job
+preempts it if there are several valid candidates, and when does all this
+exactly happen?
+
+The answer for the first part is simple. The job whose entry was selected the
+last time is so called current job. Normally, it is the first job on the
+scheduler's job list, but destination concurrency limits may change this as we
+will see later. It is always only the current job which may get preempted.
+
+Now for the second part. The current job has certain amount of recipient
+entries, and as such may accumulate at maximum some amount of available
+delivery slots. It might have already accumulated some, and perhaps even
+already used some when it was preempted before (remember a job can be preempted
+several times). In either case, we know how many are accumulated and how many
+are left to deliver, so we know how many it may yet accumulate at maximum.
+Every other job which may be delivered by less than that number of slots is a
+valid candidate for preemption. How do we choose among them?
+
+The answer is - the one with maximum enqueue_time/recipient_entry_count. That
+is, the older the job is, the more we should try to deliver it in order to get
+best message delivery rates. These rates are of course subject to how many
+recipients the message has, therefore the division by the recipient (entry)
+count. No one shall be surprised that message with n recipients takes n times
+longer to deliver than message with one recipient.
+
+Now let's recap the previous two paragraphs. Isn't it too complicated? Why
+don't the candidates come only among the jobs which can be delivered within the
+number of slots the current job already accumulated? Why do we need to estimate
+how much it has yet to accumulate? If you found out the answer, congratulate
+yourself. If we did it this simple way, we would always choose the candidate
+with least recipient entries. If there were enough single recipient mails
+coming in, they would always slip by the bulk mail as soon as possible, and the
+two and more recipients mail would never get a chance, no matter how long they
+have been sitting around in the job list.
+
+This candidate selection has interesting implication - that when we choose the
+best candidate for preemption (this is done in qmgr_choose_candidate()), it may
+happen that we may not use it for preemption immediately. This leads to an
+answer to the last part of the original question - when does the preemption
+happen?
+
+The preemption attempt happens every time next transport's recipient entry is
+to be chosen for delivery. To avoid needless overhead, the preemption is not
+attempted if the current job could never accumulate more than
+transport_minimum_delivery_slots (defaults to default_minimum_delivery_slots
+which defaults to 3). If there is already enough accumulated slots to preempt
+the current job by the chosen best candidate, it is done immediately. This
+basically means that the candidate is moved in front of the current job on the
+scheduler's job list and decreasing the accumulated slot counter by the amount
+used by the candidate. If there is not enough slots... well, I could say that
+nothing happens and the another preemption is attempted the next time. But
+that's not the complete truth.
+
+The truth is that it turns out that it is not really necessary to wait until
+the jobs counter accumulates all the delivery slots in advance. Say we have
+ten-recipient mail followed by two two-recipient mails. If the preemption
+happened when enough delivery slot accumulate (assuming slot cost 2), the
+delivery sequence becomes 11112211113311. Now what would we get if we would
+wait only for 50% of the necessary slots to accumulate and we promise we would
+wait for the remaining 50% later, after we get back to the preempted job? If we
+use such slot loan, the delivery sequence becomes 11221111331111. As we can
+see, it makes it no considerably worse for the delivery of the ten-recipient
+mail, but it allows the small messages to be delivered sooner.
+
+The concept of these slot loans is where the transport_delivery_slot_discount
+and transport_delivery_slot_loan come from (they default to
+default_delivery_slot_discount and default_delivery_slot_loan, whose values are
+by default 50 and 3, respectively). The discount (resp. loan) specifies how
+many percent (resp. how many slots) one "gets in advance", when the number of
+slots required to deliver the best candidate is compared with the number of
+slots the current slot had accumulated so far.
+
+And it pretty much concludes this chapter.
+
+[Now you should have a feeling that you pretty much understand the scheduler
+and the preemption, or at least that you will have it after you read the last
+chapter couple more times. You shall clearly see the job list and the
+preemption happening at its head, in ideal delivery conditions. The feeling of
+understanding shall last until you start wondering what happens if some of the
+jobs are blocked, which you might eventually figure out correctly from what had
+been said already. But I would be surprised if your mental image of the
+scheduler's functionality is not completely shattered once you start wondering
+how it works when not all recipients may be read in-core. More on that later.]
+
+HHooww ddeessttiinnaattiioonn ccoonnccuurrrreennccyy lliimmiittss aaffffeecctt tthhee sscchheedduulliinngg aallggoorriitthhmm
+
+The nqmgr uses the same algorithm for destination concurrency control as oqmgr.
+Now what happens when the destination limits are reached and no more entries
+for that destination may be selected by the scheduler?
+
+From user's point of view it is all simple. If some of the peers of a job can't
+be selected, those peers are simply skipped by the entry selection algorithm
+(the pseudo-code described before) and only the selectable ones are used. If
+none of the peers may be selected, the job is declared a "blocker job". Blocker
+jobs are skipped by the entry selection algorithm and they are also excluded
+from the candidates for preemption of current job. Thus the scheduler
+effectively behaves as if the blocker jobs didn't exist on the job list at all.
+As soon as at least one of the peers of a blocker job becomes unblocked (that
+is, the delivery agent handling the delivery of the recipient entry for given
+destination successfully finishes), the job's blocker status is removed and the
+job again participates in all further scheduler actions normally.
+
+So the summary is that the users don't really have to be concerned about the
+interaction of the destination limits and scheduling algorithm. It works well
+on its own and there are no knobs they would need to control it.
+
+From a programmer's point of view, the blocker jobs complicate the scheduler
+quite a lot. Without them, the jobs on the job list would be normally delivered
+in strict FIFO order. If the current job is preempted, the job preempting it is
+completely delivered unless it is preempted itself. Without blockers, the
+current job is thus always either the first job on the job list, or the top of
+the stack of jobs preempting the first job on the job list.
+
+The visualization of the job list and the preemption stack without blockers
+would be like this:
+
+ first job-> 1--2--3--5--6--8--... <- job list
+ on job list |
+ 4 <- preemption stack
+ |
+ current job-> 7
+
+In the example above we see that job 1 was preempted by job 4 and then job 4
+was preempted by job 7. After job 7 is completed, remaining entries of job 4
+are selected, and once they are all selected, job 1 continues.
+
+As we see, it's all very clean and straightforward. Now how does this change
+because of blockers?
+
+The answer is: a lot. Any job may become blocker job at any time, and also
+become normal job again at any time. This has several important implications:
+
+ 1. The jobs may be completed in arbitrary order. For example, in the example
+ above, if the current job 7 becomes blocked, the next job 4 may complete
+ before the job 7 becomes unblocked again. Or if both 7 and 4 are blocked,
+ then 1 is completed, then 7 becomes unblocked and is completed, then 2 is
+ completed and only after that 4 becomes unblocked and is completed... You
+ get the idea.
+
+ [Interesting side note: even when jobs are delivered out of order, from
+ single destination's point of view the jobs are still delivered in the
+ expected order (that is, FIFO unless there was some preemption involved).
+ This is because whenever a destination queue becomes unblocked (the
+ destination limit allows selection of more recipient entries for that
+ destination), all jobs which have peers for that destination are unblocked
+ at once.]
+
+ 2. The idea of the preemption stack at the head of the job list is gone. That
+ is, it must be possible to preempt any job on the job list. For example, if
+ the jobs 7, 4, 1 and 2 in the example above become all blocked, job 3
+ becomes the current job. And of course we do not want the preemption to be
+ affected by the fact that there are some blocked jobs or not. Therefore, if
+ it turns out that job 3 might be preempted by job 6, the implementation
+ shall make it possible.
+
+ 3. The idea of the linear preemption stack itself is gone. It's no longer true
+ that one job is always preempted by only one job at one time (that is
+ directly preempted, not counting the recursively nested jobs). For example,
+ in the example above, job 1 is directly preempted by only job 4, and job 4
+ by job 7. Now assume job 7 becomes blocked, and job 4 is being delivered.
+ If it accumulates enough delivery slots, it is natural that it might be
+ preempted for example by job 8. Now job 4 is preempted by both job 7 AND
+ job 8 at the same time.
+
+Now combine the points 2) and 3) with point 1) again and you realize that the
+relations on the once linear job list became pretty complicated. If we extend
+the point 3) example: jobs 7 and 8 preempt job 4, now job 8 becomes blocked
+too, then job 4 completes. Tricky, huh?
+
+If I illustrate the relations after the above mentioned examples (but those in
+point 1)), the situation would look like this:
+
+ v- parent
+
+ adoptive parent -> 1--2--3--5--... <- "stack" level 0
+ | |
+ parent gone -> ? 6 <- "stack" level 1
+ / \
+ children -> 7 8 ^- child <- "stack" level 2
+
+ ^- siblings
+
+Now how does nqmgr deal with all these complicated relations?
+
+Well, it maintains them all as described, but fortunately, all these relations
+are necessary only for purposes of proper counting of available delivery slots.
+For purposes of ordering the jobs for entry selection, the original rule still
+applies: "the job preempting the current job is moved in front of the current
+job on the job list". So for entry selection purposes, the job relations remain
+as simple as this:
+
+ 7--8--1--2--6--3--5--.. <- scheduler's job list order
+
+The job list order and the preemption parent/child/siblings relations are
+maintained separately. And because the selection works only with the job list,
+you can happily forget about those complicated relations unless you want to
+study the nqmgr sources. In that case the text above might provide some helpful
+introduction to the problem domain. Otherwise I suggest you just forget about
+all this and stick with the user's point of view: the blocker jobs are simply
+ignored.
+
+[By now, you should have a feeling that there is more things going under the
+hood than you ever wanted to know. You decide that forgetting about this
+chapter is the best you can do for the sake of your mind's health and you
+basically stick with the idea how the scheduler works in ideal conditions, when
+there are no blockers, which is good enough.]
+
+DDeeaalliinngg wwiitthh mmeemmoorryy rreessoouurrccee lliimmiittss
+
+When discussing the nqmgr scheduler, we have so far assumed that all recipients
+of all messages in the active queue are completely read into the memory. This
+is simply not true. There is an upper bound on the amount of memory the nqmgr
+may use, and therefore it must impose some limits on the information it may
+store in the memory at any given time.
+
+First of all, not all messages may be read in-core at once. At any time, only
+qmgr_message_active_limit messages may be read in-core at maximum. When read
+into memory, the messages are picked from the incoming and deferred message
+queues and moved to the active queue (incoming having priority), so if there is
+more than qmgr_message_active_limit messages queued in the active queue, the
+rest will have to wait until (some of) the messages in the active queue are
+completely delivered (or deferred).
+
+Even with the limited amount of in-core messages, there is another limit which
+must be imposed in order to avoid memory exhaustion. Each message may contain
+huge amount of recipients (tens or hundreds of thousands are not uncommon), so
+if nqmgr read all recipients of all messages in the active queue, it may easily
+run out of memory. Therefore there must be some upper bound on the amount of
+message recipients which are read into the memory at the same time.
+
+Before discussing how exactly nqmgr implements the recipient limits, let's see
+how the sole existence of the limits themselves affects the nqmgr and its
+scheduler.
+
+The message limit is straightforward - it just limits the size of the lookahead
+the nqmgr's scheduler has when choosing which message can preempt the current
+one. Messages not in the active queue simply are not considered at all.
+
+The recipient limit complicates more things. First of all, the message reading
+code must support reading the recipients in batches, which among other things
+means accessing the queue file several times and continuing where the last
+recipient batch ended. This is invoked by the scheduler whenever the current
+job has space for more recipients, subject to transport's refill_limit and
+refill_delay parameters. It is also done any time when all in-core recipients
+of the message are dealt with (which may also mean they were deferred) but
+there are still more in the queue file.
+
+The second complication is that with some recipients left unread in the queue
+file, the scheduler can't operate with exact counts of recipient entries. With
+unread recipients, it is not clear how many recipient entries there will be, as
+they are subject to per-destination grouping. It is not even clear to what
+transports (and thus jobs) the recipients will be assigned. And with messages
+coming from the deferred queue, it is not even clear how many unread recipients
+are still to be delivered. This all means that the scheduler must use only
+estimates of how many recipients entries there will be. Fortunately, it is
+possible to estimate the minimum and maximum correctly, so the scheduler can
+always err on the safe side. Obviously, the better the estimates, the better
+results, so it is best when we are able to read all recipients in-core and turn
+the estimates into exact counts, or at least try to read as many as possible to
+make the estimates as accurate as possible.
+
+The third complication is that it is no longer true that the scheduler is done
+with a job once all of its in-core recipients are delivered. It is possible
+that the job will be revived later, when another batch of recipients is read in
+core. It is also possible that some jobs will be created for the first time
+long after the first batch of recipients was read in core. The nqmgr code must
+be ready to handle all such situations.
+
+And finally, the fourth complication is that the nqmgr code must somehow impose
+the recipient limit itself. Now how does it achieve it?
+
+Perhaps the easiest solution would be to say that each message may have at
+maximum X recipients stored in-core, but such solution would be poor for
+several reasons. With reasonable qmgr_message_active_limit values, the X would
+have to be quite low to maintain reasonable memory footprint. And with low X
+lots of things would not work well. The nqmgr would have problems to use the
+transport_destination_recipient_limit efficiently. The scheduler's preemption
+would be suboptimal as the recipient count estimates would be inaccurate. The
+message queue file would have to be accessed many times to read in more
+recipients again and again.
+
+Therefore it seems reasonable to have a solution which does not use a limit
+imposed on per-message basis, but which maintains a pool of available recipient
+slots, which can be shared among all messages in the most efficient manner. And
+as we do not want separate transports to compete for resources whenever
+possible, it seems appropriate to maintain such recipient pool for each
+transport separately. This is the general idea, now how does it work in
+practice?
+
+First we have to solve little chicken-and-egg problem. If we want to use the
+per-transport recipient pools, we first need to know to what transport(s) is
+the message assigned. But we will find that out only after we read in the
+recipients first. So it is obvious that we first have to read in some
+recipients, use them to find out to what transports is the message to be
+assigned, and only after that we can use the per-transport recipient pools.
+
+Now how many recipients shall we read for the first time? This is what
+qmgr_message_recipient_minimum and qmgr_message_recipient_limit values control.
+The qmgr_message_recipient_minimum value specifies how many recipients of each
+message we will read for the first time, no matter what. It is necessary to
+read at least one recipient before we can assign the message to a transport and
+create the first job. However, reading only qmgr_message_recipient_minimum
+recipients even if there are only few messages with few recipients in-core
+would be wasteful. Therefore if there is less than qmgr_message_recipient_limit
+recipients in-core so far, the first batch of recipients may be larger than
+qmgr_message_recipient_minimum - as large as is required to reach the
+qmgr_message_recipient_limit limit.
+
+Once the first batch of recipients was read in core and the message jobs were
+created, the size of the subsequent recipient batches (if any - of course it's
+best when all recipients are read in one batch) is based solely on the position
+of the message jobs on their corresponding transports' job lists. Each
+transport has a pool of transport_recipient_limit recipient slots which it can
+distribute among its jobs (how this is done is described later). The subsequent
+recipient batch may be as large as the sum of all recipient slots of all jobs
+of the message permits (plus the qmgr_message_recipient_minimum amount which
+always applies).
+
+For example, if a message has three jobs, first with 1 recipient still in-core
+and 4 recipient slots, second with 5 recipient in-core and 5 recipient slots,
+and third with 2 recipients in-core and 0 recipient slots, it has 1+5+2=7
+recipients in-core and 4+5+0=9 jobs' recipients slots in total. This means that
+we could immediately read 2+qmgr_message_recipient_minimum more recipients of
+that message in core.
+
+The above example illustrates several things which might be worth mentioning
+explicitly: first, note that although the per-transport slots are assigned to
+particular jobs, we can't guarantee that once the next batch of recipients is
+read in core, that the corresponding amounts of recipients will be assigned to
+those jobs. The jobs lend its slots to the message as a whole, so it is
+possible that some jobs end up sponsoring other jobs of their message. For
+example, if in the example above the 2 newly read recipients were assigned to
+the second job, the first job sponsored the second job with 2 slots. The second
+notable thing is the third job, which has more recipients in-core than it has
+slots. Apart from sponsoring by other job we just saw it can be result of the
+first recipient batch, which is sponsored from global recipient pool of
+qmgr_message_recipient_limit recipients. It can be also sponsored from the
+message recipient pool of qmgr_message_recipient_minimum recipients.
+
+Now how does each transport distribute the recipient slots among its jobs? The
+strategy is quite simple. As most scheduler activity happens on the head of the
+job list, it is our intention to make sure that the scheduler has the best
+estimates of the recipient counts for those jobs. As we mentioned above, this
+means that we want to try to make sure that the messages of those jobs have all
+recipients read in-core. Therefore the transport distributes the slots "along"
+the job list from start to end. In this case the job list sorted by message
+enqueue time is used, because it doesn't change over time as the scheduler's
+job list does.
+
+More specifically, each time a job is created and appended to the job list, it
+gets all unused recipient slots from its transport's pool. It keeps them until
+all recipients of its message are read. When this happens, all unused recipient
+slots are transferred to the next job (which is now in fact now first such job)
+on the job list which still has some recipients unread, or eventually back to
+the transport pool if there is no such job. Such transfer then also happens
+whenever a recipient entry of that job is delivered.
+
+There is also a scenario when a job is not appended to the end of the job list
+(for example it was created as a result of second or later recipient batch).
+Then it works exactly as above, except that if it was put in front of the first
+unread job (that is, the job of a message which still has some unread
+recipients in queue file), that job is first forced to return all of its unused
+recipient slots to the transport pool.
+
+The algorithm just described leads to the following state: The first unread job
+on the job list always gets all the remaining recipient slots of that transport
+(if there are any). The jobs queued before this job are completely read (that
+is, all recipients of their message were already read in core) and have at
+maximum as many slots as they still have recipients in-core (the maximum is
+there because of the sponsoring mentioned before) and the jobs after this job
+get nothing from the transport recipient pool (unless they got something before
+and then the first unread job was created and enqueued in front of them later -
+in such case the also get at maximum as many slots as they have recipients in-
+core).
+
+Things work fine in such state for most of the time, because the current job is
+either completely read in-core or has as much recipient slots as there are, but
+there is one situation which we still have to take care of specially. Imagine
+if the current job is preempted by some unread job from the job list and there
+are no more recipient slots available, so this new current job could read only
+batches of qmgr_message_recipient_minimum recipients at a time. This would
+really degrade performance. For this reason, each transport has extra pool of
+transport_extra_recipient_limit recipient slots, dedicated exactly for this
+situation. Each time an unread job preempts the current job, it gets half of
+the remaining recipient slots from the normal pool and this extra pool.
+
+And that's it. It sure does sound pretty complicated, but fortunately most
+people don't really have to care how exactly it works as long as it works.
+Perhaps the only important things to know for most people are the following
+upper bound formulas:
+
+Each transport has at maximum
+
+ max(
+ qmgr_message_recipient_minimum * qmgr_message_active_limit
+ + *_recipient_limit + *_extra_recipient_limit,
+ qmgr_message_recipient_limit
+ )
+
+recipients in core.
+
+The total amount of recipients in core is
+
+ max(
+ qmgr_message_recipient_minimum * qmgr_message_active_limit
+ + sum( *_recipient_limit + *_extra_recipient_limit ),
+ qmgr_message_recipient_limit
+ )
+
+where the sum is over all used transports.
+
+And this terribly complicated chapter concludes the documentation of nqmgr
+scheduler.
+
+[By now you should theoretically know the nqmgr scheduler inside out. In
+practice, you still hope that you will never have to really understand the last
+or last two chapters completely, and fortunately most people really won't.
+Understanding how the scheduler works in ideal conditions is more than good
+enough for vast majority of users.]
+
+CCrreeddiittss
+
+ * Wietse Venema designed and implemented the initial queue manager with per-
+ domain FIFO scheduling, and per-delivery +/-1 concurrency feedback.
+ * Patrik Rak designed and implemented preemption where mail with fewer
+ recipients can slip past mail with more recipients in a controlled manner,
+ and wrote up its documentation.
+ * Wietse Venema initiated a discussion with Patrik Rak and Victor Duchovni on
+ alternatives for the +/-1 feedback scheduler's aggressive behavior. This is
+ when K/N feedback was reviewed (N = concurrency). The discussion ended
+ without a good solution for both negative feedback and dead site detection.
+ * Victor Duchovni resumed work on concurrency feedback in the context of
+ concurrency-limited servers.
+ * Wietse Venema then re-designed the concurrency scheduler in terms of the
+ simplest possible concepts: less-than-1 concurrency feedback per delivery,
+ forward and reverse concurrency feedback hysteresis, and pseudo-cohort
+ failure. At this same time, concurrency feedback was separated from dead
+ site detection.
+ * These simplifications, and their modular implementation, helped to develop
+ further insights into the different roles that positive and negative
+ concurrency feedback play, and helped to identify some worst-case
+ scenarios.
+