From b7c15c31519dc44c1f691e0466badd556ffe9423 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 Apr 2024 18:18:56 +0200 Subject: Adding upstream version 3.7.10. Signed-off-by: Daniel Baumann --- html/SCHEDULER_README.html | 1839 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1839 insertions(+) create mode 100644 html/SCHEDULER_README.html (limited to 'html/SCHEDULER_README.html') diff --git a/html/SCHEDULER_README.html b/html/SCHEDULER_README.html new file mode 100644 index 0000000..f303f33 --- /dev/null +++ b/html/SCHEDULER_README.html @@ -0,0 +1,1839 @@ + + + + + + +Postfix Queue Scheduler + + + + + + + +

Postfix +Queue Scheduler

+ +
+ +

Disclaimer

+ +

Many of the transport-specific configuration parameters +discussed in this document will not show up in "postconf" command +output before Postfix version 2.9. This limitation applies to many +parameters whose name is a combination of a master.cf service name +such as "relay" and a built-in suffix such as +"_destination_concurrency_limit".

+ +

Overview

+ +

The queue manager is by far the most complex part of the Postfix +mail system. It schedules delivery of new mail, retries failed +deliveries at specific times, and removes mail from the queue after +the last delivery attempt. There are two major classes of mechanisms +that control the operation of the queue manager.

+ +

Topics covered by this document:

+ + + + + +

Concurrency scheduling

+ +

The following sections document the Postfix 2.5 concurrency +scheduler, after a discussion of the limitations of the earlier +concurrency scheduler. This is followed by results of medium-concurrency +experiments, and a discussion of trade-offs between performance and +robustness.

+ +

The material is organized as follows:

+ + + +

Drawbacks of the existing +concurrency scheduler

+ +

From the start, Postfix has used a simple but robust algorithm +where the per-destination delivery concurrency is decremented by 1 +after delivery failed due to connection or handshake failure, and +incremented by 1 otherwise. Of course the concurrency is never +allowed to exceed the maximum per-destination concurrency limit. +And when a destination's concurrency level drops to zero, the +destination is declared "dead" and delivery is suspended.

+ +

Drawbacks of +/-1 concurrency feedback per delivery are:

+ +

+ +

(*) A pseudo-cohort is a number of delivery requests equal to +a destination's delivery concurrency.

+ +

The revised concurrency scheduler has a highly modular structure. +It uses separate mechanisms for per-destination concurrency control +and for "dead destination" detection. The concurrency control in +turn is built from two separate mechanisms: it supports less-than-1 +feedback per delivery to allow for more gradual concurrency +adjustments, and it uses feedback hysteresis to suppress concurrency +oscillations. And instead of waiting for delivery concurrency to +throttle down to zero, a destination is declared "dead" after a +configurable number of pseudo-cohorts reports connection or handshake +failure.

+ +

Summary of the Postfix 2.5 +concurrency feedback algorithm

+ +

We want to increment a destination's delivery concurrency when +some (not necessarily consecutive) number of deliveries complete +without connection or handshake failure. This is implemented with +positive feedback g(N) where N is the destination's delivery +concurrency. With g(N)=1 feedback per delivery, concurrency increases +by 1 after each positive feedback event; this gives us the old +scheduler's exponential growth in time. With g(N)=1/N feedback per +delivery, concurrency increases by 1 after an entire pseudo-cohort +N of positive feedback reports; this gives us linear growth in time. +Less-than-1 feedback per delivery and integer truncation naturally +give us hysteresis, so that transitions to larger concurrency happen +every 1/g(N) positive feedback events.

+ +

We want to decrement a destination's delivery concurrency when +some (not necessarily consecutive) number of deliveries complete +after connection or handshake failure. This is implemented with +negative feedback f(N) where N is the destination's delivery +concurrency. With f(N)=1 feedback per delivery, concurrency decreases +by 1 after each negative feedback event; this gives us the old +scheduler's behavior where concurrency is throttled down dramatically +after a single pseudo-cohort failure. With f(N)=1/N feedback per +delivery, concurrency backs off more gently. Again, less-than-1 +feedback per delivery and integer truncation naturally give us +hysteresis, so that transitions to lower concurrency happen every +1/f(N) negative feedback events.

+ +

However, with negative feedback we introduce a subtle twist. +We "reverse" the negative hysteresis cycle so that the transition +to lower concurrency happens at the beginning of a sequence +of 1/f(N) negative feedback events. Otherwise, a correction for +overload would be made too late. This makes the choice of f(N) +relatively unimportant, as borne out by measurements later in this +document.

+ +

In summary, the main ingredients for the Postfix 2.5 concurrency +feedback algorithm are a) the option of less-than-1 positive feedback +per delivery to avoid overwhelming servers, b) the option of +less-than-1 negative feedback per delivery to avoid giving up too +fast, c) feedback hysteresis to avoid rapid oscillation, and d) a +"reverse" hysteresis cycle for negative feedback, so that it can +correct for overload quickly.

+ +

Summary of the Postfix 2.5 "dead destination" detection algorithm

+ +

We want to suspend deliveries to a specific destination after +some number of deliveries suffers connection or handshake failure. +The old scheduler declares a destination "dead" when negative (-1) +feedback throttles the delivery concurrency down to zero. With +less-than-1 feedback per delivery, this throttling down would +obviously take too long. We therefore have to separate "dead +destination" detection from concurrency feedback. This is implemented +by introducing the concept of pseudo-cohort failure. The Postfix +2.5 concurrency scheduler declares a destination "dead" after a +configurable number of pseudo-cohorts suffers from connection or +handshake failures. The old scheduler corresponds to the special +case where the pseudo-cohort failure limit is equal to 1.

+ +

Pseudocode for the Postfix 2.5 concurrency scheduler

+ +

The pseudo code shows how the ideas behind new concurrency +scheduler are implemented as of November 2007. The actual code can +be found in the module qmgr/qmgr_queue.c.

+ +
+Types:
+        Each destination has one set of the following variables
+        int concurrency
+        double success
+        double failure
+        double fail_cohorts
+
+Feedback functions:
+        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
+        positive feedback: g(N) = x/N | x/sqrt(N) | x
+        negative feedback: f(N) = y/N | y/sqrt(N) | y
+
+Initialization:
+        concurrency = initial_concurrency
+        success = 0
+        failure = 0
+        fail_cohorts = 0
+
+After success:
+        fail_cohorts = 0
+        Be prepared for feedback > hysteresis, or rounding error
+        success += g(concurrency)
+        while (success >= 1)            Hysteresis 1
+            concurrency += 1            Hysteresis 1
+            failure = 0
+            success -= 1                Hysteresis 1
+        Be prepared for overshoot
+        if (concurrency > concurrency limit)
+            concurrency = concurrency limit
+
+Safety:
+        Don't apply positive feedback unless
+            concurrency < busy_refcount + init_dest_concurrency
+        otherwise negative feedback effect could be delayed
+
+After failure:
+        if (concurrency > 0)
+            fail_cohorts += 1.0 / concurrency
+            if (fail_cohorts > cohort_failure_limit)
+                concurrency = 0
+        if (concurrency > 0)
+            Be prepared for feedback > hysteresis, rounding errors
+            failure -= f(concurrency)
+            while (failure < 0)
+                concurrency -= 1        Hysteresis 1
+                failure += 1            Hysteresis 1
+                success = 0
+            Be prepared for overshoot
+            if (concurrency < 1)
+                concurrency = 1
+
+ +

Results for delivery to concurrency-limited servers

+ +

Discussions about the concurrency scheduler redesign started +early 2004, when the primary goal was to find alternatives that did +not exhibit exponential growth or rapid concurrency throttling. No +code was implemented until late 2007, when the primary concern had +shifted towards better handling of server concurrency limits. For +this reason we measure how well the new scheduler does this +job. The table below compares mail delivery performance of the old ++/-1 feedback per delivery with several less-than-1 feedback +functions, for different limited-concurrency server scenarios. +Measurements were done with a FreeBSD 6.2 client and with FreeBSD +6.2 and various Linux servers.

+ +

Server configuration:

+ + + +

Client configuration:

+ + + +

Impact of the 30s SMTP connect timeout

+ +

The first results are for a FreeBSD 6.2 server, where our +artificially low listen(2) backlog results in a very short kernel +queue for established connections. The table shows that all deferred +deliveries failed due to a 30s connection timeout, and none failed +due to a server greeting timeout. This measurement simulates what +happens when the server's connection queue is completely full under +load, and the TCP engine drops new connections.

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + +
client
limit
server
limit
feedback
+style
connection
caching
percentage
+deferred
client concurrency
average/stddev
timed-out in
connect/greeting

20 5 1/N no 9.9 19.4 0.49 198 -
20 5 1/N yes 10.3 19.4 0.49 206 -
20 5 1/sqrt(N) no10.4 19.6 0.59 208 -
20 5 1/sqrt(N) yes10.6 19.6 0.61 212 -
20 5 1 no 10.1 19.5 1.29 202 -
20 5 1 yes 10.8 19.3 1.57 216 -

+ +

A busy server with a completely full connection queue. N is +the client delivery concurrency. Failed deliveries time out after +30s without completing the TCP handshake. See text for a discussion +of results.

+ +
+ +

Impact of the 300s SMTP greeting timeout

+ +

The next table shows results for a Fedora Core 8 server (results +for RedHat 7.3 are identical). In this case, the artificially small +listen(2) backlog argument does not impact our measurement. The +table shows that practically all deferred deliveries fail after the +300s SMTP greeting timeout. As these timeouts were 10x longer than +with the first measurement, we increased the recipient count (and +thus the running time) by a factor of 10 to keep the results +comparable. The deferred mail percentages are a factor 10 lower +than with the first measurement, because the 1s per-recipient delay +was 1/300th of the greeting timeout instead of 1/30th of the +connection timeout.

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + +
client
limit
server
limit
feedback
+style
connection
caching
percentage
+deferred
client concurrency
average/stddev
timed-out in
connect/greeting

20 5 1/N no 1.16 19.8 0.37 - 230
20 5 1/N yes 1.36 19.8 0.36 - 272
20 5 1/sqrt(N) no1.21 19.9 0.23 4 238
20 5 1/sqrt(N) yes1.36 20.0 0.23 - 272
20 5 1 no 1.18 20.0 0.16 - 236
20 5 1 yes 1.39 20.0 0.16 - 278

+ +

A busy server with a non-full connection queue. N is the client +delivery concurrency. Failed deliveries complete at the TCP level, +but time out after 300s while waiting for the SMTP greeting. See +text for a discussion of results.

+ +
+ +

Impact of active server concurrency limiter

+ +

The final concurrency-limited result shows what happens when +SMTP connections don't time out, but are rejected immediately with +the Postfix server's smtpd_client_connection_count_limit feature +(the server replies with a 421 status and disconnects immediately). +Similar results can be expected with concurrency limiting features +built into other MTAs or firewalls. For this measurement we specified +a server concurrency limit and a client initial destination concurrency +of 5, and a server process limit of 10; all other conditions were +the same as with the first measurement. The same result would be +obtained with a FreeBSD or Linux server, because the "pushing back" +is done entirely by the receiving side.

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + +
client
limit
server
limit
feedback
+style
connection
caching
percentage
+deferred
client concurrency
average/stddev
theoretical
defer rate

20 5 1/N no 16.5 5.17 0.38 1/6
20 5 1/N yes 16.5 5.17 0.38 1/6
20 5 1/sqrt(N) no24.5 5.28 0.45 1/4
20 5 1/sqrt(N) yes24.3 5.28 0.46 1/4
20 5 1 no 49.7 5.63 0.67 1/2
20 5 1 yes 49.7 5.68 0.70 1/2

+ +

A server with active per-client concurrency limiter that replies +with 421 and disconnects. N is the client delivery concurrency. +The theoretical defer rate is 1/(1+roundup(1/feedback)). This is +always 1/2 with the fixed +/-1 feedback per delivery; with the +concurrency-dependent feedback variants, the defer rate decreases +with increasing concurrency. See text for a discussion of results. +

+ +
+ +

Discussion of concurrency-limited server results

+ +

All results in the previous sections are based on the first +delivery runs only; they do not include any second etc. delivery +attempts. It's also worth noting that the measurements look at +steady-state behavior only. They don't show what happens when the +client starts sending at a much higher or lower concurrency. +

+ +

The first two examples show that the effect of feedback +is negligible when concurrency is limited due to congestion. This +is because the initial concurrency is already at the client's +concurrency maximum, and because there is 10-100 times more positive +than negative feedback. Under these conditions, it is no surprise +that the contribution from SMTP connection caching is also negligible. +

+ +

In the last example, the old +/-1 feedback per delivery will +defer 50% of the mail when confronted with an active (anvil-style) +server concurrency limit, where the server hangs up immediately +with a 421 status (a TCP-level RST would have the same result). +Less aggressive feedback mechanisms fare better than more aggressive +ones. Concurrency-dependent feedback fares even better at higher +concurrencies than shown here, but has limitations as discussed in +the next section.

+ +

Limitations of less-than-1 per delivery feedback

+ +

Less-than-1 feedback is of interest primarily when sending large +amounts of mail to destinations with active concurrency limiters +(servers that reply with 421, or firewalls that send RST). When +sending small amounts of mail per destination, less-than-1 per-delivery +feedback won't have a noticeable effect on the per-destination +concurrency, because the number of deliveries to the same destination +is too small. You might just as well use zero per-delivery feedback +and stay with the initial per-destination concurrency. And when +mail deliveries fail due to congestion instead of active concurrency +limiters, the measurements above show that per-delivery feedback +has no effect. With large amounts of mail you might just as well +use zero per-delivery feedback and start with the maximal per-destination +concurrency.

+ +

The scheduler with less-than-1 concurrency +feedback per delivery solves a problem with servers that have active +concurrency limiters. This works only because feedback is handled +in a peculiar manner: positive feedback will increment the concurrency +by 1 at the end of a sequence of events of length 1/feedback, +while negative feedback will decrement concurrency by 1 at the +beginning of such a sequence. This is how Postfix adjusts +quickly for overshoot without causing lots of mail to be deferred. +Without this difference in feedback treatment, less-than-1 feedback +per delivery would defer 50% of the mail, and would be no better +in this respect than the old +/-1 feedback per delivery.

+ +

Unfortunately, the same feature that corrects quickly for +concurrency overshoot also makes the scheduler more sensitive for +noisy negative feedback. The reason is that one lonely negative +feedback event has the same effect as a complete sequence of length +1/feedback: in both cases delivery concurrency is dropped by 1 +immediately. As a worst-case scenario, consider multiple servers +behind a load balancer on a single IP address, and no backup MX +address. When 1 out of K servers fails to complete the SMTP handshake +or drops the connection, a scheduler with 1/N (N = concurrency) +feedback stops increasing its concurrency once it reaches a concurrency +level of about K, even though the good servers behind the load +balancer are perfectly capable of handling more traffic.

+ +

This noise problem gets worse as the amount of positive feedback +per delivery gets smaller. A compromise is to use fixed less-than-1 +positive feedback values instead of concurrency-dependent positive +feedback. For example, to tolerate 1 of 4 bad servers in the above +load balancer scenario, use positive feedback of 1/4 per "good" +delivery (no connect or handshake error), and use an equal or smaller +amount of negative feedback per "bad" delivery. The downside of +using concurrency-independent feedback is that some of the old +/-1 +feedback problems will return at large concurrencies. Sites that +must deliver mail at non-trivial per-destination concurrencies will +require special configuration.

+ +

Concurrency configuration parameters

+ +

The Postfix 2.5 concurrency scheduler is controlled with the +following configuration parameters, where "transport_foo" +provides a transport-specific parameter override. All parameter +default settings are compatible with earlier Postfix versions.

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + +
Parameter name Postfix version +Description

initial_destination_concurrency
+transport_initial_destination_concurrency
all
2.5
Initial per-destination +delivery concurrency
default_destination_concurrency_limit
+transport_destination_concurrency_limit
+all
all
Maximum per-destination delivery concurrency +
default_destination_concurrency_positive_feedback
+transport_destination_concurrency_positive_feedback
2.5
2.5
Per-destination positive +feedback amount, per delivery that does not fail with connection +or handshake failure
default_destination_concurrency_negative_feedback
+transport_destination_concurrency_negative_feedback
2.5
2.5
Per-destination negative +feedback amount, per delivery that fails with connection or handshake +failure
default_destination_concurrency_failed_cohort_limit
+transport_destination_concurrency_failed_cohort_limit
2.5
2.5
Number of failed +pseudo-cohorts after which a destination is declared "dead" and +delivery is suspended
destination_concurrency_feedback_debug +2.5 Enable verbose logging of concurrency scheduler +activity

+ +
+ +

Preemptive scheduling

+ +

+ +The following sections describe the new queue manager and its +preemptive scheduler algorithm. Note that the document was originally +written to describe the changes between the new queue manager (in +this text referred to as nqmgr, the name it was known by +before it became the default queue manager) and the old queue manager +(referred to as oqmgr). This is why it refers to oqmgr +every so often. + +

+ +

+ +This document is divided into sections as follows: + +

+ + + +

The structures used by +nqmgr

+ +

+ +Let's start by recapitulating the structures and terms used when +referring to the queue manager and how it operates. Many of these are +partially described elsewhere, but it is nice to have a coherent +overview in one place: + +

+ + + +

+ +The first four structures are common to both nqmgr and +oqmgr, the latter two were introduced by nqmgr. + +

+ +

+ +These terms are used extensively in the text below, feel free to +look up the description above anytime you'll feel you have lost a +sense what is what. + +

+ +

What happens when nqmgr picks +up the message

+ +

+ +Whenever nqmgr moves a queue file into the "active" queue, +the following happens: It reads all necessary information from the +queue file as oqmgr does, and also reads as many recipients +as possible - more on that later, for now let's just pretend it +always reads all recipients. + +

+ +

+ +Then it resolves the recipients as oqmgr does, which +means obtaining (address, nexthop, transport) triple for each +recipient. For each triple, it finds the transport; if it does not +exist yet, it instantiates it (unless it's dead). Within the +transport, it finds the destination queue for the given nexthop; if it +does not exist yet, it instantiates it (unless it's dead). The +triple is then bound to given destination queue. This happens in +qmgr_resolve() and is basically the same as in oqmgr. + +

+ +

+ +Then for each triple which was bound to some queue (and thus +transport), the program finds the job which represents the message +within that transport's context; if it does not exist yet, it +instantiates it. Within the job, it finds the peer which represents +the bound destination queue within this jobs context; if it does +not exist yet, it instantiates it. Finally, it stores the address +from the resolved triple to the recipient entry which is appended +to both the queue entry list and the peer entry list. The addresses +for the same nexthop are batched in the entries up to the +transport_destination_recipient_limit for that transport. +This happens in qmgr_message_assign(), and apart +from that it operates with job and peer structures, it is basically the +same as in oqmgr. + +

+ +

+ +When the job is instantiated, it is enqueued on the transport's job +list based on the time its message was picked up by nqmgr. +For first batch of recipients this means it is appended to the end +of the job list, but the ordering of the job list by the enqueue +time is important as we will see shortly. + +

+ +

+ +[Now you should have a pretty good idea what the state of the +nqmgr is after a couple of messages were picked up, and what the +relation is between all those job, peer, queue and entry structures.] + +

+ +

How the entry selection +works

+ +

+ +Having prepared all those above mentioned structures, the task of +the nqmgr's scheduler is to choose the recipient entries +one at a time and pass them to the delivery agent for corresponding +transport. Now how does this work? + +

+ +

+ +The first approximation of the new scheduling algorithm is like this: + +

+ +
+
+foreach transport (round-robin-by-transport)
+do
+    if transport busy continue
+    if transport process limit reached continue
+    foreach transport's job (in the order of the transport's job list)
+    do
+        foreach job's peer (round-robin-by-destination)
+             if peer->queue->concurrency < peer->queue->window
+                 return next peer entry.
+        done
+    done
+done
+
+
+ +

+ +Now what is the "order of the transport's job list"? As we know +already, the job list is by default kept in the order the message +was picked up by the nqmgr. So by default we get the +top-level round-robin transport, and within each transport we get +the FIFO message delivery. The round-robin of the peers by the +destination is perhaps of little importance in most real-life cases +(unless the transport_destination_recipient_limit is reached, +in one job there +is only one peer structure for each destination), but theoretically +it makes sure that even within single jobs, destinations are treated +fairly. + +

+ +

+ +[By now you should have a feeling you really know how the scheduler +works, except for the preemption, under ideal conditions - that is, +no recipient resource limits and no destination concurrency problems.] + +

+ +

How the preemption +works

+ +

+ +As you might perhaps expect by now, the transport's job list does +not remain sorted by the job's message enqueue time all the time. +The most cool thing about nqmgr is not the simple FIFO +delivery, but that it is able to slip mail with little recipients +past the mailing-list bulk mail. This is what the job preemption +is about - shuffling the jobs on the transport's job list to get +the best message delivery rates. Now how is it achieved? + +

+ +

+ +First I have to tell you that there are in fact two job lists in +each transport. One is the scheduler's job list, which the scheduler +is free to play with, while the other one keeps the jobs always +listed in the order of the enqueue time and is used for recipient +pool management we will discuss later. For now, we will deal with +the scheduler's job list only. + +

+ +

+ +So, we have the job list, which is first ordered by the time the +jobs' messages were enqueued, oldest messages first, the most recently +picked one at the end. For now, let's assume that there are no +destination concurrency problems. Without preemption, we pick some +entry of the first (oldest) job on the queue, assign it to delivery +agent, pick another one from the same job, assign it again, and so +on, until all the entries are used and the job is delivered. We +would then move onto the next job and so on and on. Now how do we +manage to sneak in some entries from the recently added jobs when +the first job on the job list belongs to a message going to the +mailing-list and has thousands of recipient entries? + +

+ +

+ +The nqmgr's answer is that we can artificially "inflate" +the delivery time of that first job by some constant for free - it +is basically the same trick you might remember as "accumulation of +potential" from the amortized complexity lessons. For example, +instead of delivering the entries of the first job on the job list +every time a delivery agent becomes available, we can do it only +every second time. If you view the moments the delivery agent becomes +available on a timeline as "delivery slots", then instead of using +every delivery slot for the first job, we can use only every other +slot, and still the overall delivery efficiency of the first job +remains the same. So the delivery 11112222 becomes +1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . +denotes the free slot). Now what do we do with free slots? + +

+ +

+ +As you might have guessed, we will use them for sneaking the mail +with little recipients in. For example, if we have one four-recipient +mail followed by four one recipients mail, the delivery sequence +(that is, the sequence in which the jobs are assigned to the +delivery slots) might look like this: 12131415. Hmm, fine +for sneaking in the single recipient mail, but how do we sneak in +the mail with more than one recipient? Say if we have one four-recipient +mail followed by two two-recipient mails? + +

+ +

+ +The simple answer would be to use delivery sequence 12121313. +But the problem is that this does not scale well. Imagine you have +mail with a thousand recipients followed by mail with a hundred recipients. +It is tempting to suggest the delivery sequence like 121212...., +but alas! Imagine there arrives another mail with say ten recipients. +But there are no free slots anymore, so it can't slip by, not even +if it had only one recipient. It will be stuck until the +hundred-recipient mail is delivered, which really sucks. + +

+ +

+ +So, it becomes obvious that while inflating the message to get +free slots is a great idea, one has to be really careful of how the +free slots are assigned, otherwise one might corner himself. So, +how does nqmgr really use the free slots? + +

+ +

+ +The key idea is that one does not have to generate the free slots +in a uniform way. The delivery sequence 111...1 is no +worse than 1.1.1.1, in fact, it is even better as some +entries are in the first case selected earlier than in the second +case, and none is selected later! So it is possible to first +"accumulate" the free delivery slots and then use them all at once. +It is even possible to accumulate some, then use them, then accumulate +some more and use them again, as in 11..1.1 . + +

+ +

+ +Let's get back to the one hundred recipient example. We now know +that we could first accumulate one hundred free slots, and only +after then to preempt the first job and sneak the one hundred +recipient mail in. Applying the algorithm recursively, we see the +hundred recipient job can accumulate ten free delivery slots, and +then we could preempt it and sneak in the ten-recipient mail... +Wait wait wait! Could we? Aren't we overinflating the original one +thousand recipient mail? + +

+ +

+ +Well, despite the fact that it looks so at the first glance, another trick will +allow us to answer "no, we are not!". If we had said that we will +inflate the delivery time twice at maximum, and then we consider +every other slot as a free slot, then we would overinflate in case +of the recursive preemption. BUT! The trick is that if we use only +every n-th slot as a free slot for n>2, there is always some worst +inflation factor which we can guarantee not to be breached, even +if we apply the algorithm recursively. To be precise, if for every +k>1 normally used slots we accumulate one free delivery slot, than +the inflation factor is not worse than k/(k-1) no matter how many +recursive preemptions happen. And it's not worse than (k+1)/k if +only non-recursive preemption happens. Now, having got through the +theory and the related math, let's see how nqmgr implements +this. + +

+ +

+ +Each job has so called "available delivery slot" counter. Each +transport has a transport_delivery_slot_cost parameter, which +defaults to default_delivery_slot_cost parameter which is set to 5 +by default. This is the k from the paragraph above. Each time k +entries of the job are selected for delivery, this counter is +incremented by one. Once there are some slots accumulated, a job which +requires no more than that number of slots to be fully delivered +can preempt this job. + +

+ +

+ +[Well, the truth is, the counter is incremented every time an entry +is selected and it is divided by k when it is used. +But to understand, it's good enough to use +the above approximation of the truth.] + +

+ +

+ +OK, so now we know the conditions which must be satisfied so one +job can preempt another one. But what job gets preempted, how do +we choose what job preempts it if there are several valid candidates, +and when does all this exactly happen? + +

+ +

+ +The answer for the first part is simple. The job whose entry was +selected the last time is the so called current job. Normally, it is +the first job on the scheduler's job list, but destination concurrency +limits may change this as we will see later. It is always only the +current job which may get preempted. + +

+ +

+ +Now for the second part. The current job has a certain amount of +recipient entries, and as such may accumulate at maximum some amount +of available delivery slots. It might have already accumulated some, +and perhaps even already used some when it was preempted before +(remember a job can be preempted several times). In either case, +we know how many are accumulated and how many are left to deliver, +so we know how many it may yet accumulate at maximum. Every other +job which may be delivered by less than that number of slots is a +valid candidate for preemption. How do we choose among them? + +

+ +

+ +The answer is - the one with maximum enqueue_time/recipient_entry_count. +That is, the older the job is, the more we should try to deliver +it in order to get best message delivery rates. These rates are of +course subject to how many recipients the message has, therefore +the division by the recipient (entry) count. No one shall be surprised +that a message with n recipients takes n times longer to deliver than +a message with one recipient. + +

+ +

+ +Now let's recap the previous two paragraphs. Isn't it too complicated? +Why don't the candidates come only among the jobs which can be +delivered within the number of slots the current job already +accumulated? Why do we need to estimate how much it has yet to +accumulate? If you found out the answer, congratulate yourself. If +we did it this simple way, we would always choose the candidate +with the fewest recipient entries. If there were enough single recipient +mails coming in, they would always slip by the bulk mail as soon +as possible, and the two or more recipients mail would never get +a chance, no matter how long they have been sitting around in the +job list. + +

+ +

+ +This candidate selection has an interesting implication - that when +we choose the best candidate for preemption (this is done in +qmgr_choose_candidate()), it may happen that we may not use it for +preemption immediately. This leads to an answer to the last part +of the original question - when does the preemption happen? + +

+ +

+ +The preemption attempt happens every time next transport's recipient +entry is to be chosen for delivery. To avoid needless overhead, the +preemption is not attempted if the current job could never accumulate +more than transport_minimum_delivery_slots (defaults to +default_minimum_delivery_slots which defaults to 3). If there are +already enough accumulated slots to preempt the current job by the +chosen best candidate, it is done immediately. This basically means +that the candidate is moved in front of the current job on the +scheduler's job list and decreasing the accumulated slot counter +by the amount used by the candidate. If there are not enough slots... +well, I could say that nothing happens and the another preemption +is attempted the next time. But that's not the complete truth. + +

+ +

+ +The truth is that it turns out that it is not really necessary to +wait until the jobs counter accumulates all the delivery slots in +advance. Say we have ten-recipient mail followed by two two-recipient +mails. If the preemption happened when enough delivery slots accumulate +(assuming slot cost 2), the delivery sequence becomes +11112211113311. Now what would we get if we would wait +only for 50% of the necessary slots to accumulate and we promise +we would wait for the remaining 50% later, after we get back +to the preempted job? If we use such a slot loan, the delivery sequence +becomes 11221111331111. As we can see, it makes it not +considerably worse for the delivery of the ten-recipient mail, but +it allows the small messages to be delivered sooner. + +

+ +

+ +The concept of these slot loans is where the +transport_delivery_slot_discount and +transport_delivery_slot_loan come from (they default to +default_delivery_slot_discount and default_delivery_slot_loan, whose +values are by default 50 and 3, respectively). The discount (resp. +loan) specifies how many percent (resp. how many slots) one "gets +in advance", when the number of slots required to deliver the best +candidate is compared with the number of slots the current slot had +accumulated so far. + +

+ +

+ +And that pretty much concludes this chapter. + +

+ +

+ +[Now you should have a feeling that you pretty much understand the +scheduler and the preemption, or at least that you will have +after you read the last chapter a couple more times. You shall clearly +see the job list and the preemption happening at its head, in ideal +delivery conditions. The feeling of understanding shall last until +you start wondering what happens if some of the jobs are blocked, +which you might eventually figure out correctly from what had been +said already. But I would be surprised if your mental image of the +scheduler's functionality is not completely shattered once you +start wondering how it works when not all recipients may be read +in-core. More on that later.] + +

+ +

How destination concurrency +limits affect the scheduling algorithm

+ +

+ +The nqmgr uses the same algorithm for destination concurrency +control as oqmgr. Now what happens when the destination +limits are reached and no more entries for that destination may be +selected by the scheduler? + +

+ +

+ +From the user's point of view it is all simple. If some of the peers +of a job can't be selected, those peers are simply skipped by the +entry selection algorithm (the pseudo-code described before) and +only the selectable ones are used. If none of the peers may be +selected, the job is declared a "blocker job". Blocker jobs are +skipped by the entry selection algorithm and they are also excluded +from the candidates for preemption of the current job. Thus the scheduler +effectively behaves as if the blocker jobs didn't exist on the job +list at all. As soon as at least one of the peers of a blocker job +becomes unblocked (that is, the delivery agent handling the delivery +of the recipient entry for the given destination successfully finishes), +the job's blocker status is removed and the job again participates +in all further scheduler actions normally. + +

+ +

+ +So the summary is that the users don't really have to be concerned +about the interaction of the destination limits and scheduling +algorithm. It works well on its own and there are no knobs they +would need to control it. + +

+ +

+ +From a programmer's point of view, the blocker jobs complicate the +scheduler quite a lot. Without them, the jobs on the job list would +be normally delivered in strict FIFO order. If the current job is +preempted, the job preempting it is completely delivered unless it +is preempted itself. Without blockers, the current job is thus +always either the first job on the job list, or the top of the stack +of jobs preempting the first job on the job list. + +

+ +

+ +The visualization of the job list and the preemption stack without +blockers would be like this: + +

+ +
+
+first job->    1--2--3--5--6--8--...    <- job list
+on job list    |
+               4    <- preemption stack
+               |
+current job->  7
+
+
+ +

+ +In the example above we see that job 1 was preempted by job 4 and +then job 4 was preempted by job 7. After job 7 is completed, remaining +entries of job 4 are selected, and once they are all selected, job +1 continues. + +

+ +

+ +As we see, it's all very clean and straightforward. Now how does +this change because of blockers? + +

+ +

+ +The answer is: a lot. Any job may become a blocker job at any time, +and also become a normal job again at any time. This has several +important implications: + +

+ +
    + +
  1. + +The jobs may be completed in arbitrary order. For example, in the +example above, if the current job 7 becomes blocked, the next job +4 may complete before the job 7 becomes unblocked again. Or if both +7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked +and is completed, then 2 is completed and only after that 4 becomes +unblocked and is completed... You get the idea. + +

    + +

    + +[Interesting side note: even when jobs are delivered out of order, +from a single destination's point of view the jobs are still delivered +in the expected order (that is, FIFO unless there was some preemption +involved). This is because whenever a destination queue becomes +unblocked (the destination limit allows selection of more recipient +entries for that destination), all jobs which have peers for that +destination are unblocked at once.] + +

    + +
  2. + +The idea of the preemption stack at the head of the job list is +gone. That is, it must be possible to preempt any job on the job +list. For example, if the jobs 7, 4, 1 and 2 in the example above +become all blocked, job 3 becomes the current job. And of course +we do not want the preemption to be affected by the fact that there +are some blocked jobs or not. Therefore, if it turns out that job +3 might be preempted by job 6, the implementation shall make it +possible. + +

    + +
  3. + +The idea of the linear preemption stack itself is gone. It's no +longer true that one job is always preempted by only one job at one +time (that is directly preempted, not counting the recursively +nested jobs). For example, in the example above, job 1 is directly +preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes +blocked, and job 4 is being delivered. If it accumulates enough +delivery slots, it is natural that it might be preempted for example +by job 8. Now job 4 is preempted by both job 7 AND job 8 at the +same time. + +

    + +
+ +

+ +Now combine the points 2) and 3) with point 1) again and you realize +that the relations on the once linear job list became pretty +complicated. If we extend the point 3) example: jobs 7 and 8 preempt +job 4, now job 8 becomes blocked too, then job 4 completes. Tricky, +huh? + +

+ +

+ +If I illustrate the relations after the above mentioned examples +(but those in point 1), the situation would look like this: + +

+ +
+
+                            v- parent
+
+adoptive parent ->    1--2--3--5--...      <- "stack" level 0
+                      |     |
+parent gone ->        ?     6              <- "stack" level 1
+                     / \
+children ->         7   8   ^- child       <- "stack" level 2
+
+                      ^- siblings
+
+
+ +

+ +Now how does nqmgr deal with all these complicated relations? + +

+ +

+ +Well, it maintains them all as described, but fortunately, all these +relations are necessary only for the purpose of proper counting of +available delivery slots. For the purpose of ordering the jobs for +entry selection, the original rule still applies: "the job preempting +the current job is moved in front of the current job on the job +list". So for entry selection purposes, the job relations remain +as simple as this: + +

+ +
+
+7--8--1--2--6--3--5--..   <- scheduler's job list order
+
+
+ +

+ +The job list order and the preemption parent/child/siblings relations +are maintained separately. And because the selection works only +with the job list, you can happily forget about those complicated +relations unless you want to study the nqmgr sources. In +that case the text above might provide some helpful introduction +to the problem domain. Otherwise I suggest you just forget about +all this and stick with the user's point of view: the blocker jobs +are simply ignored. + +

+ +

+ +[By now, you should have a feeling that there are more things going +on under the hood than you ever wanted to know. You decide that +forgetting about this chapter is the best you can do for the sake +of your mind's health and you basically stick with the idea how the +scheduler works in ideal conditions, when there are no blockers, +which is good enough.] + +

+ +

Dealing with memory resource +limits

+ +

+ +When discussing the nqmgr scheduler, we have so far assumed +that all recipients of all messages in the "active" queue are completely +read into memory. This is simply not true. There is an upper +bound on the amount of memory the nqmgr may use, and +therefore it must impose some limits on the information it may store +in memory at any given time. + +

+ +

+ +First of all, not all messages may be read in-core at once. At any +time, only qmgr_message_active_limit messages may be read in-core +at maximum. When read into memory, the messages are picked from the +"incoming" and "deferred" queues and moved to the "active" queue +(incoming having priority), so if there are more than +qmgr_message_active_limit messages queued in the "active" queue, the +rest will have to wait until (some of) the messages in the "active" queue +are completely delivered (or deferred). + +

+ +

+ +Even with the limited amount of in-core messages, there is another +limit which must be imposed in order to avoid memory exhaustion. +Each message may contain a huge number of recipients (tens or hundreds +of thousands are not uncommon), so if nqmgr read all +recipients of all messages in the "active" queue, it may easily run +out of memory. Therefore there must be some upper bound on the +amount of message recipients which are read into memory at the +same time. + +

+ +

+ +Before discussing how exactly nqmgr implements the recipient +limits, let's see how the sole existence of the limits themselves +affects the nqmgr and its scheduler. + +

+ +

+ +The message limit is straightforward - it just limits the size of +the +lookahead the nqmgr's scheduler has when choosing which +message can preempt the current one. Messages not in the "active" queue +are simply not considered at all. + +

+ +

+ +The recipient limit complicates more things. First of all, the +message reading code must support reading the recipients in batches, +which among other things means accessing the queue file several +times and continuing where the last recipient batch ended. This is +invoked by the scheduler whenever the current job has space for more +recipients, subject to transport's refill_limit and refill_delay parameters. +It is also done any time when all +in-core recipients of the message are dealt with (which may also +mean they were deferred) but there are still more in the queue file. + +

+ +

+ +The second complication is that with some recipients left unread +in the queue file, the scheduler can't operate with exact counts +of recipient entries. With unread recipients, it is not clear how +many recipient entries there will be, as they are subject to +per-destination grouping. It is not even clear to what transports +(and thus jobs) the recipients will be assigned. And with messages +coming from the "deferred" queue, it is not even clear how many unread +recipients are still to be delivered. This all means that the +scheduler must use only estimates of how many recipients entries +there will be. Fortunately, it is possible to estimate the minimum +and maximum correctly, so the scheduler can always err on the safe +side. Obviously, the better the estimates, the better the results, so +it is best when we are able to read all recipients in-core and turn +the estimates into exact counts, or at least try to read as many +as possible to make the estimates as accurate as possible. + +

+ +

+ +The third complication is that it is no longer true that the scheduler +is done with a job once all of its in-core recipients are delivered. +It is possible that the job will be revived later, when another +batch of recipients is read in core. It is also possible that some +jobs will be created for the first time long after the first batch +of recipients was read in core. The nqmgr code must be +ready to handle all such situations. + +

+ +

+ +And finally, the fourth complication is that the nqmgr +code must somehow impose the recipient limit itself. Now how does +it achieve it? + +

+ +

+ +Perhaps the easiest solution would be to say that each message may +have at maximum X recipients stored in-core, but such a solution would +be poor for several reasons. With reasonable qmgr_message_active_limit +values, the X would have to be quite low to maintain a reasonable +memory footprint. And with low X lots of things would not work well. +The nqmgr would have problems to use the +transport_destination_recipient_limit efficiently. The +scheduler's preemption would be suboptimal as the recipient count +estimates would be inaccurate. The message queue file would have +to be accessed many times to read in more recipients again and +again. + +

+ +

+ +Therefore it seems reasonable to have a solution which does not use +a limit imposed on a per-message basis, but which maintains a pool +of available recipient slots, which can be shared among all messages +in the most efficient manner. And as we do not want separate +transports to compete for resources whenever possible, it seems +appropriate to maintain such a recipient pool for each transport +separately. This is the general idea, now how does it work in +practice? + +

+ +

+ +First we have to solve a little chicken-and-egg problem. If we want +to use the per-transport recipient pools, we first need to know to +what transport(s) the message is assigned. But we will find that +out only after we first read in the recipients. So it is obvious +that we first have to read in some recipients, use them to find out +to what transports the message is to be assigned, and only after +that can we use the per-transport recipient pools. + +

+ +

+ +Now how many recipients shall we read for the first time? This is +what qmgr_message_recipient_minimum and qmgr_message_recipient_limit +values control. The qmgr_message_recipient_minimum value specifies +how many recipients of each message we will read the first time, +no matter what. It is necessary to read at least one recipient +before we can assign the message to a transport and create the first +job. However, reading only qmgr_message_recipient_minimum recipients +even if there are only few messages with few recipients in-core would +be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit +recipients in-core so far, the first batch of recipients may be +larger than qmgr_message_recipient_minimum - as large as is required +to reach the qmgr_message_recipient_limit limit. + +

+ +

+ +Once the first batch of recipients was read in core and the message +jobs were created, the size of the subsequent recipient batches (if +any - of course it's best when all recipients are read in one batch) +is based solely on the position of the message jobs on their +corresponding transports' job lists. Each transport has a pool of +transport_recipient_limit recipient slots which it can +distribute among its jobs (how this is done is described later). +The subsequent recipient batch may be as large as the sum of all +recipient slots of all jobs of the message permits (plus the +qmgr_message_recipient_minimum amount which always applies). + +

+ +

+ +For example, if a message has three jobs, the first with 1 recipient +still in-core and 4 recipient slots, the second with 5 recipients in-core +and 5 recipient slots, and the third with 2 recipients in-core and 0 +recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 jobs' +recipients slots in total. This means that we could immediately +read 2+qmgr_message_recipient_minimum more recipients of that message +in core. + +

+ +

+ +The above example illustrates several things which might be worth +mentioning explicitly: first, note that although the per-transport +slots are assigned to particular jobs, we can't guarantee that once +the next batch of recipients is read in core, that the corresponding +amounts of recipients will be assigned to those jobs. The jobs lend +its slots to the message as a whole, so it is possible that some +jobs end up sponsoring other jobs of their message. For example, +if in the example above the 2 newly read recipients were assigned +to the second job, the first job sponsored the second job with 2 +slots. The second notable thing is the third job, which has more +recipients in-core than it has slots. Apart from sponsoring by other +job we just saw it can be result of the first recipient batch, which +is sponsored from global recipient pool of qmgr_message_recipient_limit +recipients. It can be also sponsored from the message recipient +pool of qmgr_message_recipient_minimum recipients. + +

+ +

+ +Now how does each transport distribute the recipient slots among +its jobs? The strategy is quite simple. As most scheduler activity +happens on the head of the job list, it is our intention to make +sure that the scheduler has the best estimates of the recipient +counts for those jobs. As we mentioned above, this means that we +want to try to make sure that the messages of those jobs have all +recipients read in-core. Therefore the transport distributes the +slots "along" the job list from start to end. In this case the job +list sorted by message enqueue time is used, because it doesn't +change over time as the scheduler's job list does. + +

+ +

+ +More specifically, each time a job is created and appended to the +job list, it gets all unused recipient slots from its transport's +pool. It keeps them until all recipients of its message are read. +When this happens, all unused recipient slots are transferred to +the next job (which is now in fact the first such job) on the job +list which still has some recipients unread, or eventually back to +the transport pool if there is no such job. Such a transfer then also +happens whenever a recipient entry of that job is delivered. + +

+ +

+ +There is also a scenario when a job is not appended to the end of +the job list (for example it was created as a result of a second or +later recipient batch). Then it works exactly as above, except that +if it was put in front of the first unread job (that is, the job +of a message which still has some unread recipients in the queue file), +that job is first forced to return all of its unused recipient slots +to the transport pool. + +

+ +

+ +The algorithm just described leads to the following state: The first +unread job on the job list always gets all the remaining recipient +slots of that transport (if there are any). The jobs queued before +this job are completely read (that is, all recipients of their +message were already read in core) and have at maximum as many slots +as they still have recipients in-core (the maximum is there because +of the sponsoring mentioned before) and the jobs after this job get +nothing from the transport recipient pool (unless they got something +before and then the first unread job was created and enqueued in +front of them later - in such a case, they also get at maximum as many +slots as they have recipients in-core). + +

+ +

+ +Things work fine in such a state for most of the time, because the +current job is either completely read in-core or has as many recipient +slots as there are, but there is one situation which we still have +to take care of specially. Imagine if the current job is preempted +by some unread job from the job list and there are no more recipient +slots available, so this new current job could read only batches +of qmgr_message_recipient_minimum recipients at a time. This would +really degrade performance. For this reason, each transport has an +extra pool of transport_extra_recipient_limit recipient +slots, dedicated exactly for this situation. Each time an unread +job preempts the current job, it gets half of the remaining recipient +slots from the normal pool and this extra pool. + +

+ +

+ +And that's it. It sure does sound pretty complicated, but fortunately +most people don't really have to care exactly how it works as long +as it works. Perhaps the only important things to know for most +people are the following upper bound formulas: + +

+ +

+ +Each transport has at maximum + +

+ +
+
+max(
+qmgr_message_recipient_minimum * qmgr_message_active_limit
++ *_recipient_limit + *_extra_recipient_limit,
+qmgr_message_recipient_limit
+)
+
+
+ +

+ +recipients in core. + +

+ +

+ +The total amount of recipients in core is + +

+ +
+
+max(
+qmgr_message_recipient_minimum * qmgr_message_active_limit
++ sum( *_recipient_limit + *_extra_recipient_limit ),
+qmgr_message_recipient_limit
+)
+
+
+ +

+ +where the sum is over all used transports. + +

+ +

+ +And this terribly complicated chapter concludes the documentation +of the nqmgr scheduler. + +

+ +

+ +[By now you should theoretically know the nqmgr scheduler +inside out. In practice, you still hope that you will never have +to really understand the last or last two chapters completely, and +fortunately most people really won't. Understanding how the scheduler +works in ideal conditions is more than good enough for the vast majority +of users.] + +

+ +

Credits

+ + + + + + -- cgit v1.2.3