<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
        "https://www.w3.org/TR/html4/loose.dtd">
|
|
|
|
<html>
|
|
|
|
<head>
|
|
|
|
<title>Postfix Bottleneck Analysis</title>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
|
<link rel='stylesheet' type='text/css' href='postfix-doc.css'>
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix Bottleneck Analysis</h1>
|
|
|
|
<hr>
|
|
|
|
<h2>Purpose of this document </h2>
|
|
|
|
<p> This document is an introduction to Postfix queue congestion analysis.
|
|
It explains how the <a href="qshape.1.html">qshape(1)</a> program can help to track down the
|
|
reason for queue congestion. <a href="qshape.1.html">qshape(1)</a> is bundled with Postfix
|
|
2.1 and later source code, under the "auxiliary" directory. This
|
|
document describes <a href="qshape.1.html">qshape(1)</a> as bundled with Postfix 2.4. </p>
|
|
|
|
<p> This document covers the following topics: </p>
|
|
|
|
<ul>
|
|
|
|
<li><a href="#qshape">Introducing the qshape tool</a>
|
|
|
|
<li><a href="#trouble_shooting">Trouble shooting with qshape</a>
|
|
|
|
<li><a href="#healthy">Example 1: Healthy queue</a>
|
|
|
|
<li><a href="#dictionary_bounce">Example 2: Deferred queue full of
|
|
dictionary attack bounces</a></li>
|
|
|
|
<li><a href="#active_congestion">Example 3: Congestion in the active
|
|
queue</a></li>
|
|
|
|
<li><a href="#backlog">Example 4: High volume destination backlog</a>
|
|
|
|
<li><a href="#queues">Postfix queue directories</a>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="#maildrop_queue"> The "maildrop" queue </a>
|
|
|
|
<li> <a href="#hold_queue"> The "hold" queue </a>
|
|
|
|
<li> <a href="#incoming_queue"> The "incoming" queue </a>
|
|
|
|
<li> <a href="#active_queue"> The "active" queue </a>
|
|
|
|
<li> <a href="#deferred_queue"> The "deferred" queue </a>
|
|
|
|
</ul>
|
|
|
|
<li><a href="#credits">Credits</a>
|
|
|
|
</ul>
|
|
|
|
<h2><a name="qshape">Introducing the qshape tool</a></h2>
|
|
|
|
<p> When mail is draining slowly or the queue is unexpectedly large,
|
|
run <a href="qshape.1.html">qshape(1)</a> as the super-user (root) to help zero in on the problem.
|
|
The <a href="qshape.1.html">qshape(1)</a> program displays a tabular view of the Postfix queue
|
|
contents. </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> On the horizontal axis, it displays the queue age with
|
|
fine granularity for recent messages and (geometrically) less fine
|
|
granularity for older messages. </p>
|
|
|
|
<li> <p> The vertical axis displays the destination (or with the
|
|
"-s" switch the sender) domain. Domains with the most messages are
|
|
listed first. </p>
|
|
|
|
</ul>
|
|
|
|
<p> For example, in the output below we see the top 10 lines of
|
|
the (mostly forged) sender domain distribution for captured spam
|
|
in the "<a href="QSHAPE_README.html#hold_queue">hold" queue</a>: </p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape -s hold | head
                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 486  0  0  1  0  0   2   4  20   40   419
             yahoo.com  14  0  0  1  0  0   0   0   1    0    12
  extremepricecuts.net  13  0  0  0  0  0   0   0   2    0    11
        ms35.hinet.net  12  0  0  0  0  0   0   0   0    1    11
      winnersdaily.net  12  0  0  0  0  0   0   0   2    0    10
           hotmail.com  11  0  0  0  0  0   0   0   0    1    10
           worldnet.fr   6  0  0  0  0  0   0   0   0    0     6
        ms41.hinet.net   6  0  0  0  0  0   0   0   0    0     6
                osn.de   5  0  0  0  0  0   1   0   0    0     4
</pre>
|
|
</blockquote>
|
|
|
|
<ul>
|
|
|
|
<li> <p> The "T" column shows the total (in this case sender) count
for each domain. Each numbered column shows the count of messages
that are younger than that many minutes, but not younger than the
age limit of the previous column. The row labeled "TOTAL" shows the
total count for all domains. </p>
|
|
|
|
<li> <p> In this example, there are 14 messages allegedly from
|
|
yahoo.com, 1 between 10 and 20 minutes old, 1 between 320 and 640
|
|
minutes old and 12 older than 1280 minutes (1440 minutes in a day).
|
|
</p>
|
|
|
|
</ul>
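<p> The bucket boundaries in the sample above double from one column to
the next. The following Python sketch reproduces that layout and maps a
message age to its column. The function names are illustrative and are
not part of <a href="qshape.1.html">qshape(1)</a>; the defaults (a 5-minute first bucket and 10
buckets) are assumptions read off the sample header. </p>

```python
def bucket_limits(first=5, buckets=10):
    """Upper age limits, in minutes, of every bucket except the last.

    The last bucket is open-ended ("1280+" in the sample output).
    The defaults are assumptions taken from the sample header.
    """
    return [first * 2 ** i for i in range(buckets - 1)]


def bucket_label(age_minutes, first=5, buckets=10):
    """Return the column heading under which a message of this age counts."""
    limits = bucket_limits(first, buckets)
    for limit in limits:
        # A message is counted in the first column whose limit exceeds its age.
        if age_minutes < limit:
            return str(limit)
    return f"{limits[-1]}+"
```

<p> Under these assumptions, a message that is 15 minutes old is counted
in the "20" column, which agrees with the yahoo.com message described
above as being between 10 and 20 minutes old. </p>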
|
|
|
|
<p> When the output is a terminal, intermediate results showing the top 20
domains (-n option) are displayed after every 1000 messages (-N option),
and the final output also shows only the top 20 domains. This makes
qshape useful even when the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> is so large that
reading it in its entirety would take prohibitively long. </p>
|
|
|
|
<p> By default, qshape shows statistics for the union of the
"<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#active_queue">active" queues</a>, which are the most relevant queues to
look at when analyzing performance. </p>
|
|
|
|
<p> One can request an alternate list of queues: </p>
|
|
|
|
<blockquote>
|
|
<pre>
|
|
$ qshape deferred
|
|
$ qshape incoming active deferred
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p> These commands show the age distribution of the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>, or
of the union of the "<a href="QSHAPE_README.html#incoming_queue">incoming"</a>, "<a href="QSHAPE_README.html#active_queue">active"</a> and "<a href="QSHAPE_README.html#deferred_queue">deferred" queues</a>, respectively. </p>
|
|
|
|
<p> Command line options control the number of display "buckets",
|
|
the age limit for the smallest bucket, display of parent domain
|
|
counts and so on. The "-h" option outputs a summary of the available
|
|
switches. </p>
|
|
|
|
<h2><a name="trouble_shooting">Trouble shooting with qshape</a>
|
|
</h2>
|
|
|
|
<p> Large numbers in the qshape output represent a large number of
|
|
messages that are destined to (or alleged to come from) a particular
|
|
domain. It should be possible to tell at a glance which domains
|
|
dominate the queue sender or recipient counts, approximately when
|
|
a burst of mail started, and when it stopped. </p>
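<p> For scripted monitoring, the same at-a-glance reading can be
approximated by parsing the table. The Python sketch below assumes
whitespace-separated output in the layout of the "hold" queue sample
shown earlier; the function names, the 2.5% share threshold and the
default limit of 20000 are illustrative assumptions, not part of
<a href="qshape.1.html">qshape(1)</a>. </p>

```python
# First rows of the "hold" queue sample shown earlier in this document.
SAMPLE = """\
                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 486  0  0  1  0  0   2   4  20   40   419
             yahoo.com  14  0  0  1  0  0   0   0   1    0    12
  extremepricecuts.net  13  0  0  0  0  0   0   0   2    0    11
        ms35.hinet.net  12  0  0  0  0  0   0   0   0    1    11
"""


def parse_qshape(text):
    """Return {row name: {column heading: count}} for qshape-style output."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    headings = lines[0]                      # "T", "5", "10", ...
    table = {}
    for fields in lines[1:]:
        # The first field is the domain (or "TOTAL"); the rest are counts.
        table[fields[0]] = dict(zip(headings, map(int, fields[1:])))
    return table


def dominant_domains(table, share=0.025):
    """List domains that hold at least `share` of the total message count."""
    total = table["TOTAL"]["T"]
    return [name for name, row in table.items()
            if name != "TOTAL" and row["T"] >= share * total]


def is_saturated(table, limit=20000):
    """Compare the TOTAL count with an assumed queue size limit."""
    return table["TOTAL"]["T"] >= limit
```

<p> The parser expects complete rows only, so intermediate output or
truncation markers would have to be filtered out before parsing. </p>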
|
|
|
|
<p> The problem destinations or sender domains appear near the top
|
|
left corner of the output table. Remember that the "<a href="QSHAPE_README.html#active_queue">active" queue</a>
|
|
can accommodate up to 20000 ($<a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a>) messages.
|
|
To check whether this limit has been reached, use: </p>
|
|
|
|
<blockquote>
|
|
<pre>
|
|
$ qshape -s active <i>(show sender statistics)</i>
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p> If the total sender count is below 20000, the "<a href="QSHAPE_README.html#active_queue">active" queue</a> is
not yet saturated. Any high volume sender domains show near the
top of the output. </p>
|
|
|
|
<p> With <a href="qmgr.8.html">oqmgr(8)</a> the "<a href="QSHAPE_README.html#active_queue">active" queue</a> is also limited to at most 20000
|
|
recipient addresses ($<a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a>). To check for
|
|
exhaustion of this limit use: </p>
|
|
|
|
<blockquote>
|
|
<pre>
|
|
$ qshape active <i>(show recipient statistics)</i>
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p> Having found the high volume domains, it is often useful to
|
|
search the logs for recent messages pertaining to the domains in
|
|
question. </p>
|
|
|
|
<blockquote>
|
|
<pre>
# Find deliveries to example.com
#
$ tail -n 10000 /var/log/maillog |
        grep -E -i ': to=&lt;.*@example\.com&gt;,' |
        less

# Find messages from example.com
#
$ tail -n 10000 /var/log/maillog |
        grep -E -i ': from=&lt;.*@example\.com&gt;,' |
        less
</pre>
|
|
</blockquote>
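<p> When the matching records are needed by a script rather than a
pager, the same search can be done in Python. The sketch below collects
the queue ids of records whose recipient is in a given domain. The
function name and the sample log records are made up for illustration,
and the record layout is assumed to match what the grep patterns above
expect. </p>

```python
import re


def queue_ids_for_domain(log_lines, domain):
    """Return the queue ids of log records with a recipient in `domain`."""
    pattern = re.compile(
        r": ([0-9A-Za-z]+): to=<[^>]*@" + re.escape(domain) + r">,",
        re.IGNORECASE,
    )
    ids = []
    for line in log_lines:
        match = pattern.search(line)
        # Keep the first occurrence of each queue id, in log order.
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


# Made-up log records, for illustration only.
SAMPLE_LOG = [
    "Jan  1 00:00:01 mail postfix/smtp[1234]: 2B2173FF68: "
    "to=<user@example.com>, relay=none, delay=5, status=deferred",
    "Jan  1 00:00:02 mail postfix/smtp[1235]: 3C3284AA79: "
    "to=<user@example.org>, relay=none, delay=7, status=deferred",
]
```

<p> The resulting queue ids can then be used to pull every record for an
individual message out of the log. </p>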
|
|
|
|
<p> You may want to drill in on some specific queue ids: </p>
|
|
|
|
<blockquote>
|
|
<pre>
# Find all messages for a specific queue id.
#
$ tail -n 10000 /var/log/maillog | grep -E ': 2B2173FF68: '
</pre>
|
|
</blockquote>
|
|
|
|
<p> Also look for queue manager warning messages in the log. These
|
|
warnings can suggest strategies to reduce congestion. </p>
|
|
|
|
<blockquote>
|
|
<pre>
|
|
$ grep -E 'qmgr.*(panic|fatal|error|warning):' /var/log/maillog
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p> When all else fails try the Postfix mailing list for help, but
|
|
please don't forget to include the top 10 or 20 lines of <a href="qshape.1.html">qshape(1)</a>
|
|
output. </p>
|
|
|
|
<h2><a name="healthy">Example 1: Healthy queue</a></h2>
|
|
|
|
<p> When looking at just the "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#active_queue">active" queues</a>, under
|
|
normal conditions (no congestion) the "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#active_queue">active" queues</a>
|
|
are nearly empty. Mail leaves the system almost as quickly as it
|
|
comes in or is deferred without congestion in the "<a href="QSHAPE_README.html#active_queue">active" queue</a>.
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape <i>(show "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#active_queue">active" queue</a> status)</i>

                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL   5  0  0  0  1  0   0   0   1    1     2
         meri.uwasa.fi   5  0  0  0  1  0   0   0   1    1     2
</pre>
|
|
</blockquote>
|
|
|
|
<p> If one looks at the two queues separately, the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>
|
|
is empty or perhaps briefly has one or two messages, while the
|
|
"<a href="QSHAPE_README.html#active_queue">active" queue</a> holds more messages and for a somewhat longer time:
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape incoming

                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL   0  0  0  0  0  0   0   0   0    0     0

$ qshape active

                         T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL   5  0  0  0  1  0   0   0   1    1     2
         meri.uwasa.fi   5  0  0  0  1  0   0   0   1    1     2
</pre>
|
|
</blockquote>
|
|
|
|
<h2><a name="dictionary_bounce">Example 2: Deferred queue full of
|
|
dictionary attack bounces</a></h2>
|
|
|
|
<p> This is from a server where recipient validation is not yet
|
|
available for some of the <a href="VIRTUAL_README.html#canonical">hosted domains</a>. Dictionary attacks on
|
|
the unvalidated domains result in bounce backscatter. The bounces
|
|
dominate the queue, but with proper tuning they do not saturate the
|
|
"<a href="QSHAPE_README.html#incoming_queue">incoming"</a> or "<a href="QSHAPE_README.html#active_queue">active" queues</a>. The high volume of deferred mail is not
|
|
a direct cause for alarm. </p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape deferred | head

                          T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 2234  4  2  5  9 31  57 108 201  464  1353
   heyhihellothere.com  207  0  0  1  1  6   6   8  25   68    92
   pleazerzoneprod.com  105  0  0  0  0  0   0   0   5   44    56
        groups.msn.com   63  2  1  2  4  4  14  14  14    8     0
     orion.toppoint.de   49  0  0  0  1  0   2   4   3   16    23
           kali.com.cn   46  0  0  0  0  1   0   2   6   12    25
         meri.uwasa.fi   44  0  0  0  0  1   0   2   8   11    22
     gjr.paknet.com.pk   43  1  0  0  1  1   3   3   6   12    16
  aristotle.algonet.se   41  0  0  0  0  0   1   2  11   12    15
</pre>
|
|
</blockquote>
|
|
|
|
<p> The domains shown are mostly bulk-mailers and all the volume
is at the tail end of the time distribution, showing that short term
arrival rates are moderate. Larger numbers and lower message ages
are more indicative of current trouble. Old mail still going nowhere
is largely harmless so long as the "<a href="QSHAPE_README.html#active_queue">active"</a> and "<a href="QSHAPE_README.html#incoming_queue">incoming" queues</a> are
short. We can also see that the groups.msn.com undeliverables are
a low rate steady stream rather than a concentrated dictionary attack
that is now over. </p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape -s deferred | head

                          T  5 10 20 40 80 160 320 640 1280 1280+
                 TOTAL 2193  4  4  5  8 33  56 104 205  465  1309
         MAILER-DAEMON 1709  4  4  5  8 33  55 101 198  452   849
           example.com  263  0  0  0  0  0   0   0   0    2   261
           example.org  209  0  0  0  0  0   1   3   6   11   188
           example.net    6  0  0  0  0  0   0   0   0    0     6
           example.edu    3  0  0  0  0  0   0   0   0    0     3
           example.gov    2  0  0  0  0  0   0   0   1    0     1
           example.mil    1  0  0  0  0  0   0   0   0    0     1
</pre>
|
|
</blockquote>
|
|
|
|
<p> Looking at the sender distribution, we see that as expected
|
|
most of the messages are bounces. </p>
|
|
|
|
<h2><a name="active_congestion">Example 3: Congestion in the active
|
|
queue</a></h2>
|
|
|
|
<p> This example is taken from a Feb 2004 discussion on the Postfix
|
|
Users list. Congestion was reported with the
|
|
"<a href="QSHAPE_README.html#active_queue">active"</a> and "<a href="QSHAPE_README.html#incoming_queue">incoming" queues</a>
|
|
large and not shrinking despite very large delivery agent
|
|
process limits. The thread is archived at:
|
|
<a href="https://web.archive.org/web/20120227170207/http://archives.neohapsis.com/archives/postfix/2004-02/thread.html#1371">https://web.archive.org/web/20120227170207/http://archives.neohapsis.com/archives/postfix/2004-02/thread.html#1371</a>
|
|
</p>
|
|
|
|
<p> Using an older version of <a href="qshape.1.html">qshape(1)</a> it was quickly determined
|
|
that all the messages were for just a few destinations: </p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape <i>(show "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#active_queue">active" queue</a> status)</i>

                           T    A  5 10 20 40 80 160 320 320+
                 TOTAL 11775 9996  0  0  1  1 42  94 221 1420
  user.sourceforge.net  7678 7678  0  0  0  0  0   0   0    0
 lists.sourceforge.net  2313 2313  0  0  0  0  0   0   0    0
        gzd.gotdns.com   102    0  0  0  0  0  0   0   2  100
</pre>
|
|
</blockquote>
|
|
|
|
<p> The "A" column showed the count of messages in the "<a href="QSHAPE_README.html#active_queue">active" queue</a>,
and the numbered columns showed totals for the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>. At
10000 messages (the Postfix 1.x "<a href="QSHAPE_README.html#active_queue">active" queue</a> size limit) the "<a href="QSHAPE_README.html#active_queue">active" queue</a>
was full. The "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a> was growing rapidly. </p>
|
|
|
|
<p> With the trouble destinations clearly identified, the administrator
|
|
quickly found and fixed the problem. It is substantially harder to
|
|
glean the same information from the logs. While a careful reading
|
|
of <a href="mailq.1.html">mailq(1)</a> output should yield similar results, it is much harder
|
|
to gauge the magnitude of the problem by looking at the queue
|
|
one message at a time. </p>
|
|
|
|
<h2><a name="backlog">Example 4: High volume destination backlog</a></h2>
|
|
|
|
<p> When a site you send a lot of email to is down or slow, mail
|
|
messages will rapidly build up in the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>, or worse, in
|
|
the "<a href="QSHAPE_README.html#active_queue">active" queue</a>. The qshape output will show large numbers for
|
|
the destination domain in all age buckets that overlap the starting
|
|
time of the problem: </p>
|
|
|
|
<blockquote>
|
|
<pre>
$ qshape deferred | head

                          T   5  10  20  40   80  160 320 640 1280 1280+
                 TOTAL 5000 200 200 400 800 1600 1000 200 200  200   200
        highvolume.com 4000 160 160 320 640 1280 1440   0   0    0     0
                   ...
</pre>
|
|
</blockquote>
|
|
|
|
<p> Here the "highvolume.com" destination is continuing to accumulate
|
|
deferred mail. The "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#active_queue">active" queues</a> are fine, but the
"<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> started growing some time between 80 and 160 minutes ago
|
|
and continues to grow. </p>
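<p> The reasoning used here, finding the oldest age bucket that still
holds mail for the destination, can be written down as a small Python
sketch. The function name is illustrative, and the heuristic assumes
that the destination had no older deferred mail before the problem
began. </p>

```python
# Column headings (minutes) and the highvolume.com counts from the table above.
LIMITS = [5, 10, 20, 40, 80, 160, 320, 640, 1280]
HIGHVOLUME = [160, 160, 320, 640, 1280, 1440, 0, 0, 0, 0]


def backlog_window(limits, counts):
    """Return (min_age, max_age) in minutes of the oldest non-empty bucket.

    max_age is None for the open-ended last bucket. The result is None
    when every bucket is empty.
    """
    occupied = [i for i, n in enumerate(counts) if n > 0]
    if not occupied:
        return None
    oldest = occupied[-1]
    lower = limits[oldest - 1] if oldest > 0 else 0
    upper = limits[oldest] if oldest < len(limits) else None
    return lower, upper
```

<p> For the highvolume.com row the oldest occupied bucket is the "160"
column, so the oldest deferred message is between 80 and 160 minutes
old. </p>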
|
|
|
|
<p> If the high volume destination is not down, but is instead
|
|
slow, one might see similar congestion in the "<a href="QSHAPE_README.html#active_queue">active" queue</a>.
|
|
"<a href="QSHAPE_README.html#active_queue">Active" queue</a> congestion is a greater cause for alarm; one might need to
|
|
take measures to ensure that the mail is deferred instead or even
|
|
add an <a href="access.5.html">access(5)</a> rule asking the sender to try again later. </p>
|
|
|
|
<p> If a high volume destination exhibits frequent bursts of consecutive
connections refused by all MX hosts or "421 Server busy" errors, it
is possible for the queue manager to mark the destination as "dead"
despite the transient nature of the errors. The destination will be
retried after the expiration of a $<a href="postconf.5.html#minimal_backoff_time">minimal_backoff_time</a> timer.
|
|
If the error bursts are frequent enough it may be that only a small
|
|
quantity of email is delivered before the destination is again marked
|
|
"dead". In some cases enabling static (not on demand) connection
|
|
caching by listing the appropriate nexthop domain in a table included in
|
|
"<a href="postconf.5.html#smtp_connection_cache_destinations">smtp_connection_cache_destinations</a>" may help to reduce the error rate,
|
|
because most messages will re-use existing connections. </p>
|
|
|
|
<p> The MTA that has been observed most frequently to exhibit such
|
|
bursts of errors is Microsoft Exchange, which refuses connections
|
|
under load. Some proxy virus scanners in front of the Exchange
|
|
server propagate the refused connection to the client as a "421"
|
|
error. </p>
|
|
|
|
<p> Note that it is now possible to configure Postfix to exhibit similarly
erratic behavior by misconfiguring the <a href="anvil.8.html">anvil(8)</a> service. Do not use
<a href="anvil.8.html">anvil(8)</a> for steady-state rate limiting; its purpose is (unintentional)
DoS prevention and the rate limits set should be very generous! </p>
|
|
|
|
<p> If one finds oneself needing to deliver a high volume of mail to a
|
|
destination that exhibits frequent brief bursts of errors and connection
|
|
caching does not solve the problem, there is a subtle workaround. </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> Postfix version 2.5 and later: </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> In <a href="master.5.html">master.cf</a> set up a dedicated clone of the "smtp" transport
|
|
for the destination in question. In the example below we will call
|
|
it "fragile". </p>
|
|
|
|
<li> <p> In <a href="master.5.html">master.cf</a> configure a reasonable process limit for the
|
|
cloned smtp transport (a number in the 10-20 range is typical). </p>
|
|
|
|
<li> <p> IMPORTANT!!! In <a href="postconf.5.html">main.cf</a> configure a large per-destination
|
|
pseudo-cohort failure limit for the cloned smtp transport. </p>
|
|
|
|
<pre>
/etc/postfix/<a href="postconf.5.html">main.cf</a>:
    <a href="postconf.5.html#transport_maps">transport_maps</a> = <a href="DATABASE_README.html#types">hash</a>:/etc/postfix/transport
    fragile_destination_concurrency_failed_cohort_limit = 100
    fragile_destination_concurrency_limit = 20

/etc/postfix/transport:
    example.com  fragile:

/etc/postfix/<a href="master.5.html">master.cf</a>:
    # service type  private unpriv  chroot  wakeup  maxproc command
    fragile   unix     -       -       n       -      20    smtp
</pre>
|
|
|
|
<p> See also the documentation for
|
|
<a href="postconf.5.html#default_destination_concurrency_failed_cohort_limit">default_destination_concurrency_failed_cohort_limit</a> and
|
|
<a href="postconf.5.html#default_destination_concurrency_limit">default_destination_concurrency_limit</a>. </p>
|
|
|
|
</ul>
|
|
|
|
<li> <p> Earlier Postfix versions: </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> In <a href="master.5.html">master.cf</a> set up a dedicated clone of the "smtp"
|
|
transport for the destination in question. In the example below
|
|
we will call it "fragile". </p>
|
|
|
|
<li> <p> In <a href="master.5.html">master.cf</a> configure a reasonable process limit for the
|
|
transport (a number in the 10-20 range is typical). </p>
|
|
|
|
<li> <p> IMPORTANT!!! In <a href="postconf.5.html">main.cf</a> configure a very large initial
|
|
and destination concurrency limit for this transport (say 2000). </p>
|
|
|
|
<pre>
/etc/postfix/<a href="postconf.5.html">main.cf</a>:
    <a href="postconf.5.html#transport_maps">transport_maps</a> = <a href="DATABASE_README.html#types">hash</a>:/etc/postfix/transport
    <a href="postconf.5.html#initial_destination_concurrency">initial_destination_concurrency</a> = 2000
    fragile_destination_concurrency_limit = 2000

/etc/postfix/transport:
    example.com  fragile:

/etc/postfix/<a href="master.5.html">master.cf</a>:
    # service type  private unpriv  chroot  wakeup  maxproc command
    fragile   unix     -       -       n       -      20    smtp
</pre>
|
|
|
|
<p> See also the documentation for <a href="postconf.5.html#default_destination_concurrency_limit">default_destination_concurrency_limit</a>.
|
|
</p>
|
|
|
|
</ul>
|
|
|
|
</ul>
|
|
|
|
<p> The effect of this configuration is that a long run of consecutive
errors is tolerated without marking the destination dead (up to 2000
with the settings for earlier Postfix versions; with Postfix 2.5 and
later the failed cohort limit of 100 serves the same purpose), while
the total concurrency remains reasonable (10-20 processes). This trick
is only for a very specialized situation: high volume delivery into a
channel with multi-error bursts that is capable of high throughput,
but is repeatedly throttled by the bursts of errors. </p>
|
|
|
|
<p> When a destination is unable to handle the load even after the
|
|
Postfix process limit is reduced to 1, a desperate measure is to
|
|
insert brief delays between delivery attempts. </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> Postfix version 2.5 and later: </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> In <a href="master.5.html">master.cf</a> set up a dedicated clone of the "smtp" transport
|
|
for the problem destination. In the example below we call it "slow".
|
|
</p>
|
|
|
|
<li> <p> In <a href="postconf.5.html">main.cf</a> configure a short delay between deliveries to
|
|
the same destination. </p>
|
|
|
|
<pre>
/etc/postfix/<a href="postconf.5.html">main.cf</a>:
    <a href="postconf.5.html#transport_maps">transport_maps</a> = <a href="DATABASE_README.html#types">hash</a>:/etc/postfix/transport
    slow_destination_rate_delay = 1
    slow_destination_concurrency_failed_cohort_limit = 100

/etc/postfix/transport:
    example.com  slow:

/etc/postfix/<a href="master.5.html">master.cf</a>:
    # service type  private unpriv  chroot  wakeup  maxproc command
    slow      unix     -       -       n       -       -    smtp
</pre>
|
|
|
|
</ul>
|
|
|
|
<p> See also the documentation for <a href="postconf.5.html#default_destination_rate_delay">default_destination_rate_delay</a>. </p>
|
|
|
|
<p> This solution forces the Postfix <a href="smtp.8.html">smtp(8)</a> client to wait for
|
|
$slow_destination_rate_delay seconds between deliveries to the same
|
|
destination. </p>
|
|
|
|
<p> IMPORTANT!! The large slow_destination_concurrency_failed_cohort_limit
|
|
value is needed. This prevents Postfix from deferring all mail for
|
|
the same destination after only one connection or handshake error
|
|
(the reason for this is that non-zero slow_destination_rate_delay
|
|
forces a per-destination concurrency of 1). </p>
|
|
|
|
<li> <p> Earlier Postfix versions: </p>
|
|
|
|
<ul>
|
|
|
|
<li> <p> In the transport map entry for the problem destination,
|
|
specify a dead host as the primary nexthop. </p>
|
|
|
|
<li> <p> In the <a href="master.5.html">master.cf</a> entry for the transport specify the
|
|
problem destination as the <a href="postconf.5.html#fallback_relay">fallback_relay</a> and specify a small
|
|
<a href="postconf.5.html#smtp_connect_timeout">smtp_connect_timeout</a> value. </p>
|
|
|
|
<pre>
/etc/postfix/<a href="postconf.5.html">main.cf</a>:
    <a href="postconf.5.html#transport_maps">transport_maps</a> = <a href="DATABASE_README.html#types">hash</a>:/etc/postfix/transport

/etc/postfix/transport:
    example.com  slow:[dead.host]

/etc/postfix/<a href="master.5.html">master.cf</a>:
    # service type  private unpriv  chroot  wakeup  maxproc command
    slow      unix     -       -       n       -       1    smtp
        -o <a href="postconf.5.html#fallback_relay">fallback_relay</a>=problem.example.com
        -o <a href="postconf.5.html#smtp_connect_timeout">smtp_connect_timeout</a>=1
        -o <a href="postconf.5.html#smtp_connection_cache_on_demand">smtp_connection_cache_on_demand</a>=no
</pre>
|
|
|
|
</ul>
|
|
|
|
<p> This solution forces the Postfix <a href="smtp.8.html">smtp(8)</a> client to wait for
|
|
$<a href="postconf.5.html#smtp_connect_timeout">smtp_connect_timeout</a> seconds between deliveries. The connection
|
|
caching feature is disabled to prevent the client from skipping
|
|
over the dead host. </p>
|
|
|
|
</ul>
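<p> Either variant caps throughput to the problem destination, because
only one delivery is in progress at a time and each one is followed by
a pause. The Python sketch below gives a rough upper bound under that
simplified model. It is not a description of the actual Postfix
scheduler, and the function name and the 0.5 second delivery time are
hypothetical. </p>

```python
def max_hourly_deliveries(pause_s, delivery_time_s=0.0):
    """Rough upper bound on deliveries per hour to one destination.

    Assumes one delivery at a time, each followed by a pause of
    pause_s seconds (a simplified model, not the Postfix scheduler).
    """
    return int(3600 / (pause_s + delivery_time_s))


# A 1 second pause alone allows at most 3600 deliveries per hour.
ceiling = max_hourly_deliveries(1)
# If each delivery also takes a hypothetical 0.5 seconds, the bound drops.
slower = max_hourly_deliveries(1, 0.5)
```

<p> If the expected mail volume for the destination exceeds this bound,
the backlog will keep growing no matter how the delay is tuned. </p>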
|
|
|
|
<h2><a name="queues">Postfix queue directories</a></h2>
|
|
|
|
<p> The following sections describe Postfix queues: their purpose,
|
|
what normal behavior looks like, and how to diagnose abnormal
|
|
behavior. </p>
|
|
|
|
<h3> <a name="maildrop_queue"> The "maildrop" queue </a> </h3>
|
|
|
|
<p> Messages that have been submitted via the Postfix <a href="sendmail.1.html">sendmail(1)</a>
|
|
command, but not yet brought into the main Postfix queue by the
|
|
<a href="pickup.8.html">pickup(8)</a> service, await processing in the "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a>. Messages
|
|
can be added to the "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a> even when the Postfix system
|
|
is not running. They will begin to be processed once Postfix is
|
|
started. </p>
|
|
|
|
<p> The "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a> is drained by the single threaded <a href="pickup.8.html">pickup(8)</a>
|
|
service scanning the queue directory periodically or when notified
|
|
of new message arrival by the <a href="postdrop.1.html">postdrop(1)</a> program. The <a href="postdrop.1.html">postdrop(1)</a>
|
|
program is a setgid helper that allows the unprivileged Postfix
|
|
<a href="sendmail.1.html">sendmail(1)</a> program to inject mail into the "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a> and
|
|
to notify the <a href="pickup.8.html">pickup(8)</a> service of its arrival. </p>
|
|
|
|
<p> All mail that enters the main Postfix queue does so via the
|
|
<a href="cleanup.8.html">cleanup(8)</a> service. The cleanup service is responsible for envelope
|
|
and header rewriting, header and body regular expression checks,
|
|
automatic bcc recipient processing, milter content processing, and
|
|
reliable insertion of the message into the Postfix "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>. </p>
|
|
|
|
<p> In the absence of excessive CPU consumption in <a href="cleanup.8.html">cleanup(8)</a> header
|
|
or body regular expression checks or other software consuming all
|
|
available CPU resources, Postfix performance is disk I/O bound.
|
|
The rate at which the <a href="pickup.8.html">pickup(8)</a> service can inject messages into
|
|
the queue is largely determined by disk access times, since the
|
|
<a href="cleanup.8.html">cleanup(8)</a> service must commit the message to stable storage before
|
|
returning success. The same is true of the <a href="postdrop.1.html">postdrop(1)</a> program
|
|
writing the message to the "maildrop" directory. </p>
|
|
|
|
<p> As the pickup service is single threaded, it can only deliver
|
|
one message at a time at a rate that does not exceed the reciprocal
|
|
disk I/O latency (+ CPU if not negligible) of the cleanup service.
|
|
</p>
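<p> As a worked example of this bound, the Python sketch below computes
the reciprocal of the per-message cost. The function name and the
latency figures are hypothetical and are not measurements of any real
system. </p>

```python
def max_pickup_rate(io_latency_s, cpu_s=0.0):
    """Messages per second that a single-threaded service can hand off.

    The bound is the reciprocal of the time spent per message: disk
    I/O latency plus CPU time, when the latter is not negligible.
    """
    return 1.0 / (io_latency_s + cpu_s)


# A hypothetical 10 ms commit latency allows about 100 messages per second.
disk_bound = max_pickup_rate(0.010)
# Another hypothetical 10 ms of CPU time per message halves that rate.
with_checks = max_pickup_rate(0.010, 0.010)
```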
|
|
|
|
<p> Congestion in this queue is indicative of an excessive local message
|
|
submission rate or perhaps excessive CPU consumption in the <a href="cleanup.8.html">cleanup(8)</a>
|
|
service due to excessive <a href="postconf.5.html#body_checks">body_checks</a>, or (Postfix ≥ 2.3) high latency
|
|
milters. </p>
|
|
|
|
<p> Note that once the "<a href="QSHAPE_README.html#active_queue">active" queue</a> is full, the cleanup service
will attempt to slow down message injection by pausing $<a href="postconf.5.html#in_flow_delay">in_flow_delay</a>
for each message. In this case "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a> congestion may be
a consequence of congestion downstream, rather than a problem in
its own right. </p>
|
|
|
|
<p> Note, you should not attempt to deliver large volumes of mail via
|
|
the <a href="pickup.8.html">pickup(8)</a> service. High volume sites should avoid using "simple"
|
|
content filters that re-inject scanned mail via Postfix <a href="sendmail.1.html">sendmail(1)</a>
|
|
and <a href="postdrop.1.html">postdrop(1)</a>. </p>
|
|
|
|
<p> A high arrival rate of locally submitted mail may be an indication
|
|
of an uncaught forwarding loop, or a run-away notification program.
|
|
Try to keep the volume of local mail injection to a moderate level.
|
|
</p>
|
|
|
|
<p> The "postsuper -r" command can place selected messages into
|
|
the "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a> for reprocessing. This is most useful for
|
|
resetting any stale <a href="postconf.5.html#content_filter">content_filter</a> settings. Requeuing a large number
|
|
of messages using "postsuper -r" can clearly cause a spike in the
|
|
size of the "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a>. </p>
|
|
|
|
<h3> <a name="hold_queue"> The "hold" queue </a> </h3>
|
|
|
|
<p> The administrator can define "smtpd" <a href="access.5.html">access(5)</a> policies, or
|
|
<a href="cleanup.8.html">cleanup(8)</a> header/body checks that cause messages to be automatically
|
|
diverted from normal processing and placed indefinitely in the
|
|
"<a href="QSHAPE_README.html#hold_queue">hold" queue</a>. Messages placed in the "<a href="QSHAPE_README.html#hold_queue">hold" queue</a> stay there until
|
|
the administrator intervenes. No periodic delivery attempts are
|
|
made for messages in the "<a href="QSHAPE_README.html#hold_queue">hold" queue</a>. The <a href="postsuper.1.html">postsuper(1)</a> command
|
|
can be used to manually release messages into the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>.
|
|
</p>
|
|
|
|
<p> Messages can potentially stay in the "<a href="QSHAPE_README.html#hold_queue">hold" queue</a> longer than
|
|
$<a href="postconf.5.html#maximal_queue_lifetime">maximal_queue_lifetime</a>. If such "old" messages need to be released from
|
|
the "<a href="QSHAPE_README.html#hold_queue">hold" queue</a>, they should typically be moved into the "<a href="QSHAPE_README.html#maildrop_queue">maildrop" queue</a>
|
|
using "postsuper -r", so that the message gets a new timestamp and
|
|
is given more than one opportunity to be delivered. Messages that are
|
|
"young" can be moved directly into the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> using
|
|
"postsuper -H". </p>
|
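<p> The choice between the two release commands can be sketched in
Python as follows. The function name is illustrative, and the cut-off
between "old" and "young" (the message age compared with a lifetime of
5 days, the usual default for $<a href="postconf.5.html#maximal_queue_lifetime">maximal_queue_lifetime</a>) is an assumption;
pick the threshold that matches your own configuration. </p>

```python
def release_command(queue_id, age_days, lifetime_days=5.0):
    """Build the postsuper(1) command line for releasing a held message.

    Messages older than the queue lifetime are requeued with "-r" so
    that they get a new timestamp; younger ones are released with "-H".
    The threshold is an assumption, not a rule enforced by Postfix.
    """
    if age_days > lifetime_days:
        return ["postsuper", "-r", queue_id]
    return ["postsuper", "-H", queue_id]
```

<p> The returned list can be passed to a process-spawning call such as
subprocess.run(), which needs to run with super-user privileges. </p>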
|
|
|
<p> The "<a href="QSHAPE_README.html#hold_queue">hold" queue</a> plays little role in Postfix performance;
monitoring of the "<a href="QSHAPE_README.html#hold_queue">hold" queue</a> is typically motivated more
by tracking spam and malware than by performance issues. </p>
|
|
|
|
<h3> <a name="incoming_queue"> The "incoming" queue </a> </h3>
|
|
|
|
<p> All new mail entering the Postfix queue is written by the
|
|
<a href="cleanup.8.html">cleanup(8)</a> service into the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>. New queue files are
|
|
created owned by the "postfix" user with an access bitmask (or
|
|
mode) of 0600. Once a queue file is ready for further processing
|
|
the <a href="cleanup.8.html">cleanup(8)</a> service changes the queue file mode to 0700 and
|
|
notifies the queue manager of new mail arrival. The queue manager
|
|
ignores incomplete queue files whose mode is 0600, as these are
|
|
still being written by cleanup. </p>
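<p> The mode convention described above makes it possible to tell
incomplete queue files from ready ones by inspecting file permissions.
The Python sketch below counts both kinds. The function name is
illustrative, and the queue location (for example
/var/spool/postfix/incoming) is an assumption that depends on the local
$queue_directory setting; reading the directory normally requires
super-user privileges. </p>

```python
import os
import stat


def classify_queue_files(queue_dir):
    """Count queue files by mode: 0600 (incomplete) versus 0700 (ready)."""
    counts = {"incomplete": 0, "ready": 0, "other": 0}
    for root, _dirs, files in os.walk(queue_dir):
        for name in files:
            try:
                mode = stat.S_IMODE(os.lstat(os.path.join(root, name)).st_mode)
            except FileNotFoundError:
                continue  # the file was moved or removed during the scan
            if mode == 0o600:
                counts["incomplete"] += 1
            elif mode == 0o700:
                counts["ready"] += 1
            else:
                counts["other"] += 1
    return counts
```

<p> A persistently large "ready" count suggests that the queue manager
is not keeping up with new mail. </p>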
|
|
|
|
<p> The queue manager scans the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>, bringing any new
mail into the "<a href="QSHAPE_README.html#active_queue">active" queue</a> if the "<a href="QSHAPE_README.html#active_queue">active" queue</a> resource limits
have not been exceeded. By default, the "<a href="QSHAPE_README.html#active_queue">active" queue</a> accommodates
at most 20000 messages (<a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a>). Once the "<a href="QSHAPE_README.html#active_queue">active" queue</a> message limit is
reached, the queue manager stops scanning the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>
(and the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>, see below). </p>

<p> Under normal conditions the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a> is nearly empty (has
only mode 0600 files), with the queue manager able to import new
messages into the "<a href="QSHAPE_README.html#active_queue">active" queue</a> as soon as they become available.
</p>

<p> The "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a> grows when the message input rate spikes
above the rate at which the queue manager can import messages into
the "<a href="QSHAPE_README.html#active_queue">active" queue</a>. The main factors slowing down the queue manager
are disk I/O and lookup queries to the trivial-rewrite service. If the queue
manager is routinely not keeping up, consider not using "slow"
lookup services (MySQL, LDAP, ...) for transport lookups, or speeding
up the hosts that provide the lookup service. If the problem is I/O
starvation, consider striping the queue over more disks, using faster controllers
with a battery-backed write cache, or making other hardware improvements. At the very
least, make sure that the queue directory is mounted with the "noatime"
option if applicable to the underlying filesystem. </p>

<p> The <a href="postconf.5.html#in_flow_delay">in_flow_delay</a> parameter is used to clamp the input rate
when the queue manager starts to fall behind. The <a href="cleanup.8.html">cleanup(8)</a> service
will pause for $<a href="postconf.5.html#in_flow_delay">in_flow_delay</a> seconds before creating a new queue
file if it cannot obtain a "token" from the queue manager. </p>

<p> Since the number of <a href="cleanup.8.html">cleanup(8)</a> processes is limited in most
cases by the SMTP server concurrency, the input rate can exceed
the output rate by at most "SMTP connection count" / $<a href="postconf.5.html#in_flow_delay">in_flow_delay</a>
messages per second. </p>

<p> With a default process limit of 100, and an <a href="postconf.5.html#in_flow_delay">in_flow_delay</a> of
1s, the coupling is strong enough to limit a single run-away injector
to 1 message per second, but is not strong enough to deflect an
excessive input rate from many sources at the same time. </p>

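
<p> The bound above is simple arithmetic. The sketch below works it
through for a few cases; the function name max_excess_rate is
hypothetical and only restates the "SMTP connection count" /
$in_flow_delay formula from the text. </p>

<blockquote>
<pre>
```python
def max_excess_rate(smtp_connections, in_flow_delay):
    """Upper bound, in messages per second, by which input can exceed output.

    smtp_connections: number of concurrent SMTP server connections
    in_flow_delay:    the in_flow_delay setting, in seconds
    """
    return smtp_connections / in_flow_delay


print(max_excess_rate(100, 1))   # 100 connections, 1 s delay: 100.0
print(max_excess_rate(100, 10))  # same load, delay raised to 10 s: 10.0
print(max_excess_rate(1, 1))     # a single run-away injector: 1.0
```
</pre>
</blockquote>

<p> The first and last cases show why the default setting contains one
injector but not many sources at once. </p>
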
<p> If a server is being hammered from multiple directions, consider
raising the <a href="postconf.5.html#in_flow_delay">in_flow_delay</a> to 10 seconds, but only if the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>
is growing even while the "<a href="QSHAPE_README.html#active_queue">active" queue</a> is not full and the
trivial-rewrite service is using a fast transport lookup mechanism.
</p>

<h3> <a name="active_queue"> The "active" queue </a> </h3>

<p> The queue manager is a delivery agent scheduler; it works to
ensure fast and fair delivery of mail to all destinations within
designated resource limits. </p>

<p> The "<a href="QSHAPE_README.html#active_queue">active" queue</a> is somewhat analogous to an operating system's
process run queue. Messages in the "<a href="QSHAPE_README.html#active_queue">active" queue</a> are ready to be
sent (runnable), but are not necessarily in the process of being
sent (running). </p>

<p> While most Postfix administrators think of the "<a href="QSHAPE_README.html#active_queue">active" queue</a>
as a directory on disk, the real "<a href="QSHAPE_README.html#active_queue">active" queue</a> is a set of data
structures in the memory of the queue manager process. </p>

<p> Messages in the "<a href="QSHAPE_README.html#maildrop_queue">maildrop"</a>, "<a href="QSHAPE_README.html#hold_queue">hold"</a>, "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> and "<a href="QSHAPE_README.html#deferred_queue">deferred" queues</a>
(see below) do not occupy memory; they are safely stored on
disk waiting for their turn to be processed. The envelope information
for messages in the "<a href="QSHAPE_README.html#active_queue">active" queue</a> is managed in memory, allowing
the queue manager to do global scheduling, allocating available
delivery agent processes to an appropriate message in the "<a href="QSHAPE_README.html#active_queue">active" queue</a>. </p>

<p> Within the "<a href="QSHAPE_README.html#active_queue">active" queue</a>, (multi-recipient) messages are broken
up into groups of recipients that share the same transport/nexthop
combination; the group size is capped by the transport's recipient
limit (<i>transport</i>_destination_recipient_limit). </p>

<p> Multiple recipient groups (from one or more messages) are queued
for delivery grouped by transport/nexthop combination. The
<b>destination</b> concurrency limit for the transports caps the number
of simultaneous delivery attempts for each nexthop. Transports with
a <b>recipient</b> limit (<i>transport</i>_destination_recipient_limit) of 1 are special: these are grouped
by the actual recipient address rather than the nexthop, yielding
per-recipient concurrency limits rather than per-domain
concurrency limits. Per-recipient limits are appropriate when
performing final delivery to mailboxes rather than when relaying
to a remote server. </p>

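
<p> The grouping rules above can be pictured with a small sketch. It
is an illustration only, not the queue manager's actual data
structures: the function name group_recipients and the (transport,
nexthop, address) tuple layout are assumptions made for the example.
</p>

<blockquote>
<pre>
```python
from collections import defaultdict


def group_recipients(recipients, recipient_limit):
    """Split (transport, nexthop, address) tuples into delivery groups.

    Recipients sharing a transport/nexthop pair are grouped together and
    each group holds at most `recipient_limit` addresses. With a limit of
    1 the recipient address replaces the nexthop as the grouping key.
    """
    buckets = defaultdict(list)
    for transport, nexthop, address in recipients:
        key = (transport, address if recipient_limit == 1 else nexthop)
        buckets[key].append(address)

    groups = []
    for key, addresses in buckets.items():
        # Cut each bucket into chunks no larger than the recipient limit.
        for start in range(0, len(addresses), recipient_limit):
            groups.append((key, addresses[start:start + recipient_limit]))
    return groups


mail = [
    ("smtp", "example.com", "a@example.com"),
    ("smtp", "example.com", "b@example.com"),
    ("smtp", "example.com", "c@example.com"),
    ("smtp", "example.org", "d@example.org"),
]
for key, addresses in group_recipients(mail, 2):
    print(key, addresses)
```
</pre>
</blockquote>

<p> With a limit of 2 the three example.com recipients form two groups
under the same nexthop; with a limit of 1 every recipient would get
its own key, which is what turns the destination concurrency limit
into a per-recipient limit. </p>
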
<p> Congestion occurs in the "<a href="QSHAPE_README.html#active_queue">active" queue</a> when one or more destinations
drain slower than the corresponding message input rate. </p>

<p> Input into the "<a href="QSHAPE_README.html#active_queue">active" queue</a> comes both from new mail in the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>
and from retries of mail in the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>. Should the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>
get really large, retries of old mail can dominate the arrival
rate of new mail. Systems with more CPU, faster disks and more network
bandwidth can deal with larger "<a href="QSHAPE_README.html#deferred_queue">deferred" queues</a>, but as a rule of thumb
the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> scales to somewhere between 100,000 and 1,000,000
messages, with good performance unlikely above that "limit". Systems with
queues this large should typically stop accepting new mail, or put the
backlog "on hold" until the underlying issue is fixed (provided that
there is enough capacity to handle just the new mail). </p>

<p> When a destination is down for some time, the queue manager will
mark it dead, and immediately defer all mail for the destination without
trying to assign it to a delivery agent. In this case the messages
will quickly leave the "<a href="QSHAPE_README.html#active_queue">active" queue</a> and end up in the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>
(with Postfix &lt; 2.4, this is done directly by the queue manager,
with Postfix ≥ 2.4 this is done via the "retry" delivery agent). </p>

<p> When the destination is instead simply slow, or there is a problem
causing an excessive arrival rate, the "<a href="QSHAPE_README.html#active_queue">active" queue</a> will grow and will
become dominated by mail to the congested destination. </p>

<p> The only way to reduce congestion is to either reduce the input
rate or increase the throughput. Increasing the throughput requires
either increasing the concurrency or reducing the latency of
deliveries. </p>

<p> For high volume sites a key tuning parameter is the number of
"smtp" delivery agents allocated to the "smtp" and "relay" transports.
High volume sites tend to send to many different destinations, many
of which may be down or slow, so a good fraction of the available
delivery agents will be blocked waiting for slow sites. Also mail
destined across the globe will incur large SMTP command-response
latencies, so high message throughput can only be achieved with
more concurrent delivery agents. </p>

<p> The default "smtp" process limit of 100 is good enough for most
sites, and may even need to be lowered for sites with low bandwidth
connections (no use increasing concurrency once the network pipe
is full). When one finds that the queue is growing on an "idle"
system (CPU, disk I/O and network not exhausted) the remaining
reason for congestion is insufficient concurrency in the face of
a high average latency. If the number of outbound SMTP connections
(either ESTABLISHED or SYN_SENT) reaches the process limit, mail
is draining slowly and the system and network are not loaded, raise
the "smtp" and/or "relay" process limits! </p>

<p> When a high volume destination is served by multiple MX hosts with
typically low delivery latency, performance can suffer dramatically when
one of the MX hosts is unresponsive and SMTP connections to that host
time out. For example, if there are 2 equal-weight MX hosts, the SMTP
connection timeout is 30 seconds and one of the MX hosts is down, the
average SMTP connection will take approximately 15 seconds to complete.
With a default per-destination concurrency limit of 20 connections,
throughput falls to just over 1 message per second. </p>

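
<p> The figures in this example follow from two divisions. The sketch
below reproduces them under the same simplifying assumptions as the
text: MX hosts are chosen with equal probability, and the time spent
on a successful connection is negligible next to the timeout. The
function names are hypothetical. </p>

<blockquote>
<pre>
```python
def average_connect_time(mx_hosts, mx_down, connect_timeout):
    """Average seconds per connection when `mx_down` of `mx_hosts` time out."""
    return connect_timeout * mx_down / mx_hosts


def throughput(concurrency, latency):
    """Messages per second sustained by `concurrency` parallel deliveries."""
    return concurrency / latency


latency = average_connect_time(mx_hosts=2, mx_down=1, connect_timeout=30)
print(latency)                            # 15.0 seconds
print(round(throughput(20, latency), 2))  # 1.33 messages per second
```
</pre>
</blockquote>

<p> Lowering the timeout or raising the concurrency changes the result
proportionally, which is the basis for the tuning advice that
follows. </p>
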
<p> The best way to avoid bottlenecks when one or more MX hosts are
non-responsive is to use connection caching. Connection caching was
introduced with Postfix 2.2 and is by default enabled on demand for
destinations with a backlog of mail in the "<a href="QSHAPE_README.html#active_queue">active" queue</a>. When connection
caching is in effect for a particular destination, established connections
are re-used to send additional messages. This reduces the number of
connections made per message delivery and maintains good throughput even
in the face of partial unavailability of the destination's MX hosts. </p>

<p> If connection caching is not available (Postfix &lt; 2.2) or does
not provide a sufficient latency reduction, especially for the "relay"
transport used to forward mail to "your own" domains, consider setting
lower than default SMTP connection timeouts (1-5 seconds) and higher
than default destination concurrency limits. This will further reduce
latency and provide more concurrency to maintain throughput should
latency rise. </p>

<p> Setting high concurrency limits to domains that are not your own may
be viewed as hostile by the receiving system, and steps may be taken
to prevent you from monopolizing the destination system's resources.
The defensive measures may substantially reduce your throughput or block
access entirely. Do not set aggressive concurrency limits to remote
domains without coordinating with the administrators of the target
domain. </p>

<p> If necessary, dedicate and tune custom transports for selected high
volume destinations. The "relay" transport is provided for forwarding mail
to domains for which your server is a primary or backup MX host. These can
make up a substantial fraction of your email traffic. Use the "relay" and
not the "smtp" transport to send email to these domains. Using the "relay"
transport allocates a separate delivery agent pool to these destinations
and allows separate tuning of timeouts and concurrency limits. </p>

<p> Another common cause of congestion is unwarranted flushing of the
entire "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>. The "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> holds messages that are likely
to fail to be delivered and are also likely to be slow to fail delivery
(time out). As a result the most common reaction to a large "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>
(flush it!) is more than likely counter-productive, and typically makes
the congestion worse. Do not flush the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> unless you expect
that most of its content has recently become deliverable (e.g. <a href="postconf.5.html#relayhost">relayhost</a>
back up after an outage)! </p>

<p> Note that whenever the queue manager is restarted, there may
already be messages in the "<a href="QSHAPE_README.html#active_queue">active" queue</a> directory, but the "real"
"<a href="QSHAPE_README.html#active_queue">active" queue</a> in memory is empty. In order to recover the in-memory
state, the queue manager moves all the "<a href="QSHAPE_README.html#active_queue">active" queue</a> messages
back into the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>, and then uses its normal "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>
scan to refill the "<a href="QSHAPE_README.html#active_queue">active" queue</a>. The process of moving all
the messages back and forth, redoing transport table (<a href="trivial-rewrite.8.html">trivial-rewrite(8)</a>
resolve service) lookups, and re-importing the messages back into
memory is expensive. At all costs, avoid frequent restarts of the
queue manager (e.g. via frequent execution of "postfix reload"). </p>

<h3> <a name="deferred_queue"> The "deferred" queue </a> </h3>

<p> When all the deliverable recipients for a message are delivered,
and for some recipients delivery failed for a transient reason (it
might succeed later), the message is placed in the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>.
</p>

<p> The queue manager scans the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> periodically. The scan
interval is controlled by the <a href="postconf.5.html#queue_run_delay">queue_run_delay</a> parameter. While a "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>
scan is in progress, if an "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a> scan is also in progress
(ideally these are brief since the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a> should be short), the
queue manager alternates between looking for messages in the "<a href="QSHAPE_README.html#incoming_queue">incoming" queue</a>
and in the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>. This "round-robin" strategy prevents
starvation of either the "<a href="QSHAPE_README.html#incoming_queue">incoming"</a> or the "<a href="QSHAPE_README.html#deferred_queue">deferred" queues</a>. </p>

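
<p> The alternation described above can be sketched as a generator
that takes one message from each scan in turn. This is a simplified
illustration, not the queue manager's code: the function name
round_robin is hypothetical, and the sketch ignores the "active"
queue limits and retry-time checks that apply to real scans. </p>

<blockquote>
<pre>
```python
from itertools import zip_longest


def round_robin(incoming, deferred):
    """Yield messages alternately from two scans until both are exhausted."""
    exhausted = object()  # marker for the scan that ran out first
    for pair in zip_longest(incoming, deferred, fillvalue=exhausted):
        for message in pair:
            if message is not exhausted:
                yield message


order = list(round_robin(["new1", "new2", "new3"], ["old1", "old2"]))
print(order)  # ['new1', 'old1', 'new2', 'old2', 'new3']
```
</pre>
</blockquote>

<p> Neither list can hold back the other: a long "deferred" scan delays
new mail by at most one message per step, and vice versa. </p>
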
<p> Each "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> scan only brings a fraction of the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>
back into the "<a href="QSHAPE_README.html#active_queue">active" queue</a> for a retry. This is because each
message in the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> is assigned a "cool-off" time when
it is deferred. This is done by time-warping the modification
time of the queue file into the future. The queue file is not
eligible for a retry until its modification time has been reached.
</p>

<p> The "cool-off" time is at least $<a href="postconf.5.html#minimal_backoff_time">minimal_backoff_time</a> and at
most $<a href="postconf.5.html#maximal_backoff_time">maximal_backoff_time</a>. The next retry time is set by doubling
the message's age in the queue, and adjusting up or down to lie
within the limits. This means that young messages are initially
retried more often than old messages. </p>

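
<p> The rule above can be worked through numerically. The sketch
below assumes one reading of it: the message is retried when its age
has doubled, so the cool-off equals the current age, clamped to the
two limits. The function name cool_off and the limit values of 300
and 4000 seconds are illustrative assumptions, not values taken from
this document. </p>

<blockquote>
<pre>
```python
def cool_off(age, minimal_backoff_time, maximal_backoff_time):
    """Seconds to wait before the next retry of a message of the given age.

    Waiting `age` seconds doubles the age at the next attempt; the result
    is then adjusted up or down to lie within the two backoff limits.
    """
    return min(max(age, minimal_backoff_time), maximal_backoff_time)


MIN_BACKOFF = 300    # illustrative value, in seconds
MAX_BACKOFF = 4000   # illustrative value, in seconds

print(cool_off(60, MIN_BACKOFF, MAX_BACKOFF))     # young message: 300
print(cool_off(1000, MIN_BACKOFF, MAX_BACKOFF))   # within limits: 1000
print(cool_off(10000, MIN_BACKOFF, MAX_BACKOFF))  # old message: 4000
```
</pre>
</blockquote>

<p> The three cases show the pattern described in the text: retries
start out closely spaced and settle at the maximal interval as the
message ages. </p>
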
<p> If a high volume site routinely has large "<a href="QSHAPE_README.html#deferred_queue">deferred" queues</a>, it
may be useful to adjust the <a href="postconf.5.html#queue_run_delay">queue_run_delay</a>, <a href="postconf.5.html#minimal_backoff_time">minimal_backoff_time</a> and
<a href="postconf.5.html#maximal_backoff_time">maximal_backoff_time</a> to provide short enough delays on first failure
(Postfix ≥ 2.4 has a sensibly low minimal backoff time by default),
with perhaps longer delays after multiple failures, to reduce the
retransmission rate of old messages and thereby reduce the quantity
of previously deferred mail in the "<a href="QSHAPE_README.html#active_queue">active" queue</a>. If you want a really
low <a href="postconf.5.html#minimal_backoff_time">minimal_backoff_time</a>, you may also want to lower <a href="postconf.5.html#queue_run_delay">queue_run_delay</a>,
but understand that more frequent scans will increase the demand for
disk I/O. </p>

<p> One common cause of large "<a href="QSHAPE_README.html#deferred_queue">deferred" queues</a> is failure to validate
recipients at the SMTP input stage. Since spammers routinely launch
dictionary attacks from unrepliable sender addresses, the bounces
for invalid recipient addresses clog the "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> (and at high
volumes proportionally clog the "<a href="QSHAPE_README.html#active_queue">active" queue</a>). Recipient validation
is strongly recommended through use of the <a href="postconf.5.html#local_recipient_maps">local_recipient_maps</a> and
<a href="postconf.5.html#relay_recipient_maps">relay_recipient_maps</a> parameters. Even when bounces drain quickly they
inundate innocent victims of forgery with unwanted email. To avoid
this, do not accept mail for invalid recipients. </p>

<p> When a host with lots of deferred mail is down for some time,
it is possible for the entire "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a> to reach its retry
time simultaneously. This can lead to a very full "<a href="QSHAPE_README.html#active_queue">active" queue</a> once
the host comes back up. The phenomenon can repeat approximately
every <a href="postconf.5.html#maximal_backoff_time">maximal_backoff_time</a> seconds if the messages are again deferred
after a brief burst of congestion. Perhaps a future Postfix release
will add a random offset to the retry time (or use a combination
of strategies) to reduce the odds of repeated complete "<a href="QSHAPE_README.html#deferred_queue">deferred" queue</a>
flushes. </p>

<h2><a name="credits">Credits</a></h2>

<p> The <a href="qshape.1.html">qshape(1)</a> program was developed by Victor Duchovni of Morgan
Stanley, who also wrote the initial version of this document. </p>

</body>

</html>