This may well turn out to be another oops. Sometimes when I screw
around with the mail system, it's a big win, and
sometimes it's a big
lose. I don't know yet how this will turn out.
Since I moved house, I have all sorts of internet-related problems
that I didn't have before. I used to do business with a small ISP,
and I ran my own web server, my own mail service, and so on. When
something was wrong, or I needed them to do something, I called or
emailed and they did it. Everything was fine.
Since moving, my ISP is Verizon. I have great respect for Verizon as a
provider of telephone services. They have been doing it for over a
hundred years, and they are good at it. Maybe in a hundred years they
will be good at providing computer network services too. Maybe it
will take less than a hundred years. But I'm not as young as I once
was, and whenever that glorious day comes, I don't suppose I'll be around
to see it.
One of the unexpected problems that arose when I switched ISPs was
that Verizon helpfully blocks incoming access to port 80. I had moved
my blog to outside hosting anyway, because the blog was consuming too
much bandwidth, so I moved the other plover.com web services to the
same place. There are still some things that don't work, but I'm
dealing with them as I have time.
Another problem was that a lot of sites now rejected my SMTP
connections. My address was in a different netblock. A Verizon DSL
netblock. Remote SMTP servers assume that anybody who is dumb enough
to sign up with Verizon is also too dumb to run their own MTA. So any
mail coming from a DSL connection in Verizonland must be spam,
probably generated by some Trojan software on some infected Windows
box.
The solution here (short of getting rid of Verizon) is to relay the
mail through Verizon's SMTP relay service: mail.plover.com
sends to outgoing.verizon.net, and
outgoing.verizon.net forwards the mail to its final
destination. Fine.
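In qmail, this kind of blanket relaying goes in the control/smtproutes file (assuming a stock installation): a route whose domain part, before the colon, is empty matches every destination. So a one-line /var/qmail/control/smtproutes like this hands all outbound mail to Verizon's relay:

```
:outgoing.verizon.net
```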
But but but.
If my machine sends more than X messages per
Y time, outgoing.verizon.net will assume that
mail.plover.com has been taken over by a Trojan spam
generator, and cut off access. All outgoing mail will be rejected with a
permanent failure.
So what happens if someone sends a message to one of the
500-subscriber email lists that I host here? mail.plover.com
generates 500 outgoing messages, sends the first hundred or so through
Verizon. Then Verizon cuts off my mail service. The mailing list
detects 400 bounce messages, and unsubscribes 400 subscribers. If any
mail comes in for another mailing list before Verizon lifts my ban,
every outgoing message will bounce and every subscriber
will be unsubscribed.
One solution is to get a better mail provider. Lorrie has an
Earthlink account that comes with outbound mail relay service. But
they do the same thing for the same reason. My Dreamhost subscription
comes with an outbound mail relay service. But they do the same thing
for the same reason. My Pobox.com account comes with an
unlimited outbound mail relay service. But they require SASL
authentication. If there's a SASL patch for qmail, I haven't been able
to find it. I could implement it myself, I suppose, but I don't
wanna.
So far there are at least five solutions that are on the "eh, maybe,
if I have to" list:
- Get a non-suck ISP
- Find a better mail relay service
- Hack SASL into qmail and send mail through Pobox.com
- Do some skanky thing with serialmail
- Get rid of qmail in favor of postfix, which presumably supports SASL
(Yeah, I know the Postfix weenies in the audience are shaking their
heads sadly and wondering when the scales will fall from my eyes.
They show up at my door every Sunday morning in their starched white
shirts and their pictures of DJB with horns and a pointy tail...)
It also occurred to me in the shower this morning that the old ISP might be
willing to sell me mail relaying and nothing else, for a small fee.
That might be worth pursuing. It's gotta be easier than turning qmail-remote
into a
SASL mail client.
The serialmail thing is worth a couple of sentences, because there's an
autoresponder on the qmail-users mailing-list that replies with "Use serialmail. This is discussed
in the archives." whenever someone says the word "throttle". The serialmail
suite, also written by Daniel J. Bernstein, takes a
maildir-format directory and posts every message in it to some remote
server, one message at a time. Say you want to run qmail on your laptop.
Then you arrange to have qmail deliver all its mail into a maildir, and
then when your laptop is connected to the network, you run serialmail, and it
delivers the mail from the maildir to your mail relay host. serialmail is
good for some throttling problems. You can run serialmail under control of a
daemon that will cut off its network connection after it has written a
certain amount of data, for example. But there seems to be no easy
way to do what I want with serialmail, because it always wants to deliver
all the messages from the maildir, and I want it to deliver
one message.
There have been some people on the qmail-users mailing-list asking for something close to
what I want, and sometimes the answer was "qmail was designed to deliver
mail as quickly and efficiently as possible, so it won't do what you
want." This is a variation of "Our software doesn't do what you want,
so I'll tell you that you shouldn't want to do it." That's another
rant for another day. Anyway, I shouldn't badmouth the qmail-users mailing list, because the
archives did get me what I wanted. It's only a stopgap solution, and
it might turn out to be a big mistake, but so far it seems okay, and
so at last I am coming to the point of this article.
I hacked qmail to support outbound message rate throttling. Thanks to a
suggestion from Richard Lyons on the qmail-users mailing list, it was much
easier to do than I had initially thought.
Here's how it works. Whenever qmail wants to try to deliver a message to
a remote address, it runs a program called qmail-remote. qmail-remote is responsible for
looking up the MX records for the host, contacting the right server,
conducting the SMTP conversation, and returning a status code back to
the main component. Rather than hacking directly on qmail-remote, I've
replaced it with a wrapper. The real qmail-remote is now in
qmail-remote-real. The qmail-remote program is now written in Perl.
It maintains a log file recording the times at which the last few
messages were sent. When it runs, it reads the log file, and a policy
file that says how quickly it is allowed to send messages. If it is
okay to send another message, the Perl program appends the current
time to the log file and invokes the real qmail-remote. Otherwise, it sleeps
for a while and checks again.
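Here is a minimal sketch of that wrapper idea — not the actual program: the history file name, the fixed policy numbers, the naive one-second sleep, and the disabled exec at the end are all stand-ins for illustration:

```perl
#!/usr/bin/perl
# Sketch of a throttling qmail-remote wrapper. Everything concrete here
# (file names, policy numbers) is an assumption, not the real program.
use strict;
use warnings;
use Fcntl qw(:flock);

my $HIST = "history.log";     # log of recent send times (assumed name)
my ($msgs, $time) = (8, 60);  # assumed policy: at most 8 messages per 60 s

# Read the log and keep only timestamps from the last $time seconds.
sub recent_sends {
    open my $fh, "<", $HIST or return ();   # no log yet: no recent sends
    flock $fh, LOCK_SH or die "flock: $!";
    chomp(my @t = <$fh>);
    close $fh;
    return grep { $_ >= time() - $time } @t;
}

my @last = recent_sends();
while (@last >= $msgs) {
    sleep 1;                  # the real program computes a smarter delay
    @last = recent_sends();   # reread: other processes may have sent too
}

# Record this send, then hand off to the real delivery program.
open my $out, ">>", $HIST or die "open $HIST: $!";
flock $out, LOCK_EX or die "flock: $!";
print $out time(), "\n";
close $out;

# exec "/var/qmail/bin/qmail-remote-real", @ARGV;  # disabled in this sketch
print "would now run qmail-remote-real\n";
```

The real program records its send time before execing, exactly so that concurrent instances can see it in the log.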
The program is not strictly correct. It has some race conditions.
Suppose the policy limits qmail to sending 8 messages per minute. Suppose
7 messages have been sent in the last minute. Then six instances of
qmail-remote might all run at once, each decide that it is OK to send a
message, and each send one. Then 13 messages have been sent in the last
minute, which
exceeds the policy limit. So far this has not been much of a
problem. It's happened twice in the last few hours that the system
sent 9 messages in a minute instead of 8. If it worries me too much,
I can tell qmail to run only one qmail-remote at a time, instead of 10. On a normal
qmail system, qmail speeds up outbound delivery by running multiple qmail-remote
processes concurrently. On my crippled system, speeding up outbound
delivery is just what I'm trying to avoid. Running at most one qmail-remote at
a time will cure all race conditions. If I were doing the project
over, I think I'd take out all the file locking and such, and just run
one qmail-remote. But I didn't think of it in time, and for now I think I'll
live with the race conditions and see what happens.
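For the record, the knob for that is qmail's control/concurrencyremote file, which contains a single number: the maximum number of simultaneous qmail-remote processes. On a stock install, a one-line /var/qmail/control/concurrencyremote containing

```
1
```

serializes remote delivery entirely.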
So, let's see. What else is interesting about this program? I made
at least one error, and almost made at least one more.
The almost-error was this: The original design for the program was
something like:
- do
    - lock the history file, read it, and unlock it
  until it's time to send a message
- lock the history file, update it, and unlock it
- send the message
This is a classic mistake in writing programs that run concurrently
and update a file. The problem is that process A updates the file
after process B reads it but before B updates it. Then B's update
will destroy A's.
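The lost update is easy to see if you interleave the two read-modify-write sequences by hand; in this toy version a scalar stands in for the history file:

```perl
# Two "processes", A and B, each read the shared value, then write back
# an update. Because B read before A wrote, A's update is lost.
use strict;
use warnings;

my $file = 0;             # shared state, standing in for the history file
my $copy_a = $file;       # A reads
my $copy_b = $file;       # B reads, before A has written back
$file = $copy_a + 1;      # A writes its update
$file = $copy_b + 1;      # B writes, clobbering A's update
print "$file\n";          # prints 1, not 2: one update was destroyed
```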
One way to fix this is to have the processes append to the history
file, but never remove anything from it. That is clearly not a
sustainable strategy. Someone must remove expired entries from the
history file.
Another fix is to have the read and the update in the same critical
section:
- lock the history file
- do
    - read the history file
  until it's time to send a message
- update the history file and unlock it
- send the message
But that loop could take a long time, during which no other
qmail-remote process
can make progress. I had decided that I wanted to try to retain the
concurrency, and so I wasn't willing to accept this.
Cleaning the history file could be done by a separate process that
periodically locks the file and rewrites it. But instead, I have the
qmail-remote processes do it on the fly:
- do
    - lock the history file, read it, and unlock it
  until it's time to send a message
- lock the history file, read it, update it, and unlock it
- send the message
I'm happy that I didn't actually make this mistake. I only thought
about it.
Here's a mistake that I did make. This is the block of code
that sleeps until it's time to send the message:
while (@last >= $msgs) {
    my $oldest = $last[0];
    my $age = time() - $oldest;
    my $zzz = $time - $age + int(rand(3));
    $zzz = 1 if $zzz < 1;
    # Log("Sleeping for $zzz secs");
    sleep $zzz;
    shift @last while $last[0] < time() - $time;
    load_policy();
}
The throttling
policy is expressed by two numbers,
$msgs and
$time,
and the program tries to send no more than
$msgs messages per
$time seconds. The
@last array contains a list of
Unix epoch timestamps of the times at which the messages of the last
$time seconds were sent.
So the loop condition checks to see whether fewer than
$msgs
messages were sent in the last
$time seconds. If so, the
program continues immediately, possibly posting its message. (It
rereads the history file first, in case some other messages have been
posted while it was asleep.)
Otherwise the program will sleep for a while. The first three lines
in the loop calculate how long to sleep for. It sleeps until the time
the oldest message in the history will fall off the queue, possibly
plus a second or two. Then the crucial line:
shift @last while $last[0] < time() - $time;
which discards the expired items from the history. Finally, the call
to
load_policy() checks to see if the policy has changed, and
the loop repeats if necessary.
The bug is in this crucial line. If @last becomes empty,
this line turns into an infinite busy-loop. It should have been:
shift @last while @last && $last[0] < time() - $time;
Whoops. I noticed this this morning when my system's load was around
12, and eight or nine
qmail-remote processes were collectively eating 100% of
the CPU. I would have noticed sooner, but outbound deliveries hadn't
come to a complete halt yet.
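Stripped of the surrounding program, the fix is just a short-circuit guard on the emptiness test. A toy run with made-up timestamps:

```perl
# Expire old entries from a history of send times, without looping
# forever when the list empties. Timestamps here are invented.
use strict;
use warnings;

my $time = 60;                                 # policy window, in seconds
my $now  = 1_000_000;                          # pretend value of time()
my @last = ($now - 90, $now - 70, $now - 10);  # two expired, one fresh

# Buggy form: shift @last while $last[0] < $now - $time;
# Once every entry has expired, $last[0] is undef (which compares as 0),
# so the condition stays true and the loop spins forever.
shift @last while @last && $last[0] < $now - $time;

print "@last\n";   # only the fresh timestamp, 999990, survives
```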
Incidentally, there's another potential problem here arising from the
concurrency. A process will complete the sleep loop in at most
$time+3 seconds. But then it will go back and reread the history
file, and it may have to repeat the loop. This could go on
indefinitely if the system is busy. I can't think of a good way to
fix this without getting rid of the concurrent qmail-remote processes.
Here's the code. I
hereby place it in the public domain. It was written between 1 AM and
3 AM last night, so don't expect too much.