[volunteers] SVLUG Mail Working?

Rick Moen rick at linuxmafia.com
Thu Sep 13 18:42:25 PDT 2007


Quoting Mark Weisler (mark at weisler-saratoga-ca.us):

> Hi Rick,
> Just a little concerned here...
> Just before 4pm today I sent...
> a) a reminder about the Installfest to SVLUG announce, and...
> b) a note to Volunteers offering to do a Nifty (on Nomachine)
> 
> It's now 5:41pm and I'm not seeing either of those mails. My mail is otherwise 
> working. Anything going on?

Dunno.  Let's see:

http://lists.svlug.org/lists/admindb/svlug-announce and 
http://lists.svlug.org/lists/admindb/volunteers

...do not show your messages as being held for moderator attention.

Let's SSH in and check the MTA.  

/me sshes in, su's to root, has a look around in /var/log/exim4/
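
For anyone following along, that amounts to roughly the following (the
hostname shown is just illustrative):

$ ssh lists.svlug.org
$ su -
# cd /var/log/exim4/
# ls -lrt
# less paniclog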

File "paniclog" has:

2007-09-13 12:47:50 1IVufD-0005vx-0d SA: Debug: SAEximRunCond expand returned: '1'
2007-09-13 12:47:54 1IVufD-0005vx-0d failed to write to main log: length=83 result=-1 errno=12 (Cannot allocate memory)


Well, that can't be good.

This looks OK:

# uptime
 17:53:41 up 2 days, 24 min,  2 users,  load average: 0.03, 0.04, 0.00
#

Let's check "free":

# free
             total       used       free     shared    buffers   cached
Mem:        513336     508136       5200          0      63664    27184
-/+ buffers/cache:     417288      96048
Swap:       530136     500272      29864

#


Wow, that's pretty hammered on RAM:  under 100MB free even counting
buffers/cache as reclaimable, and swap is nearly exhausted, too.

Looking at "top", I see a significant number of spamd (spamassassin)
processes, the largest five of which are taking up 56% of total RAM.
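
(Same data non-interactively, if you prefer: procps ps can sort by
resident set size.)

# ps aux --sort=-rss | head -15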

Temporarily stopping SpamAssassin and Exim4 (the MTA) frees up a huge
pile of RAM:

# free
             total       used       free     shared    buffers   cached
Mem:        513336     143676     369660          0      61760   27216
-/+ buffers/cache:      54700     458636
Swap:       530136       8272     521864

#
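
(For the record, the stopping and starting here is nothing fancier than
the stock init scripts -- assuming the usual Debian packaging:)

# /etc/init.d/spamassassin stop
# /etc/init.d/exim4 stop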

I'm starting spamd (SpamAssassin) _without_ starting Exim4, just to see
what then happens.  There are 32 instances of spamd that then start,
which is the allowed maximum:

# ps auxw | grep spamd | wc -l
     34
# 

(The two extras, beyond the 32, are the grep process itself and the
parent spamd process that spawns off all the copies that do the actual
work.)

I _hope_ that's no longer a runaway condition with spamd.  Each of the
32 instances ends up with 27MB RSS (resident set size), which is
reported as 5% of total RAM.
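
(Nominally, 32 x 27MB would be over 800MB, far more than physical RAM;
in practice the children share most of those pages copy-on-write with
the parent spamd, so the real footprint is much smaller.  Something
like this shows the per-process numbers, again with procps ps:)

# ps -C spamd -o pid,ppid,rss,pmem,args --sort=-rss | head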

"free" is now like this:

# free
             total       used       free     shared    buffers   cached
Mem:        513336     196440     316896          0      63224   31292
-/+ buffers/cache:     101924     411412
Swap:       530136       7080     523056


OK, I've just restarted Exim4, somewhat tentatively.  See, my working
theory was that a bunch of spamd processes were chewing on spam and got
stuck / hung / blew up.  SpamAssassin is a huge honking Perl script,
so there's always the chance of that happening.  The legacy www.svlug.org 
host has an _early_ version of Marc Merlin's attempt at reasonable use 
of spamd within Exim4, but Marc's setup isn't really great at
intercepting the first cut of spam just using (small, fast, C-based)
Exim4 rulesets, _before_ handing the incoming stream off to spamd.  
Therefore, www.svlug.org puts a huge load directly onto spamd, all the
time.  All it takes is a few runaway Perl/spamd processes, and you
start exhausting memory.
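
If this keeps recurring, the obvious quick mitigation -- assuming the
usual Debian spamassassin packaging, and treating these numbers as
purely illustrative -- is to cap the number of spamd children in
/etc/default/spamassassin:

   ENABLED=1
   OPTIONS="--create-prefs --max-children 5 --helper-home-dir"

...followed by an "/etc/init.d/spamassassin restart".  The longer-term
fix, per the above, would be intercepting more of the junk in Exim4
rulesets before it ever reaches spamd.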


You didn't say what address you sent your two mails from.  FWIW, 
I cannot find any sign of them in the 2007-09-13 entries in
/var/log/exim4/mainlog, where all incoming accepted mail is shown, 
or /var/log/exim4/rejectlog, where all incoming rejected mail shows up.
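
If you tell me which address you used, it's a quick check -- exigrep
comes with exim4, and a plain grep works too (the address below is just
a placeholder):

# exigrep 'sender@example.com' /var/log/exim4/mainlog
# grep -i 'sender@example.com' /var/log/exim4/rejectlog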

I'm now seeing messages being processed by Exim4, and no more signs of
memory exhaustion.  The mailman qrunner process was working.  Just to be
sure it wasn't clobbered by the prior RAM shortage, I restarted it.  
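
(The restart is just Mailman's own control script -- the path is an
assumption based on where this install lives:)

# /var/local/mailman/bin/mailmanctl restart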

Mailman has its own logs under /var/local/mailman/logs -- a horrible
idea, but one of the many unfortunate consequences of the Crazy
Frenchman's reliance on locally built betaware.  

/var/local/mailman/logs/smtp-failure had many, many pairs of lines like
this:

Sep 13 17:59:39 2007 (22833) Low level smtp error: (111, 'Connection refused'), msgid: <mailman.1.1189712965.22829.svlug at lists.svlug.org>
Sep 13 17:59:39 2007 (22833) delivery to svlug-owner at lists.svlug.org failed with code -1: (111, 'Connection refused')


5:59 PM was the last of those -- matching the time I brought an end to
the RAM crunch by killing the spamd processes.  I interpret these
failures as a secondary effect of the RAM crunch.
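
(That fits: Mailman hands outgoing mail to the local MTA over SMTP,
localhost port 25 by default, so "Connection refused" just means
nothing was answering on port 25 at that moment -- consistent with
Exim4's memory trouble, and with my stopping it at the end.  A quick
way to confirm the listener is back:)

# netstat -tlnp | grep :25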


Similarly, there were a bunch of errors like this in
/var/local/mailman/logs/error:

Sep 13 12:47:44 2007 qrunner(1300): Traceback (most recent call last):
Sep 13 12:47:55 2007 qrunner(1300):   File "/var/local/mailman/bin/qrunner", line 270, in ?
Sep 13 12:47:56 2007 qrunner(1300):      main()
Sep 13 12:47:57 2007 qrunner(1300):   File "/var/local/mailman/bin/qrunner", line 230, in main
Sep 13 12:47:57 2007 qrunner(1300):      qrunner.run()
Sep 13 12:47:58 2007 qrunner(1300):   File "/var/local/mailman/Mailman/Queue/Runner.py", line 70, in run
Sep 13 12:47:58 2007 qrunner(1300):      filecnt = self._oneloop()
Sep 13 12:47:58 2007 qrunner(1300):   File "/var/local/mailman/Mailman/Queue/Runner.py", line 94, in _oneloop
Sep 13 12:47:58 2007 qrunner(1300):      files = self._switchboard.files()
Sep 13 12:47:58 2007 qrunner(1300):   File "/var/local/mailman/Mailman/Queue/Switchboard.py", line 158, in files
Sep 13 12:47:58 2007 qrunner(1300):      for f in os.listdir(self.__whichq):
Sep 13 12:47:58 2007 qrunner(1300): OSError :  [Errno 12] Cannot allocate memory: '/var/local/mailman/qfiles/out' 


Weirdly, those started at 12:47 PM today and abruptly ended about a
minute later.  Dunno -- though the qrunner pid in the logs flips from
1300 to 22833 right around then, so it looks as if that qrunner died on
the out-of-memory error and got respawned.

/var/local/mailman/logs/post (the record of messages that make it all the
way _through_ Mailman) shows "success" lines through 11:16 AM this
morning, like this:

Sep 12 23:11:02 2007 (1300) post to mailman-owner from mailman-owner-bounces at lists.svlug.org, size=1634, message-id=<mailman.1.1189663860.7739.svlug at lists.svlug.org>, success
Sep 13 01:10:36 2007 (1300) post to mailman-owner from web-team-owner at lists.svlug.org, size=5456, message-id=<mailman.40.1189671029.1298.web-team at lists.svlug.org>, success
Sep 13 11:16:48 2007 (1300) post to mailman-owner from officers-bounces at lists.svlug.org, size=51538, message-id=<mailman.43.1189707402.1298.officers at lists.svlug.org>, success


Then, the very next entries in that log, starting at 12:49 PM, are a
long series of failures:

Sep 13 12:49:25 2007 (22833) post to jobs from kswsnhdb at blossomingbelly.com, size=4821, message-id=<01c37a2f$ff58d810$72cc1044 at kswsnhdb>, 1 failures
Sep 13 12:49:29 2007 (22833) post to svlug from svlug-bounces at lists.svlug.org, size=993, message-id=<mailman.0.1189712965.22829.svlug at lists.svlug.org>, 1 failures
Sep 13 12:49:29 2007 (22833) post to svlug from svlug-owner at lists.svlug.org, size=6539, message-id=<mailman.1.1189712965.22829.svlug at lists.svlug.org>, 1 failures
[...]
Sep 13 17:59:39 2007 (22833) post to svlug from svlug-bounces at lists.svlug.org, size=993, message-id=<mailman.0.1189712965.22829.svlug at lists.svlug.org>, 1 failures
Sep 13 17:59:39 2007 (22833) post to svlug from svlug-owner at lists.svlug.org, size=6539, message-id=<mailman.1.1189712965.22829.svlug at lists.svlug.org>, 1 failures

Then, right after 5:59 PM (when I killed the spamd processes), the log
turns back to "success":

Sep 13 18:14:41 2007 (22833) post to mailman-owner from kswsnhdb at blossomingbelly.com, size=5578, message-id=<01c37a2f$ff58d810$72cc1044 at kswsnhdb>, success
Sep 13 18:14:46 2007 (22833) post to mailman-owner from svlug-owner at lists.svlug.org, size=7695, message-id=<mailman.1.1189712965.22829.svlug at lists.svlug.org>, success
Sep 13 18:15:04 2007 (22833) post to smaug from dale at ucsc.edu, size=2583, message-id=<46E99D91.2090104 at ucsc.edu>, success


So, that, too, is absolutely consistent with simple RAM exhaustion from
runaway spamd processes.

I'm going to CC this to Volunteers.  Let's see if that works.



