[volunteers] System was near RAM exhaustion
Rick Moen
rick at linuxmafia.com
Tue Jul 7 18:50:14 PDT 2009
Quoting Andrew Fife (andrewbfife at yahoo.co.uk):
> Ed, I believe you have been copied on multiple offlist messages
> detailing the progress of the mail server migration project. For the
> benefit of other volunteers, Tom Belote has been working on migrating
> SVLUG's mail server to newer hardware off and on over the last 3
> weeks. Transferring Mailman settings is tricky, and while he has been
> able to migrate the archives, other things such as the list info pages
> are proving to be difficult. The migration is not a trivial project,
> and it's difficult to say exactly when it will be complete. Don Marti
> has now agreed to work with Tom on the project and hopefully between
> the two of them we can make a bit more progress.
My thanks to you and Tom for working on that.
The referenced mail was offlist primarily because the current (legacy)
server was effectively hung at the time, on a weekend, something that
has happened a couple of times recently. (Again, my guess: a runaway
spamd process exhausted RAM. Checking logfiles supports that guess.)
Basic Mailman migration:
http://www.debian-administration.org/articles/567
I am unclear on what "other things such as the list info pages are
proving to be difficult" means. The mailing list definitions live, in
Python "pickled" (binary tokenised) format, in /var/local/mailman/lists
-- the corresponding location in standard, packaged Mailman being
/var/lib/mailman/lists . Since you didn't really say what the problem
is, it's difficult to help.
It's always been conceivable, if unlikely, that Mailman (or Python)
would change the format for "pickled" stored data in a way that makes
newer releases of Mailman unable to read some older version's
list-definition files -- but I've never encountered that, and have no
special reason to think it's the problem you hint at.
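For what it's worth, one quick test on the new host is simply to try
unpickling each list's config.pck. A rough sketch (Python 2, which
Mailman 2.x runs on); the paths here are guesses based on the layout
described above:

    import cPickle
    import glob
    import os
    import sys

    # So any Mailman-specific classes stored in the pickles can be
    # resolved; adjust if the new host's Mailman lives elsewhere.
    sys.path.insert(0, '/var/lib/mailman')

    OLD_LISTS_DIR = '/var/local/mailman/lists'   # copied-over legacy tree

    for pck in sorted(glob.glob(os.path.join(OLD_LISTS_DIR, '*',
                                             'config.pck'))):
        listname = os.path.basename(os.path.dirname(pck))
        try:
            f = open(pck, 'rb')
            try:
                cPickle.load(f)
            finally:
                f.close()
            print '%s: config.pck unpickles cleanly' % listname
        except Exception, e:
            print '%s: problem reading %s: %s' % (listname, pck, e)

If every list prints "unpickles cleanly", format incompatibility isn't
the problem.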
I hope you'll have seen (and maybe even read) my post elsewhere about a
cronjob Marc Merlin created on the legacy host that has been
periodically dumping all of the mailing lists' configurations to ASCII
files in the $MAILMAN_BASEDIR/backup directory (if memory serves -- and
"backup" could be off a level from that). One of the reasons to do that
is in case a situation of no forward compatibility, such as I describe
above, ever arises: You can either use the same tool[1] to import the
dumped configs as the script uses to dump them, or can just recreate the
configs manually in reference to the ASCII dumps.
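If it ever comes to that, re-importing the dumps is just a loop over
Mailman's config_list tool (the same tool the dump script runs, with -o
for output instead of -i for input). A hedged sketch; the paths and the
dump-file naming are assumptions on my part:

    import glob
    import os
    import subprocess

    MAILMAN_BIN = '/var/lib/mailman/bin'     # packaged Mailman's bin/
    DUMP_DIR = '/var/local/mailman/backup'   # where the cronjob wrote dumps

    for dump in sorted(glob.glob(os.path.join(DUMP_DIR, '*.config'))):
        listname = os.path.splitext(os.path.basename(dump))[0]
        # Each list must already exist (bin/newlist) before importing.
        # Equivalent to running:  bin/config_list -i <dumpfile> <listname>
        subprocess.check_call([os.path.join(MAILMAN_BIN, 'config_list'),
                               '-i', dump, listname])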
You mention the "list info pages". Do you mean the pages under
http://lists.svlug.org/lists/listinfo/ ? What exactly do you mean by
"migrating" them?
The displayed HTML, e.g., on
http://lists.svlug.org/lists/listinfo/volunteers-old , is the result of
the Mailman CGI merging details from the list's configuration into a
Mailman template. Normally, the templates don't get touched: You just
use what Mailman supplies.
I mentioned volunteers-old for a reason, though, it being an exception.
Since it's an archive-only mailing list, I actually _did_ edit the
template to remove template contents appropriate only to a live,
postable mailing list, such as the posting address. You can use the
admin Web interface to view and edit a list's template, e.g., here:
http://lists.svlug.org/lists/edithtml/volunteers-old
Getting back to other details of your post: You mentioned "other
hardware". If memory serves, this is a nice P4-based system, with the
sole significant problem of having only a single 80 GB PATA drive
(versus current mirrored pair of 18GB SCSI drives in the donated VA
Linux Systems model 2230, the pair of 9GB boot drives being asserted to
have "failed"):
Offlist, you commented:
> (and the case isn't set up to take a second w/o
> a bit of jury-rigging), so RAID provides little advantage. [...]
We'll get to that "little advantage" bit, below.
> Furthermore, the mail server currently takes up about 10GB on the
> current system, and we figured that, while RAID on the 2 18GB drives
> would provide protection against drive failure, running out of disk
> might not be that far off in the future.
This disk-space analysis turned out to be incorrect, as I explained
offlist. The files we still care about currently comprise about 3 GB
without any attempt at cleanup, and historical disk-consumption rates,
if continued, would not exhaust an 18GB disk pair until more than
halfway through the 21st century.
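For illustration only, with assumed figures (roughly 3 GB accumulated
over roughly nine years of list hosting):

    # Back-of-the-envelope arithmetic; all inputs are assumptions.
    used_gb = 3.0            # data we actually care about, approximate
    years_of_service = 9.0   # assumed age of that accumulated data
    capacity_gb = 18.0       # size of the mirrored SCSI pair

    rate = used_gb / years_of_service            # ~0.33 GB/year
    years_left = (capacity_gb - used_gb) / rate  # ~45 more years
    print "Roughly %.0f more years until the pair fills (around %d)" % (
        years_left, 2009 + int(years_left))

That works out to somewhere around 2054, hence "more than halfway
through the 21st century".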
> (More on our RAID alternative below) [...]
> Lastly, here is how we intend to hedge against drive failure
> on the dual-core system in lieu of RAID
> -- take an image of the system when it is final and tested
> -- set up an additional mail server as a standby that is
> synced with the primary mail server weekly. [...]
First off, I'd gladly donate an 80GB PATA drive if I happened to have
one sitting around, but my leftovers are almost entirely antique SCSI.
Getting back to the main point: A hot-spare server is a good thing in
itself. At the same time, it's a whole lot more elaborate to set up
than a RAID1 pair. _And_ it's not addressing the same threat model.
It sort of is, and sort of isn't.
Sysadmins and network security people talk about threat models. Please
pardon the jargon. It means "something that can plausibly go wrong,
and what we have in place to handle the situation if/when it does".
Redundancy, backups, and archival storage address different threat
models. See http://linuxmafia.com/faq/Admin/backup-strategy.html , and
skip down to the phrase "The topic of data backup herewith returns".
RAID1 is of course redundancy. The threat it addresses is single-drive
failure. Its presence means you do not lose state of any files on the
drive. ("State" in this context means history. If you have to revert
to a five-day-old backup, you have lost five days' worth of state.)
If all vital filesystems are either RAID1-mirrored or on a drive that
didn't fail, then you also have near-zero downtime (and no need for
rebuild work) from single drive failures. You receive e-mail from
mdadm's monitor that the array is running degraded, you take a spare
drive off the shelf, and you visit Via.net during business hours. You
identify which drive is bad, down the server, open the box, put in the
new drive, power it up, and tell the md driver to remirror onto the
new drive.
And you're done.
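If you'd rather not rely solely on the mail, the same information can
be read straight out of /proc/mdstat. An illustrative sketch that just
flags a degraded mirror:

    import re

    # Scan /proc/mdstat; "[UU]" means both mirror halves are up, and an
    # underscore (e.g. "[U_]") means the array is running degraded.
    current_md = None
    for line in open('/proc/mdstat'):
        m = re.match(r'(md\d+)\s*:', line)
        if m:
            current_md = m.group(1)
            continue
        status = re.search(r'\[([U_]+)\]', line)
        if current_md and status and '_' in status.group(1):
            print '%s is running degraded -- time to swap in a spare' \
                  % current_md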
Also, if for any reason you want to peel off an "image" of the drive,
just down the server, remove one drive from the mirror pair and put it
in your backpack, put a fresh, empty drive in, and tell md to remirror
onto it. The drive in your backpack is a portable copy of the running
mirror pair.
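The "tell md to remirror onto it" step is a single command; here is a
sketch with made-up device names (the real array and partition names
would come from the box itself):

    import subprocess

    MD_DEVICE = '/dev/md0'        # the mirrored array (assumption)
    NEW_PARTITION = '/dev/sdb1'   # partition on the freshly fitted drive

    # Equivalent to running:  mdadm --manage /dev/md0 --add /dev/sdb1
    subprocess.check_call(['mdadm', '--manage', MD_DEVICE,
                           '--add', NEW_PARTITION])
    # md then rebuilds the mirror in the background; progress is visible
    # in /proc/mdstat.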
I mentioned that a (theoretical) hot-spare server addresses sort of the
same threat model as RAID1 does, and sort of not. To see this, consider
what happens, by comparison, if the single 80GB drive fails and you
want to fail over to the hot spare.
First off, you're losing on average a half-week of file state. That's
not a tragedy. I'm just mentioning it. With RAID1, you lose nothing
from (single) drive failure, and have continued server operation.
Second, you will either be doing DNS-based failover, or will be
counting on re-IPing the hot-spare machine to Via.net's IP and
driving it over and racking it. DNS changeover will take approximately
the DNS Time To Live (TTL) value of 1 day, plus two additional days on
account of pervasive DNS caching in excess of TTL. (Changeover time
at any given Internet location thus depends on locally relevant
caching.)
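Rough arithmetic on that window, using the figures above:

    SECONDS_PER_DAY = 86400
    ttl = 1 * SECONDS_PER_DAY              # current zone TTL (one day)
    cache_overshoot = 2 * SECONDS_PER_DAY  # caches that ignore the TTL

    worst_case = ttl + cache_overshoot
    print "Expect up to about %d days before everyone sees the new IP" \
          % (worst_case // SECONDS_PER_DAY)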
Third, you will be spending time verifying that the failover host is
really doing the intended job -- and fixing any problems. The only way
to avoid this is to do test failovers in advance (which is of course a
damned good idea, as testing solutions is the only way you really know
they work).
In contrast, RAID1 failover is a great deal simpler, and to a close
approximation Just Works as soon as /proc/mdstat reports that it has
happened.
Just to make sure I cover one vital point: As I said, RAID1 mirroring
is an appropriate precaution against _single_ drive failure. When my
home server was fried by a huge power spike in April, _both_ 9GB SCSI
drives were toasted, along with the motherboard, PSU, and one of the two
sticks of RAM. If those drives had been mirrored, they'd have been just
as dead.
Fortunately, I had backup datasets elsewhere.
A good sysadmin considers _all_ plausible threat models.