[IRCServices] unhappy restart quirks with 5.0.9

Arathorn arathorn at theonering.net
Thu Feb 6 23:26:09 PST 2003


On Fri, 7 Feb 2003, Andrew Church wrote:

>      I'm technically not supposed to be on the Internet right now (doctor's
> orders),

Thank you for taking the time to respond to my queries whilst technically
out of the loop - it's very much appreciated indeed.  Here's some feedback
from your suggestions for when you're feeling better:

> >Firstly, on an /msg operserv shutdown, services SQUITs from the ircd - but
> >then on almost all occasions the binary continues to run, and will not be
> >killed by any signal short of a KILL (-9).
>
>      Can you look up exactly where this is occurring?  (Use gdb to attach to
> the process when it's in the "hung" state.  If you're not familiar with gdb,
> there are at least a couple other people on this list who can help you.)

I am a complete fool for not using gdb when this happened - predictably
enough, since sending the original mail I've been completely incapable of
reproducing the problem.  Services now shuts down and restarts cleanly no
matter the circumstances.  As the only thing that has changed is some
debugging work on my part on Unreal, I'm assuming that something in the
Unreal socketry was somehow to blame.  If it happens again, though, I'll
jump straight on it with gdb.

> >On a related note, /msg operserv restart normally fails in precisely the
> >same manner - but on the one occasion that it came straight back up, all
> >registered channels with a +k mode in their modelock had mysteriously lost
> >their +k and key.  Which was a bit of a pain ;)
>
>      Corrupted database?  That's the only possibility that comes to mind.
> Are you sure your CPU and memory aren't acting up?  (Try a tool like
> memtest86 [http://www.memtest86.com/].)

Well, I've spent the last few hours having much fun and games trying to
test the databases for corruption.  I did this by exporting the db to XML,
and then reimporting into a completely virgin install.  Again, I haven't
managed to recreate the "missing +k" mode bug since the original mail -
but I can guarantee that it had happened 3 times to date before that time
(once on the very first transition from 4.5.43, and twice subsequently).

In performing the XML export from the ircservices webserver I encountered
several difficulties: when downloading from the server using Mozilla or
IE, the 1076KB file was consistently truncated by ~1700 lines/43KB (the DB
is ~800 users, ~10 channels).  When downloading locally on the server
using wget in the server's shell, an arbitrary big chunky block of the
file (lines 17015-20769 out of 37962, in the particular instance I'm
looking at atm) was excitingly mangled.

On closer inspection, the 4096 byte block of data from byte offset 0x6a180
gets repeated 22 times contiguously until byte offset 0x82180.  Something
very very strange is happening - and I'm pretty damn sure that wget works
in all other situations.
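
For anyone wanting to check a download for the same kind of damage, the
repeated-block pattern described above can be looked for mechanically.  The
following is only a rough Python sketch: the function name, the default
block size and the repeat threshold are hypothetical choices for
illustration, not part of Services or wget, and this is not the method
used for the inspection above.  It scans at every byte offset because the
damaged region here did not start on a 4096-byte boundary.

```python
def find_repeated_runs(data, block=4096, min_repeats=2):
    """Return (start_offset, copies) for each region where the same
    `block`-byte chunk appears back to back at least `min_repeats` times.

    Hypothetical helper for illustration only.
    """
    runs = []
    i = 0
    n = len(data)
    while i + 2 * block <= n:
        chunk = data[i:i + block]
        copies = 1
        j = i + block
        # Count how many further copies of this chunk follow immediately.
        while data[j:j + block] == chunk:
            copies += 1
            j += block
        if copies >= min_repeats:
            runs.append((i, copies))
            i = j          # skip past the whole repeated region
        else:
            i += 1         # damage need not be block-aligned
    return runs
```

Usage would be something like `find_repeated_runs(open(path, "rb").read())`,
where `path` is the downloaded export; an empty list means no back-to-back
repeats of that block size were found.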

Anyway, by splicing the files together, I was able to come up with a valid
XML file to reimport, in the interests of ruling out actual corruption of
the db files saved to disk.
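
A spliced file like this can at least be checked for well-formedness before
reimporting.  Below is a minimal Python sketch using the standard library's
streaming XML parser; the function name is hypothetical, and it only checks
that the XML parses, not that it matches whatever structure the Services
import expects.  A repeated block could in principle still parse cleanly,
so this is a sanity check rather than proof of a good export.

```python
import io
import xml.etree.ElementTree as ET


def is_well_formed(source):
    """Return (True, None) if `source` parses as XML, else (False, reason).

    `source` may be a filename or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    try:
        # Stream through the document so a large export is not held
        # in memory all at once.
        for _event, elem in ET.iterparse(source, events=("end",)):
            elem.clear()
    except ET.ParseError as err:
        return False, str(err)
    return True, None
```

A truncated download, such as the browser copies described above, would be
expected to fail this check because the closing tags are missing.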

As regards the stability of the machine in question - it's a Dell
PowerEdge 1650 with 4GB of RAM and 2x1.26GHz PIII procs running Debian
Woody.  It's the main production server for TheOneRing.net, the biggest
Tolkien-related site on the web.  For reasonably obvious reasons, it's
pretty busy atm - the server is packed to the gills with Apache, thttpd,
mysql & CGI processes etc. cranking out around 4 million hits of various
descriptions (~700,000 pageviews) a day.  Over the last few months of
uptime, I haven't seen any evidence of memory corruption on any of those
processes - and moreover, memtest86 has been rigorously applied regardless
at points over the last 6 months whilst double checking on other problems.
So I'm inclined to assume in this instance that it isn't a hardware
problem.

> >Moreover, whilst users in the channel access lists were correctly reopped
> >on reidentifying on the server coming back up - in every channel a random
> >user also acquired an @.
>
>      That would probably be the result of CSSetChannelTimes.  A bug (or
> feature?) in Unreal requires a +/-o or similar mode change in order to
> update the channel's creation time.  Since the first user to join a channel
> always gets +o anyway, Services assumes that it's safe to send another +o
> for that first user, since it will later process the +o from the remote
> server and deop the user properly.  I don't think the case of a netjoin
> with multiple users on a channel is handled properly, though.  I'll take a
> look when I'm off vacation; for now, try turning off CSSetChannelTimes.

Turning off CSSetChannelTimes has solved this problem perfectly - serves
me right for enabling a funky sounding feature that I didn't understand
properly :)

> >Finally, in the WhatsNew for services v5, I was overjoyed to see:
> >
> >  + The Services stamp of the last user to identify for a nick is now
> >        recorded on disk, removing the necessity to re-identify when
> >        Services is restarted.
> >
> >But so far, when /msg operserv restart works at all - everyone is forced
> >to reidentify nonetheless (combined with the quirks listed above).  So I'm
> >guessing that there's something going wrong here.
>
>      This would happen if your databases were not saved after the NickServ
> IDENTIFY command.

Even after the above epic of reconstructing the databases in case of
insidious corruption, service stamps still seem not to be being saved to
disk (or perhaps are not being reread properly).  I log on, identify, oper,
/msg operserv update, /msg operserv restart - and am promptly prompted to
reidentify again.  Logs available on request ;)  This happens with both
Unreal 3.1.x and 3.2 - and I'm completely stumped by it, assuming that it's
working for everyone else in 5.0.9 :|

Needless to say, all other service information seems to be being stored
perfectly out to disk (with the possible exception of the +k channel
mlock setting).

> >On a final possibly unrelated note, I've also been getting
> >
> >[Mon Feb  3 23:23:24 2003] - select irc.theonering.net[127.0.0.1]:Bad file
> >descriptor
> >
> >error messages popping up in Unreal's ircd.log every 2-16 hours or so.
> >As services is the only thing connected to a fd on 127.0.0.1, I'm
> >wondering if there's a connection here to the above problem.
>
>      This is an Unreal bug, though it may be triggered by something in
> Services.  You'd have to ask the Unreal people for more details.

I've submitted it as an Unreal bug report - they seem to be singularly
nonplussed by it.  Hopefully something will come of it in time :)

Apologies for the long mail - hopefully some of it will be of some use or
interest at some point in the future :)

A.

________________________________________________________________
Matthew Hodgson   arathorn at theonering.net   Tel: +44 7968 722968
             Arathorn: Co-Sysadmin, TheOneRing.net®