[IRCServices Coding] Analysis of the 5.1.2 bug

Andrew Church achurch at achurch.org
Tue Oct 30 00:45:50 PDT 2007


    Now that a bit of time has passed since the release of Services
5.1.2, and (hopefully) everyone has upgraded, I thought I'd post a more
detailed explanation of exactly how the critical bug that prompted this
release came about.  (As you are probably aware, various and sundry bugs
have pushed Services all the way to 5.1.6 since then, but in this text
I'll focus on the bug that was fixed by 5.1.2.  Just as a reminder, all
5.1.x releases through 5.1.5 have bugs allowing users to crash Services,
so you should immediately upgrade to 5.1.6 if you haven't already.)

The Bug
=======

     The bug itself is fairly straightforward (as those who read the
diff file will have already realized): using the ChanServ AKICK VIEW
command on a channel with at least one autokick caused Services to
crash, without exception.  By default, the AKICK command requires
channel access level 100 to use; this is the basis for the "sufficiently
privileged users" text in the announcement.  In reality, unless channel
registration was disabled, any user could register a new channel (thus
gaining founder status) and use the AKICK command on that channel,
effectively allowing any user to crash Services.  (Ironically, when
channel registration was disabled, a separate bug--fixed in 5.1.3--had
the potential to let any user cause a crash.)

     A most embarrassing bug, indeed...

How It Came About
=================

     One of the changes implemented in Services 5.1 was to unify (from
an interface standpoint, at least; internally, there's still far too
much code duplication) the listing functionality of all pseudoclient
commands.  Among other things, this included removing the entry numbers
from the ChanServ ACCESS, XOP (SOP/AOP/HOP/VOP/NOP), and AKICK commands,
mostly to avoid the confusion that can result from multiple users
deleting entries in the middle of the list.

     Conceptually, this is a simple change, only requiring the removal
of the corresponding parameter from the code that outputs entries to the
user.  But theory and practice are different beasts, as they say; in the
case of AKICK VIEW, the data for each entry is formatted using a
language-specific string, requiring a change not only in the code itself
but also in each of the language files.  I actually made this latter
change, presumably with the intent of going back and updating the code
once I had all the language files updates out of the way.  However, it
seems that I forgot to make that final update, leaving the entry number
present in the the code that formatted autokick entries for the VIEW
subcommand.  This resulted, eventually, in something like printf("%s",1)
--which of course crashed when the program tried to dereference the
entry number as a pointer.

     Ordinarily, GCC can catch inconsistencies between format strings
and parameters like this; I always compile Services using GCC's -Werror
option, to ensure I catch as many potential problems as possible.  With
language strings, however, the format string itself is located in a data
file separate from the source code.  Since the data isn't accessible to
the compiler, these checks can't be made, and the error slipped by
unnoticed until this bug was reported.

Are Any Other Commands At Risk?
===============================

     With respect to this particular change, no.  The removal of list
entry numbers affected only the ChanServ ACCESS LIST, XOP LIST, and
AKICK LIST/VIEW commands.  Of these, XOP LIST and AKICK LIST use literal
format strings for their output, which were corrected as part of the
code change; ACCESS LIST uses the language system, but its code and
format strings were both updated properly, and it is likewise not
susceptible to this problem.

     With respect to all commands in general, all I can say is "I hope
not."  I've done my best, within the constraints of time and an eleven-
year-old codebase, to program defensively; since 5.1.2, I've also gone
over all uses of language strings, and added some checks to my release
process (see below) to help avoid any more slipping in.  But given that
I let these bugs by, who knows what else might be hiding?

Lessons To Be Learned
=====================

Lesson 1: Test suites

     Ah, our good old friend, testing.  I actually had started work on a
test suite for Services at one point, which was the main motivation for
the OperServ debug commands.  Regrettably, between a lack of time in my
earlier days of Services development (being not nearly as fluent in
progamming then as I am now, writing a test suite to interact with a
server was a formidable task, and I chose to use my time working on
Services itself instead) and a lack of interest later, I never took the
test suite beyond simple nickname operations.

     Of course, even a basic test that just ran through all the commands
would have caught this bug, and to that I can only say:  Mea culpa.
Lack of interest or otherwise, I failed to follow best practices, and
that failure came back to bite me.

Lesson 2: Format strings are dangerous

     Services makes use of the standard printf()-style format strings in
its language files, partly out of convenience and partly because I
didn't know any better when I designed the language system.  While the
flexibility of format strings is undeniably useful, their implementation
leaves much to be desired--especially in a language like C that does not
pass type information around.  In fact, inconsistencies between format
strings in different language files has resulted in similar crashes in
the past (including the less-dangerous one fixed in version 5.1.3).  Of
course, most of the danger in using format strings comes from the fact
that strings are second-class citizens in the C data type world: to
process a string, you access successive bytes through a pointer--and if
that particular value happens to be something that's not a pointer,
BOOM.  So this lesson could just as easily be "C strings are dangerous".

     So what can be done about it?  Well, one of the obvious solutions--
and probably the best idea in the long run--is to rewrite the
pseudoclients in a higher-level language, such as Perl; Perl's dynamic
typing ensures that this sort of problem can't occur.  (Then again,
dynamically typed languages have their own problems, so this isn't a
perfect solution.)  I'd actually had thoughts about implementing a Perl
base for pseudoclients in a future version of Services, but I don't have
nearly enough time or interest to do so at this point.

     Another option would be to change the language file system to use a
custom formatting system.  Parameters could be named, for example, and
an array of structures including parameter names and data types along
with the parameters themselves could be passed to routines like
notice_lang().  Of course, this has the downside of turning what's
currently a simple list of parameters into a more complex variable
declaration; and since the data types would have to be specified
manually, there's still the danger of mismatched types, leading to
potential crashes.  Nonetheless, this might be a more feasible approach
for those who are interested in improving Services' robustness but don't
want to go as far as rewriting half of it in Perl.

     I've implemented two crude solutions in Services 5.1.3.  One is the
use of a script, modified from one provided by German translator Jacek
Margos, to check all language files for format string inconsistencies at
release time.  The other is a simple compilation check: at release time,
I run a test compilation where notice_lang() and notice_help() calls
using fixed message IDs are replaced with notice() calls that use the
corresponding strings from the English language file (lang/en_us.l).
The resulting executable is of course useless from a multilingual point
of view, but the compiler will complain about any cases where format
strings and parameters don't match.  While this won't catch cases where
the format string is selected by a variable, it would have found the
particular bug fixed in version 5.1.2.

Final Thoughts
==============

     Needless to say, this bug in particular, as well as the fact that I
had to make five releases of a supposedly "stable" version of Services
in the space of ten days to correct other bugs of varying seriousness,
is quite embarrassing.  I'm of course acutely aware of the trouble it's
caused everyone using Services, of the repeated interruptions in normal
network operations that will have resulted from upgrades (if not
crashes)--and let none say that it's "just IRC"!  But at the same time,
I take pride in my work as a software developer, and seeing myself make
such trivial mistakes as these is, well, depressing to say the least.
I could make excuses that the code base is aging (which is true), or
that I've lost too much interest in the program to do a proper job
(which is also true, to be honest--that's the main reason I declared 5.1
to be the final version of Services).  I could even claim that I've got
another project twice the sice of Services that's considerably more
stable (which is true as well, thanks in no small part to the experience
I've gained from Services itself).  But since none of that changes the
facts, all I can say in the end is that I screwed up, and I'm sorry.

  --Andrew Church
    achurch at achurch.org
    http://achurch.org/