Wednesday, June 16, 2010

Temporary Death

Your roof, as you know very well, has been recently repaired, and has no aperture by which even a Woman could penetrate.
Flatland, pg 81
by Edwin A. Abbott
My group at work has three main clients: 1) customers who use our product in our tightly integrated ecosystem, 2) customers who use a variant in a more traditional environment, and 3) internal clients using us to build even more complex products for customers.

Sadly (1) is what we end up testing the most.

One part of our system is heavily used by (2) and (3) but not (1). So it is no wonder that we just now found out (internally) about a race condition with a delay loaded object in software that is about 7 years old and used in mission critical applications. I know what you might be thinking but this will not be a sordid tale about the woes of singletons.

We have a critical section around the delay loading. I recently got a bug report on a crash when we have two threads racing to use the object on process start. I sat back baffled for a bit not seeing the problem in our code so moving on to blaming the tools. I know, big mistake but at some point libraries do fail. It turns out the problem was with us and was fairly simple

A little background for non-C++ folk:
In C++ there is an idiom called RAII which allows for easy handling of resources. In this case when tScopedLock is created the lock is acquired. At the end of scope the lock is released. Whenever I see a garbage collected OOP language without an equivalent a little part of me dies having to see all of those "finally" clauses. I was very glad Python added the "with" statement and "from __future__ import with_statement" can be found in most of my python scripts.
void tClass::loadIfNeeded()
{
tScopedLock(lock);
if(!loaded)
{
load();
}
}
With a coding convention that is discouraging of spaces things tend to bleed together. So sadly it took me digging into the Windows Critical Sections data structure to realize when the count inside it was being modified. So I went and put the fix in

void tClass::loadIfNeeded()
{
tScopedLock scopedLock(lock);
if(!loaded)
{
load();
}
}

Oh the joys of accidental temporaries and their scope which lasts for the expression they appear in (usually).

Of course I was a (mostly) good citizen. I grepped through our code and fixed about 5 instances of this problem. I then added a check in our (very limited) static analysis to watch for people adding them. On an internal forum people came up with some pre-proc magic to try and help. Maybe later I'll investigate that.