UrBackup Client Unstable crashing on Windows XP

I just noticed that the new Unstable UrBackup client crashed on Windows XP. Looking at the logfile did not give any indication of where the error came from. I could, however, say where it did not come from, because only one thread crashed while the others continued writing debug info into the logfile.

To get more information about where this error was happening, I installed userdump on the XP laptop (http://support.microsoft.com/kb/241215). This gave me a nice image of the process memory at the point of failure. Sadly, Visual Studio cannot display much information about the dump file; for example, I could not get it to display a call stack. I went on to install WinDbg, which is part of the Windows SDK. It had all the information needed to pinpoint the problem: it showed a call stack including line numbers in the source files. It was, however, mysterious how it got the locations of the source files and the line number of the error, because I had of course used a release build on the XP laptop, and release builds do not include debug information. Strange.

Even though it could display everything quite fine, WinDbg complained about missing debug information, which is, as explained, only natural. But then why could it show the call stack with function names?

Analyzing the information WinDbg provided did not help: the error was at a position where normally no error should occur. I double-checked.

So whatever magic WinDbg does must be wrong. Right? I went on to point WinDbg at the right debug information, but that did not change the position of the error in the code. I was just in the process of collecting all the right DLLs to get a debug build running on the XP laptop when the day-saving thing happened: I cleaned the whole project and rebuilt everything. The universal advice for every computer-related problem: “Have you tried turning it off and on again?”. Of course it worked perfectly after that.

Visual Studio must have done something wrong in calculating what it had to rebuild, leaving the XP build target outdated while it ran against a more recent host process. This caused a function to be called without a parameter it now has, which in turn caused the memory access error.

Once again the solution was easy but finding the solution was hard.

Why you should shut down an application gracefully even though it’s difficult and you think you don’t need to

UrBackup is a highly threaded application. Currently, approximately five threads are started for every client a server has. Additionally, there is a thread pool for requests to the web interface.
Some people would claim this is a difficulty in itself, because managing resources gets harder the more threads you have. Anticipating this, I built UrBackup using message passing and the implicit synchronization it provides.
Partly because of this, I didn’t have that many problems.
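To illustrate what I mean by message passing with implicit synchronization, here is a minimal sketch of such a queue. Threads only ever touch shared state through the queue’s own lock, so producers and consumers need no further synchronization between them. The class and method names are illustrative, not UrBackup’s actual ones:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Minimal thread-safe message queue (illustrative sketch).
// All shared state is guarded by one mutex; a condition variable
// wakes consumers when a message arrives.
class MessageQueue {
public:
    void post(std::string msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.push(std::move(msg));
        }
        cond_.notify_one();
    }

    // Blocks until a message is available, then removes and returns it.
    std::string wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !messages_.empty(); });
        std::string msg = std::move(messages_.front());
        messages_.pop();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<std::string> messages_;
};
```

A worker thread simply loops on wait_and_pop() and reacts to each message, so the thread’s own data never needs locking.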
What is a problem is how to tear down all those threads again. One solution, which is currently “in use”, is to let the operating system handle that. Let me elaborate:

The assumption is that you designed your application in a way that allows it to be forcefully stopped at any time. You should do that for every application that saves some kind of data, because nobody can guarantee, for any length of time, that your procedure which stores data will not be interrupted and never started again. This interruption can be caused by a shortage of memory, a user killing the process, or a power outage or hardware failure killing the computer.

Now designing an application like this can be hard. Say you write some settings to a file. You cannot write them directly to it, because the write could be interrupted, leaving an invalid settings file. The common pattern in such a case is to write the settings to a temporary file and then rename that file to your settings file. This works because the POSIX standard defines the rename operation as atomic: your filesystem guarantees that the rename either happens for the whole file or not at all. (There are some different interpretations of this standard, however, which caused issues e.g. with btrfs.)
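The write-temp-then-rename pattern can be sketched like this. The function name and file names are made up for illustration; note that for full durability against power loss you would also have to flush the temporary file to disk (e.g. with fsync on POSIX) before renaming, which is exactly where the btrfs interpretation issues came up:

```cpp
#include <cstdio>   // std::rename
#include <fstream>
#include <string>

// Illustrative sketch: write settings to a temporary file, then
// atomically rename it over the real settings file. On POSIX,
// rename() replaces the target in one atomic step, so readers see
// either the old file or the new one, never a half-written mix.
bool save_settings_atomically(const std::string& path,
                              const std::string& contents)
{
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out.write(contents.data(),
                       static_cast<std::streamsize>(contents.size())))
            return false;
        out.flush();
        if (!out)
            return false;
    } // stream closed here, data handed over to the OS
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

On Windows, std::rename fails if the target already exists, so a real implementation would use something like ReplaceFile there; the sketch shows the POSIX case the text describes.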

Now doing this with a lot of data is of course not efficient. You should use a library that does it for you; UrBackup uses SQLite (an embedded database).

So you painstakingly did everything in a way that your program can be stopped at any time without data being corrupted, or too much data being lost. Do you still need to shut down your program gracefully? Assuming the only data lost is data that can be regenerated: no.

You can approach the problem from the other side as well. Assuming you have time t during shutdown to write out data so nothing important gets lost, you have to write your program in a way that only as much data accumulates as you can still save within t. Because you sometimes don’t know what t is, or how long saving your data is going to take, you have to save important data as fast as possible. This means that the data you would write at shutdown is not important, and saving it is not necessary: you do not need to save anything at shutdown.
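This reasoning can be turned into a simple invariant: cap how much unsaved data may accumulate so that whatever is pending can always be written within the shutdown budget t. A hypothetical sketch (the class name and the throughput figure are assumptions for illustration, not something UrBackup actually contains):

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch: bound the amount of unflushed data by the
// shutdown budget t times an assumed write throughput, so that a
// flush can always finish within t.
class BoundedWriteBuffer {
public:
    BoundedWriteBuffer(std::chrono::milliseconds budget,
                       std::size_t bytes_per_ms)
        : max_pending_(static_cast<std::size_t>(budget.count()) *
                       bytes_per_ms) {}

    // Buffers a record; returns true when the caller should flush now
    // because the pending data has reached the budgeted limit.
    bool append(const std::string& record) {
        pending_.push_back(record);
        pending_bytes_ += record.size();
        return pending_bytes_ >= max_pending_;
    }

    std::size_t pending_bytes() const { return pending_bytes_; }

    // Called after the caller has written everything out.
    void clear() {
        pending_.clear();
        pending_bytes_ = 0;
    }

private:
    std::size_t max_pending_;
    std::vector<std::string> pending_;
    std::size_t pending_bytes_ = 0;
};
```

With a limit like this in place, the shutdown path never has more than t’s worth of data left to write, which is the point the paragraph above makes.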

So if every application does something like that, why does e.g. Windows not encourage users to turn off the computer by cutting its power?

The first point here is of course that not every application does that. The second point is that sometimes you do not want them to: e.g. your laptop disk spins down, the application saves a little bit of data, and the disk has to spin up again in order to save it.

The third and most important point is that it often cannot be guaranteed with one hundred percent probability that the data is not corrupted by something like this. This can be the fault of the disk’s firmware or hardware, of the filesystem driver, or of your program. Do you still remember Windows laboriously checking your filesystem after you shut it down the hard way (it still does sometimes)? This was done to restore the integrity of the filesystem. Filesystems have since gotten better at guaranteeing that integrity after a forceful shutdown, but not with one hundred percent probability. They could guarantee it, but only with a performance penalty, which is why it’s not done.

UrBackup has the same problem. It is only guaranteed with a probability close to one hundred percent that its database is not corrupted after a shutdown. There is an option to further reduce this very small probability, at a huge performance penalty.

Of course, the chances of such corruption occurring should be minimized, which means that, when possible, one should not rely on the assumption that the program can be interrupted at any time.

That means, for UrBackup, that I will have to sit down and work on a clean shutdown at some point in the future.