Given that software development has so many ties to mathematics, I’m always surprised by how seldom it seems to show up in my day-to-day life. Today, however, was a little bit different.

## Finding success through failure

One place where some math turned out to be useful was in a set of test failures that we’ve been seeing at work. The software is run through its automated tests in a number of variations. There are two axes of variation: the style of installation and the operating system it is running on. The total number of variations that get run is simply the product of the sizes of the two axes. In this case the sizes are 4 and 14, which gives a total of 56 variations.
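As a quick illustration of that product, here is a minimal sketch that builds the full matrix of variations. The installation style and operating system names are placeholders I made up for the example; only the sizes of the two axes (4 and 14) come from the real setup.

```python
from itertools import product

# Placeholder names: the real installation styles and operating systems
# are not listed here, only how many of each there are.
install_styles = [f"install-{i}" for i in range(1, 5)]       # 4 styles
operating_systems = [f"os-{i}" for i in range(1, 15)]        # 14 systems

# Every (install style, operating system) pair is one test variation.
variations = list(product(install_styles, operating_systems))

print(len(variations))  # 4 * 14 = 56
```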

Each time the entire suite of tests ran, some seemingly random number of variations would fail for a reason that we couldn’t, and still can’t, explain. From everything that we could see, which ones failed was completely random. However, the number that failed seemed to be somewhat consistent. It always seemed to end up with around 11 of the variations failing. That meant that 11/56 ≈ 0.196, or roughly 20%, of the variations failed on a typical run.

Now, it is really tempting to take those failed variations and run just them, *and none of the others*, again. For some reason it feels like you are learning something about the failures by doing that (there are cases in which you might, but let’s not get into that right now). The key point, however, is that which variations fail has been random every time we have seen this happen. If 20% of the original set failed, then the expectation is that 20% of the rerun set would fail too.

In my real-life situation, those 11 failed variations *were* rerun. What happened? Well, what should I expect to happen? If this really is a random occurrence, then 20% of the 11 should fail again. 11 * 0.20 = 2.2. So I should expect about 2 of the variations to fail again. What actually happened? One of them failed. In such a small sample, that seems to fall pretty well inside the realm of probability.
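To put a number on "inside the realm of probability", here is a small sketch that treats each rerun as an independent coin flip with a 20% chance of failing. The independence is an assumption of the sketch, not something we have verified, and the helper name `binomial_pmf` is just mine.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures among n independent runs,
    each of which fails with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_rerun = 11    # variations that were rerun
p_fail = 0.20   # assumed per-variation failure probability

# Expected number of failures in the rerun: 11 * 0.20 = 2.2
expected_failures = n_rerun * p_fail

# Chance of seeing one failure or fewer (k = 0 or k = 1)
p_at_most_one = sum(binomial_pmf(k, n_rerun, p_fail) for k in range(2))

print(f"expected failures: {expected_failures:.1f}")
print(f"P(at most 1 failure): {p_at_most_one:.3f}")
```

Under those assumptions, seeing one failure or fewer happens roughly a third of the time, so a single failure out of 11 is not an unusual result on its own.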

Let’s step back for a bit. In this kind of situation, how many reruns would I need to get a “fully passing” run, one in which, by rerunning the failures, I eventually reach the point where every variation has passed? This comes about by repeated application of the 20%. In the first run 20% fail. In the second run 20% of the 20% fail, which is 4% of the total. On a third run 20% of the 20% of the 20% fail, which is 0.8% of the total. At that point any given variation has a 99.2% chance of having passed at least once, and with 56 variations the expected number still failing is 56 * 0.008 ≈ 0.45, which is less than one.
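The repeated application of the 20% can be sketched in a few lines. This assumes every run of every variation fails independently with the same 20% probability; the function names are mine, made up for the illustration. Note that the per-variation figure and the "all 56 at once" figure are different numbers.

```python
def still_failing_fraction(p_fail: float, rounds: int) -> float:
    """Fraction of variations expected to have failed in every one of
    `rounds` consecutive runs (the original run plus the reruns)."""
    return p_fail**rounds


def all_passed_probability(p_fail: float, rounds: int, n_variations: int) -> float:
    """Probability that every one of n_variations has passed at least
    once within `rounds` runs, assuming independent failures."""
    return (1 - p_fail**rounds) ** n_variations


p_fail = 0.20
n_variations = 56

for rounds in range(1, 5):
    frac = still_failing_fraction(p_fail, rounds)
    everything = all_passed_probability(p_fail, rounds, n_variations)
    print(
        f"after run {rounds}: {frac:.2%} still failing "
        f"(about {frac * n_variations:.2f} variations), "
        f"P(everything has passed) = {everything:.1%}"
    )
```

After three runs a single variation has a 99.2% chance of having passed, while the chance that all 56 have passed is closer to 64%. Either way, getting to green by the third or fourth run is the expected outcome, not a sign that anything was fixed.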

This is exactly what we saw when the failed variations were re-run. On the third run (the second rerun) everything had magically passed! Everything was good! Unicorns and rainbows!

Except not. This was just the inevitable outcome of rerunning the failed variations, when the failures strike randomly. The 20% chance of failure is still there. It hasn’t gone away.

Mathematics destroys my illusions.

Something was a little fishy in there. Let’s look at some of those numbers again. The first run had a 20% failure rate (11/56 = 0.19642857 … not quite 20% but close enough for our purposes). However, a 20% failure rate predicts that the subsequent run would see 2.2 failures. Since a variation can only pass or fail, I interpret that as either two or three failures. But there was only one failure! What happened?

Let me explain a little bit about the failure that is showing up in these tests. Every once in a while the system times out. The client makes a request to the server and the connection appears to just hang, time out, and then fail. The only hunch that we have is that this is somehow related to load on the system. When the tests are run, there are about 30 of the test variations running at once, in addition to any other test jobs that might be running at the same time. The hypothesis is that all of the running tests are somehow causing things to slow down and, at times, time out. This is bolstered by the fact that when we run any one of the failing tests on its own, outside of the full run of all of the variations, we don’t see the errors.

So let’s go back to those numbers. If this really is some sort of load problem on the testing infrastructure, then rerunning just the smaller number of variations (11) might have changed the probability of encountering the problem. Instead of a 20% probability, which predicts seeing about 2 failures, the rerun showed 1 failure out of 11, or about 9%. That seems like another data point that points toward this being something related to the load on the test infrastructure.