Carpet testing

Carpet bombing is a "large aerial bombing done in a progressive manner to inflict damage in every part of a selected area of land." Similarly, carpet testing is done by progressively tossing random data samples at your code without regard for its internal structure, hoping that sufficient amount of data will eventually cover it all.

Or at least, this is the analogy that kept floating up in my mind while I was refactoring discovery testing in python-openid. This post is a part of the series.

Examples here are rather long to better show the point I want to make. But at the same time they aren't complicated and aren't supposed to be followed line by line anyway.

Simple example

class TestIsOPIdentifier(unittest.TestCase):
    def setUp(self):
        self.endpoint = discover.OpenIDServiceEndpoint()

    def test_none(self):
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid1_0(self):
        self.endpoint.type_uris = [discover.OPENID_1_0_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid1_1(self):
        self.endpoint.type_uris = [discover.OPENID_1_1_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid2(self):
        self.endpoint.type_uris = [discover.OPENID_2_0_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid2OP(self):
        self.endpoint.type_uris = [discover.OPENID_IDP_2_0_TYPE]
        self.assertTrue(self.endpoint.isOPIdentifier())

    def test_multipleMissing(self):
        self.endpoint.type_uris = [discover.OPENID_2_0_TYPE,
                                   discover.OPENID_1_0_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_multiplePresent(self):
        self.endpoint.type_uris = [discover.OPENID_2_0_TYPE,
                                   discover.OPENID_1_0_TYPE,
                                   discover.OPENID_IDP_2_0_TYPE]
        self.assertTrue(self.endpoint.isOPIdentifier())

This whole test case is dedicated to this one method:

def isOPIdentifier(self):
    return OPENID_IDP_2_0_TYPE in self.type_uris

That's right. Seven tests to test an in operation on a list. This might be justifiable for a "black box" testing of unknown code. But here it shouldn't be more complicated than testing just two states in a single test.

Generated example

This code is more OpenID-specific but the problem is still the same. As soon as you start writing too much of not too different code you want to automate it and naturally come to generated tests:

@gentests
class Discover(unittest.TestCase):
    data = [
        ("equiv",               (True, "equiv", "equiv" , "xrds")),
        ("header",              (True, "header", "header" , "xrds")),
        ("lowercase_header",    (True, "lowercase_header", "lowercase_header" , "xrds")),
        ("xrds",                (True, "xrds", "xrds" , "xrds")),
        ("xrds_ctparam",        (True, "xrds_ctparam", "xrds_ctparam" , "xrds_ctparam")),
        ("xrds_ctcase",         (True, "xrds_ctcase", "xrds_ctcase" , "xrds_ctcase")),
        ("xrds_html",           (False, "xrds_html", "xrds_html" , "xrds_html")),
        ("redir_equiv",         (True, "redir_equiv", "equiv" , "xrds")),
        ("redir_header",        (True, "redir_header", "header" , "xrds")),
        ("redir_xrds",          (True, "redir_xrds", "xrds" , "xrds")),
        ("redir_xrds_html",     (False, "redir_xrds_html", "xrds_html" , "xrds_html")),
        ("redir_redir_equiv",   (True, "redir_redir_equiv", "equiv" , "xrds")),
        ("404_server_response", (False, "404_server_response", None , None)),
        ("404_with_header",     (False, "404_with_header", None , None)),
        ("404_with_meta",       (False, "404_with_meta", None , None)),
        ("500_server_response", (False, "500_server_response", None , None)),
    ]

    @mock.patch('openid.fetchers.fetch', fetch)
    def _test(self, success, input_name, id_name, result_name):
        input_url, expected = discoverdata.generateResult(
            BASE_URL,
            input_name,
            id_name,
            result_name,
            success,
        )
        if expected is None:
            self.assertRaises(urllib.error.HTTPError, discover, input_url)
        else:
            result = discover(input_url)
            self.assertEqual(input_url, result.request_uri)
            self.assertEqual(result.__dict__, expected.__dict__)

Actual test methods are generated from self.data, each calling self._test with provided arguments. The test function compares a generated expected result with a result returned from a real discover function which is being tested.

It seems reasonable at first, but there's quite a few things wrong with it that nobody can see:

For starters, there are two distinct code paths — for successes and failures — that share nothing in common. Yes, even the seemingly "common" call to generateResult actually has its own if that returns completely different result for failures.
Though these tests are grouped together, they in fact test very different parts of OpenID discovery process: request headers, response headers, HTML <meta> overrides, etc. But instead of being asserted directly those are implied by the success of the overall discovery process. Which makes debugging more complicated: instead of saying: "you don't recognize this content type", your test will say: "uhm... the discovery is broken somewhere".
A few of those tests are completely useless — guess which ones? Redirects. Redirects are never even exposed to the client code, they're handled by the HTTP library. So what we're testing here is our own HTTP mock, not the discovery process.
Some of those tests hide actual bugs. For example tests 3, 4 and 5 should test, in theory, that we recognize an XRDS-formatted response even with slight variations in its Content-type header. But they don't even call the method isXRDS() responsible for this, they implicitly compare the content_type attributs of the expected and the actual results that — and here's the best part — are being generated from the same test sample!

This last bit is, by the way, makes all of those tests rather useless. Probably out of desire to reuse more code both the generateResult() and the mock fetch() ultimately read the same data file.

Killer example

I won't even paste it here as it is too big.

It's a hierarchy of a base class and three descendants accompanied by two mock fetchers. Test methods assert many different things but use a single complicated generalized test function from the base. You probably want to have a look at this test function _checkService() and at The Boss itself, the class called TestDiscovery.

All tests work approximately the same way:

call discover (with additional checks)
feed the result to self._checkService with a whole lot of control values that tell it which code path to take

In addition to all the problems that I already mentioned it has another rather obvious one: it's big. Apparently, because of the many subtle differences in individual tests the generalized testing code was becoming too complicated and made it impossible to converted this test case into a generated one. (This is just my hypothesis.)

Don't do carpet testing. It doesn't make tests "more complete" but makes it harder to reason about them, leading to hidden bugs.

Refactoring it

BOOOOORING!!!! It took me several days and a lot of patience :-).

Deconstruction of such a thing may seem an insurmountable task at first. And I made a few false starts trying to do too much too soon.

My general approach is to start reading through the file looking for obvious, easy to fix code smell, like moving a repeated function call into one place or killing a class attribute that does the job of a local variable. Dealing with those removes some code but more importantly gives you a better understanding of its scope and intentions.

Then you start noticing corner cases that differ from the general shape the most, remove their dependency on the common parts and then extract them. This groups uniform bits of code together and removes code dealing with the differences.

And then at some magical point you suddenly realize that all similar looking code is completely identical and then you just remove all the repetitions in one big swoop.

And never, ever try to "just rewrite" the entire thing from scratch!

urlopen mock

One thing that allowed me to kill a lot of custom mocking code is the generalized mock for urlopen() that I now use in all the tests. It can serve regular files from the designated test directory with the correct Content-type determined from extensions. You can also use query parameters to ask it for a specific status code or a response header:

query = {'status': 400, 'header': 'X-XRDS-Location: http://...'}
url = 'http://unittest/test-sample.html?' + urlencode(query)

I wonder if someone has already done that before?

Comments: 2

Yury Yurevich

From what I see, almost all examples fall to one or another category of smell tests from "xUnit Test Patterns". Smells from the book are highlighted by italic.

The first one is something that I cannot find in the book. Usually it's other way around — people don't write tests. In this case it looks like tests are over-thought, from my point of view.

Second example is Assertion Roulette (you cannot find what fired assert just by looking into the test results) caused by Multiple Test Conditions (test tries to apply logic to many sets of input values) and Conditional Verification Logic (test runs different asserts depending on context).

Third one is probably Untestable Test Code (test is not simple enough to be sure it is implemented correct) + Eager Test (test verifies too much) + Conditional Verification Logic (test runs different asserts depending on context).

Clearly, you can improve tests there, but at first sight, there are not that bad. I've seen much-much worse bad-smelling tests. If you're still motivated to refactor tests then I highly recommend the book — it helped me a lot to articulate to other people what was exactly wrong with their tests :)
Ivan Sagalaev

Thanks for the analysis and for the pointer! Indeed, all the cases are slightly different and the patterns you mentioned look spot on.

I also want to note that I'm in a kinda luxury position here: I don't have to convince anyone to change the code :-)