Carpet bombing is a "large aerial bombing done in a progressive manner to inflict damage in every part of a selected area of land." Similarly, carpet testing is done by progressively tossing random data samples at your code without regard for its internal structure, hoping that a sufficient amount of data will eventually cover it all.
Or at least, this is the analogy that kept floating up in my mind while I was refactoring discovery testing in python-openid. This post is a part of the series.
Examples here are rather long to better show the point I want to make. But at the same time they aren't complicated and aren't supposed to be followed line by line anyway.
Simple example
class TestIsOPIdentifier(unittest.TestCase):
    def setUp(self):
        self.endpoint = discover.OpenIDServiceEndpoint()

    def test_none(self):
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid1_0(self):
        self.endpoint.type_uris = [discover.OPENID_1_0_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid1_1(self):
        self.endpoint.type_uris = [discover.OPENID_1_1_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid2(self):
        self.endpoint.type_uris = [discover.OPENID_2_0_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_openid2OP(self):
        self.endpoint.type_uris = [discover.OPENID_IDP_2_0_TYPE]
        self.assertTrue(self.endpoint.isOPIdentifier())

    def test_multipleMissing(self):
        self.endpoint.type_uris = [discover.OPENID_2_0_TYPE,
                                   discover.OPENID_1_0_TYPE]
        self.assertFalse(self.endpoint.isOPIdentifier())

    def test_multiplePresent(self):
        self.endpoint.type_uris = [discover.OPENID_2_0_TYPE,
                                   discover.OPENID_1_0_TYPE,
                                   discover.OPENID_IDP_2_0_TYPE]
        self.assertTrue(self.endpoint.isOPIdentifier())
This whole test case is dedicated to this one method:
def isOPIdentifier(self):
    return OPENID_IDP_2_0_TYPE in self.type_uris
That's right: seven tests to test an in operation on a list. This might be justifiable for "black box" testing of unknown code, but here it shouldn't take more than checking both states in a single test.
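A minimal sketch of that single test, reusing the endpoint class and type constants from the test case above:

class TestIsOPIdentifier(unittest.TestCase):
    def test_isOPIdentifier(self):
        endpoint = discover.OpenIDServiceEndpoint()

        # no OP identifier type among the type URIs
        endpoint.type_uris = [discover.OPENID_2_0_TYPE]
        self.assertFalse(endpoint.isOPIdentifier())

        # the OP identifier type is present
        endpoint.type_uris.append(discover.OPENID_IDP_2_0_TYPE)
        self.assertTrue(endpoint.isOPIdentifier())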
Generated example
This code is more OpenID-specific, but the problem is still the same. As soon as you find yourself writing a lot of nearly identical code, you want to automate it and naturally arrive at generated tests:
@gentests
class Discover(unittest.TestCase):
    data = [
        ("equiv", (True, "equiv", "equiv", "xrds")),
        ("header", (True, "header", "header", "xrds")),
        ("lowercase_header", (True, "lowercase_header", "lowercase_header", "xrds")),
        ("xrds", (True, "xrds", "xrds", "xrds")),
        ("xrds_ctparam", (True, "xrds_ctparam", "xrds_ctparam", "xrds_ctparam")),
        ("xrds_ctcase", (True, "xrds_ctcase", "xrds_ctcase", "xrds_ctcase")),
        ("xrds_html", (False, "xrds_html", "xrds_html", "xrds_html")),
        ("redir_equiv", (True, "redir_equiv", "equiv", "xrds")),
        ("redir_header", (True, "redir_header", "header", "xrds")),
        ("redir_xrds", (True, "redir_xrds", "xrds", "xrds")),
        ("redir_xrds_html", (False, "redir_xrds_html", "xrds_html", "xrds_html")),
        ("redir_redir_equiv", (True, "redir_redir_equiv", "equiv", "xrds")),
        ("404_server_response", (False, "404_server_response", None, None)),
        ("404_with_header", (False, "404_with_header", None, None)),
        ("404_with_meta", (False, "404_with_meta", None, None)),
        ("500_server_response", (False, "500_server_response", None, None)),
    ]

    @mock.patch('openid.fetchers.fetch', fetch)
    def _test(self, success, input_name, id_name, result_name):
        input_url, expected = discoverdata.generateResult(
            BASE_URL,
            input_name,
            id_name,
            result_name,
            success,
        )
        if expected is None:
            self.assertRaises(urllib.error.HTTPError, discover, input_url)
        else:
            result = discover(input_url)
            self.assertEqual(input_url, result.request_uri)
            self.assertEqual(result.__dict__, expected.__dict__)
Actual test methods are generated from self.data, each calling self._test with the provided arguments. The test function compares a generated expected result with the result returned from the real discover function being tested.
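The gentests decorator itself isn't shown in the post; purely as an illustration of the technique, a minimal version of the idea could look roughly like this:

def gentests(cls):
    """Create a test_<name> method for every entry in cls.data."""
    def make_test(args):
        def test(self):
            self._test(*args)
        return test
    for name, args in cls.data:
        setattr(cls, 'test_' + name, make_test(args))
    return cls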
This approach seems reasonable at first, but there are quite a few things wrong with it that nobody can see:
- For starters, there are two distinct code paths, for successes and failures, that share nothing in common. Yes, even the seemingly "common" call to generateResult actually has its own if that returns a completely different result for failures.
- Though these tests are grouped together, they in fact test very different parts of the OpenID discovery process: request headers, response headers, HTML <meta> overrides, etc. But instead of being asserted directly, those are implied by the success of the overall discovery process. Which makes debugging more complicated: instead of saying "you don't recognize this content type", your test will say "uhm... the discovery is broken somewhere".
- A few of those tests are completely useless. Guess which ones? Redirects. Redirects are never even exposed to the client code; they're handled by the HTTP library. So what we're testing here is our own HTTP mock, not the discovery process.
- Some of those tests hide actual bugs. For example, tests 3, 4 and 5 (xrds, xrds_ctparam, xrds_ctcase) should test, in theory, that we recognize an XRDS-formatted response even with slight variations in its Content-type header. But they don't even call the method isXRDS() responsible for this; they implicitly compare the content_type attributes of the expected and actual results which, and here's the best part, are generated from the same test sample! (A more direct test is sketched below.)
This last bit, by the way, is what makes all of those tests rather useless. Probably out of a desire to reuse more code, both generateResult() and the mock fetch() ultimately read the same data file.
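For comparison, here is a sketch of what testing that behavior directly might look like. The import path, the DiscoveryResult constructor and its attributes are my assumptions about the library's internals, not something taken from the code above:

import unittest
from openid.yadis.discover import DiscoveryResult  # assumed location

class TestIsXRDS(unittest.TestCase):
    def _result(self, content_type):
        # Assumed API: DiscoveryResult takes the request URI and exposes
        # a content_type attribute consulted by isXRDS().
        result = DiscoveryResult('http://unittest/')
        result.content_type = content_type
        return result

    def test_plain(self):
        self.assertTrue(self._result('application/xrds+xml').isXRDS())

    def test_with_parameter(self):
        self.assertTrue(
            self._result('application/xrds+xml; charset=UTF-8').isXRDS())

    def test_html(self):
        self.assertFalse(self._result('text/html').isXRDS())

A test like this would tell you immediately whether Content-type variations are actually recognized, instead of hiding the answer inside a full discovery run.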
Killer example
I won't even paste it here as it is too big.
It's a hierarchy of a base class and three descendants accompanied by two mock fetchers. Test methods assert many different things but use a single complicated generalized test function from the base. You probably want to have a look at this test function _checkService() and at The Boss itself, the class called TestDiscovery.
All tests work approximately the same way:
- call discover (with additional checks)
- feed the result to self._checkService with a whole lot of control values that tell it which code path to take
In addition to all the problems I've already mentioned, it has another, rather obvious one: it's big. Apparently, because of the many subtle differences between individual tests, the generalized testing code was becoming too complicated, which made it impossible to convert this test case into a generated one. (This is just my hypothesis.)
Don't do carpet testing. It doesn't make tests "more complete"; it just makes them harder to reason about, which leads to hidden bugs.
Refactoring it
BOOOOORING!!!! It took me several days and a lot of patience :-).
Deconstruction of such a thing may seem an insurmountable task at first. And I made a few false starts trying to do too much too soon.
My general approach is to start reading through the file looking for obvious, easy-to-fix code smells, like a repeated function call that could be moved into one place or a class attribute that does the job of a local variable. Dealing with those removes some code but, more importantly, gives you a better understanding of its scope and intentions.
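To make that concrete, here's a made-up before/after of the second kind of smell. This is illustrative only, not actual python-openid code; BASE_URL and discover are borrowed from the examples above:

# Before: a class attribute doing the job of a local variable.
class TestSomething(unittest.TestCase):
    def test_one(self):
        self.url = BASE_URL + 'one.html'    # needlessly stored on self
        result = discover(self.url)
        self.assertEqual(self.url, result.request_uri)

# After: a plain local variable is enough.
class TestSomething(unittest.TestCase):
    def test_one(self):
        url = BASE_URL + 'one.html'
        result = discover(url)
        self.assertEqual(url, result.request_uri)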
Then you start noticing the corner cases that differ the most from the general shape, removing their dependency on the common parts and extracting them. This groups the uniform bits of code together and removes the code dealing with the differences.
And then, at some magical point, you suddenly realize that all the similar-looking code is in fact completely identical, and you can remove all the repetition in one fell swoop.
And never, ever try to "just rewrite" the entire thing from scratch!
urlopen mock
One thing that allowed me to kill a lot of custom mocking code is a generalized mock for urlopen() that I now use in all the tests. It can serve regular files from a designated test directory with the correct Content-type determined from their extensions. You can also use query parameters to ask it for a specific status code or response header:
query = {'status': 400, 'header': 'X-XRDS-Location: http://...'}
url = 'http://unittest/test-sample.html?' + urlencode(query)
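Here's a rough sketch of how such a mock could be put together. The directory layout, the helper name fake_urlopen and the use of addinfourl are my assumptions for illustration, not the actual implementation:

import mimetypes
import os
import urllib.error
from email.message import Message
from io import BytesIO
from urllib.parse import parse_qs, urlsplit
from urllib.response import addinfourl

# Hypothetical location of the test fixtures.
TEST_DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')

def fake_urlopen(url, *args, **kwargs):
    """Serve files from TEST_DATA_DIR instead of hitting the network."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    path = os.path.join(TEST_DATA_DIR, parts.path.lstrip('/'))

    # Content-type is guessed from the file extension.
    content_type, _ = mimetypes.guess_type(path)
    headers = Message()
    headers['Content-type'] = content_type or 'application/octet-stream'

    # Extra response headers requested via ?header=Name: value
    for header in query.get('header', []):
        name, _, value = header.partition(': ')
        headers[name] = value

    # A status requested via ?status=NNN turns into an HTTPError,
    # which is what urlopen itself would raise.
    status = int(query.get('status', ['200'])[0])
    if status >= 400:
        raise urllib.error.HTTPError(url, status, 'Error', headers, BytesIO(b''))

    with open(path, 'rb') as f:
        body = BytesIO(f.read())
    return addinfourl(body, headers, url, status)

In the tests it would be patched the same way as fetch above, for example with mock.patch('urllib.request.urlopen', fake_urlopen); the exact patch target depends on how the fetcher module imports urlopen.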
I wonder if someone has already done that before?
Comments: 2
From what I see, almost all the examples fall into one or another category of test smells from "xUnit Test Patterns". Smells from the book are in italics.
The first one is something I cannot find in the book. Usually it's the other way around: people don't write tests. In this case it looks like the tests are over-thought, from my point of view.
The second example is Assertion Roulette (you cannot tell which assert fired just by looking at the test results), caused by Multiple Test Conditions (the test tries to apply logic to many sets of input values) and Conditional Verification Logic (the test runs different asserts depending on context).
The third one is probably Untestable Test Code (the test is not simple enough to be sure it is implemented correctly) + Eager Test (the test verifies too much) + Conditional Verification Logic (the test runs different asserts depending on context).
Clearly, you can improve the tests there, but at first sight they are not that bad. I've seen much, much worse-smelling tests. If you're still motivated to refactor tests, then I highly recommend the book; it helped me a lot to articulate to other people what exactly was wrong with their tests :)
Thanks for the analysis and for the pointer! Indeed, all the cases are slightly different and the patterns you mentioned look spot on.
I also want to note that I'm in a kinda luxury position here: I don't have to convince anyone to change the code :-)