Saturday, 25 October 2014

It's hard to test software: even simple software!

Tetris is one of the best-known computer games ever made. It's easy to play but hard to master, and it's based on an NP-hard problem.

But that's not all that's difficult about it. Though it's a simple game that can be implemented in one line of BBC BASIC, it's complex enough to be really hard to thoroughly test.

Ideally, a game tester has to try every possible action, in order to be sure that the game works correctly whatever the player does. But even in a simple game, there is so much to test!

Recently my employer Rapita Systems released a tool demo in the form of a modified game of Tetris. Unlike "normal" Tetris, the goal is not to get a high score by clearing blocks, but rather to get a high code coverage score. To get the perfect score, you have to cause every part of the game's source code to execute. When a statement or a function executes during a test, we say it is "covered" by that test.
Trying to test the game of Tetris is not as easy as it seems.

About 70% of the code statements within this version of Tetris run all the time. Play for a few seconds, and you've tested most of the code. It's the remaining 30% that's hard to find.

Unless you use a coverage tool, you may not even be aware that this code hasn't been tested. You'll know that some parts probably haven't been tested very well - that's common sense. But until the coverage tool tells you, you don't know.

With a coverage tool, you can solve the puzzle. It tells you where coverage is missing, so you can figure out what more is needed. It's not an easy puzzle. It combines programming and debugging skills with knowledge of the game.

A few minutes playing this Tetris should be enough to convince you that it's non-trivial to get full coverage. Intuitively, you'd expect to get it just by playing for long enough, but you'd have to be very lucky, because some of the cases you need to test are extremely rare.

For example, here's one case you're unlikely to hit by accident. Here is part of the coverage report (produced when you press F4):

This procedure, updateScore4, is called when 4 lines are cleared at once. Your score increases in proportion to the game's level (line 325). That's not uncommon, but there's something else here. If you clear 4 lines twice in a row, then the statement on line 327 also executes, and your score is increased by a big bonus: a combo. There are equivalent procedures for 3 lines, 2 lines and 1 line.
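For illustration, the combo logic described above might look something like the following C sketch. The identifiers, point values and global variables here are my own invention, not the demo's actual source; only the shape of the logic (a base award plus a rarely-executed back-to-back bonus) follows the description.

```c
#include <stdbool.h>

static int g_level = 1;                 /* current game level */
static int g_score = 0;                 /* accumulated score */
static bool g_last_clear_was4 = false;  /* did the previous clear remove 4 lines? */

/* Hypothetical sketch of updateScore4. */
void update_score4(void)
{
    g_score += 800 * g_level;          /* base award, proportional to level (cf. line 325) */
    if (g_last_clear_was4) {
        g_score += 1200 * g_level;     /* back-to-back bonus: this branch is the
                                          hard-to-cover one (cf. line 327) */
    }
    g_last_clear_was4 = true;
}
```

A test suite that clears 4 lines only once per game never takes the bonus branch, which is exactly why ordinary play leaves it uncovered.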

This is not the only unusual special case in Tetris. For instance, there are special cases for rotating a block when it's next to a wall ("wall kick") or next to other blocks.

Full coverage means executing all of these cases.

If you're testing a game, you may have played it for hours, days, or even weeks, and still not have covered all of the unusual, rarely-reached special cases. What if line 327 has a critical bug that crashes the game, or resets the score? Perhaps one player in a hundred will occasionally notice it, and may not even know what happened.

The coverage report tells you what's left to test, so it leads you straight there.

At Rapita, we were thinking about nice ways to show software developers why code coverage is a hard problem that requires good tools. We thought it had to be a game of some sort, preferably a highly recognisable and well-known game that anyone could pick up and play. We ran through some possibilities, and Tetris was chosen because it's simple, very well-known, and yet it's very hard to test.

Coverage testing is not usually applied to games. Our customers tend to be makers of aircraft or car parts. Both businesses have strict safety standards which involve coverage testing, and our tools help you produce the relevant reports for certification, like DO-178B for the aviation industry.

However, all software can benefit from coverage testing, safety critical or not. How else do you know what remains to be tested? How else do you know where bugs may be lurking, in rarely-used code? As a programmer I would like to think my code is bug free, but I know it often isn't, and I want to know about bugs now, rather than next week.

There are quick-and-dirty ways to tell if particular functions or statements are covered. If you are a programmer, you can probably think of a few. But a real debugger is preferable to filling your program with "printf" statements; and likewise, a real coverage tool is preferable to filling a program with other ways to detect coverage, such as breakpoints and log messages ("Tested function main()!"). A proper coverage tool saves you lots of time, and also handles the sorts of coverage that are non-trivial to implement by hand, such as MC/DC (Modified Condition/Decision Coverage).
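To show what the quick-and-dirty approach looks like (and why it doesn't scale), here is a hand-instrumented C sketch. The MARK macro, the clamp function and the counter array are all invented for illustration; a real tool inserts and reports these markers automatically.

```c
/* Hand-rolled statement-coverage markers: each MARK(id) records that
   a particular point in the code was reached. */
#define NUM_POINTS 4
static int hit[NUM_POINTS];        /* one hit counter per marker */
#define MARK(id) (hit[(id)]++)

int clamp(int x, int lo, int hi)
{
    MARK(0);
    if (x < lo) { MARK(1); return lo; }
    if (x > hi) { MARK(2); return hi; }
    MARK(3);
    return x;
}
/* After the tests run, any hit[i] still at zero points at untested code. */
```

Maintaining markers like these by hand across a real codebase is tedious and error-prone, and they say nothing about condition or decision coverage.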

I had thought that it would be easy to get complete coverage of Tetris. I was quite wrong. The example demonstrates an old lesson about programming: simplicity is an illusion, and you typically spend more time debugging and testing than actually writing code. Try it out for yourself! Download the demo here, and tell us if you get a perfect score.


  1. That's a very cool idea! Would be nice for an introductory CS course to repeat this with even simpler games. Thanks for sharing it!

    1. I agree, and if a professor were interested in using this within a course, I'm sure we'd be happy to help.

  2. Good to see! At Crytek, I used a similar approach for testing game AI on a large codebase, but with explicit markers rather than automatic coverage metrics, more feedback for testers and more tool support. One place you can read some details is

  3. 100% code coverage is the execution of every machine language instruction through every possible path to that instruction, and having reproduced every possible configuration of data bits in memory at every value of the program counter. What percentage coverage do you have?

    Individual statement executions don't generate value from code; execution paths do. Function, statement and branch coverage don't tell you much about the sequences of execution. Statement coverage tells you almost nothing about your path coverage. You would get the same result if you executed every statement in random order, independent of its execution context. (By the way, unit testing is a rough approximation of the same thing, relative to system testing.) For any reasonable semantics of functional assurance you need to use slicing techniques.

    Testing isn't about tools, but about thinking and about continuous and relentless exploration. Read Bach. Read Weinberg. Think. And please don't teach students this stuff. We need to teach them to think, not to run tools.

    1. The definition of code coverage that I have used here is the generally accepted one. It is not claimed here, or anywhere else, that 100% code coverage means that a program has no bugs and has been fully tested: no practical technique is capable of such a thing.

      It's been a long time since I taught any students anything, but I'd say that coverage testing absolutely could form part of a software engineering course. I am sure that any CS professor would be well qualified to point out the disadvantages of these techniques as well as their benefits.

    2. I am not impressed that what you say aligns with common perception, or, in this case, with common misplaced hope. It is simply wasteful. If you want to impress an academic (I can play the academic game very well and have the qualifications to sustain it) give me a formal argument that it works. Give me data. Give me even a compelling model of why it works at all, and then of how well it works relative to just reading code. My point is that developers' mental models of coverage don't correlate to program complexity and, as such, lead to far overblown expectations about coverage.

      Can you prove that 100% code coverage as defined here will in fact find any bugs, given that many are present? Can you justify the orthogonality of mental models of coding and coverage testing that contribute to the elimination of confirmation bias?

      I have framed out some models here that don't bode well for coverage. The information from line and branch coverage gives you infinitesimal information about how well you have explored the program.

      As with many other techniques, I suspect that most of the benefit comes from just getting the developer into the code to reason about it. If we asked them to study the average length of variable names in the code it's likely to do just as well.

      Bring your real arguments and I'll be attentive. But I will not be a lemming. :-)

  4. (By the way, this blog site seems to be very poorly tested. It just silently discarded an entire post of mine because of poor interaction between the login logic and the posting logic. I'm sure that each one enjoyed 100% coverage testing.)

  5. Sadly, test coverage is just a number and says nothing about test quality. One obvious flaw is test data: I could test every line with meaningless data. Also, the output could be wrong and I'd still get 100%. I agree with Cope. Teach this as an example of things you shouldn't trust.

    1. Who said that test coverage was a guarantee of test quality? The "obvious" flaw that you and Cope both pointed out is, erm, obvious.

      You can trust our coverage tool to do exactly what it says it will do.

    2. Do we have any evidence that it is even an indicator of the exploration of program behaviour? What it says it will do (by formal reduction of the technique) is explore an infinitesimal fraction of that space.

      I can hardly contain my joy at what this portends for quality.

  6. RE "Can you prove that 100% code coverage as defined here will in fact find any bugs, given that many are present?" If the code is never executed, then you have no chance of finding a bug in the (never executed) code. (However your point is accepted, that just because code is executed or has 100% coverage, then bugs are discovered..)

    1. I think you contradict yourself here. Can you clarify?

  7. I'd like to explain how code coverage is used in safety critical software systems. There it's pretty mandatory, but used in a way that's not obvious at first.

    For DO-178B (guidelines for airborne software) you do requirements-based implementation and testing. That is, you write your code AND tests from a requirements specification. And by implication, the tests should exercise all the code. If this is not the case then you have either written something that's not in the requirements, or your tests are not adequately testing your requirements.

    In other words, coverage is not about finding bugs, but it's about making sure your tests are adequate (in some sense). Imagine a triangle: requirements, code, tests. The coverage is a process that provides evidence that these three are joined together.

    Remember that it's the test suite which determines if the code is executing "correctly" (as per specification). There's a whole set of good practice for how to write tests, particularly about how to pick up data dependent behaviour (e.g. corner cases and so on).

    In safety critical code, the concept of "testable" code is important. Coding guidelines and so on can help to ensure that the implementation does not contain funnies that are hard to test. So, if there is a branch in the code that causes some data dependent execution, this is in the requirements, and hence the tests and the code.

    As for defining "tests are adequate" - this pretty much comes down to a cost/benefit tradeoff, which depends on the criticality of the software (i.e. defined by consequences of failure). For something with a low criticality, 100% statement coverage is a reasonable level of coverage ("There is no line of code that hasn't been tested by my carefully designed test suite"). For high criticality software, harder metrics like MC/DC have been shown to be beneficial in improving testing.

    The (really nice!) Tetris app shows MC/DC test coverage, which is required for DO-178B level A (the most critical assurance level), and is now also relevant to automotive software because of ISO 26262. MC/DC basically means that as well as testing every statement, you also show that each condition within a decision independently affects the decision's outcome.
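    To make the MC/DC idea concrete, here is a tiny invented example (not from the Tetris demo):

    ```c
    #include <stdbool.h>

    /* For the decision (a && b), MC/DC requires a test set in which
       each condition is shown to independently flip the outcome. */
    bool decision(bool a, bool b)
    {
        return a && b;
    }

    /* Three tests suffice:
         (a=1, b=1) -> 1   \  toggling b alone flips the outcome
         (a=1, b=0) -> 0   /
         (a=0, b=1) -> 0   <- toggling a alone flips it (vs. 1,1)
       Mere branch coverage could be satisfied by (1,1) and (0,0),
       which never demonstrates b's independent effect. */
    ```

    In general a decision with N independent conditions needs at least N+1 tests for MC/DC, which is why it's considered a demanding (but achievable) objective.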

    This is how it's worked for a long time in aerospace, and there aren't many planes falling out of the sky because of software bugs (although requirements specification issues, perhaps!)

    It's also worth commenting a bit on the philosophy of critical software processes. DO-178B for example does not guarantee a perfect implementation, but provides a set of processes that makes sure that you try really hard and think about things carefully. Ensuring that you get MC/DC coverage from your test suite is one of the (many) things that you could do to get your test suite up to scratch.