2014-07-18

What every developer should know about testing - part 1

The short version of this blog post is that this week and next week of work on pyVmomi will be spent on radically improving the testing code and processes. This testing process will become part of the commit cycle. And, it's about damn time.

Once these measures are in place over the next few days the project will be better able to absorb new commits from interested parties.

Overview 

Over the last three weeks I've been working on pyVmomi's next release. If you've not been following along, pyVmomi is a Python client for a SOAP server, with a code base dating back to at least Python 2.3, but it has only been in the wild as OpenSource since December 2013. The pyVmomi library has a long internal history at VMware and has served many teams well over the years. But, it does have problems.

The OpenSource version of the library ships with *no* unit tests, and this makes it hard for interested third parties to contribute. It's time to change that. But, it's a client library for a network service. How do you test such a beast?

In this post I will cover what unit testing and integration testing are, and how each impacts the design choices made in a library. This discussion is directly applicable to the long-term evolution of a library like pyVmomi and is generally applicable to software design. I'm bothering to write all this down because I want everyone to be on the same page and I don't particularly want to repeat myself too much. Over the years that I expect to be involved with this project, I expect to point back at this post frequently.

The Problem

Work on pyVmomi has been rather painful. For much of it, I have had to spend vast amounts of time deep in the debugger. Testing this library involves building a VCSIM and simulating a vCenter environment. This in turn means creating a suitable inventory and potentially setting up a suitable Fault to work with. That is a lot of yak shaving to get to the point where you can even start doing development work.

The root of the specific problem

The specific problem is that pyVmomi, as a library, speaks to a server. Nothing can completely simulate all the inputs, outputs, and exposed states of that server except the big complex thing itself. This problem is routinely solved by developers across the cloud infrastructure space by spinning up virtual environments to create the scenarios they need.

This is a natural inclination: when you have a beautiful hammer, why not nail all the bugs with it? Virtualization is powerful and has transformed our industry. One day I will be an old man telling stories of how infrastructure and development worked in the bad old days of the Dot-Com Boom, but this inclination is an example of the hype cycle in full detrimental effect.

The problem in general

Because client library code development starts at the integration phase, the units that end up defined by the client library programmer are inherently integrations. How do you test integrations? With integration tests. But how do you do integration testing when the thing you are testing isn't even on your build machine? If it's a server (as in our case), you have to fall back to either a simulator or standing up a whole new copy of your environment just for testing.

Unsurprisingly, this is fairly standard practice for every step of IaaS and PaaS development. You stand up the universe to author a new function, then you retest the whole thing on a fresh copy of the universe. Then you wash-rinse-repeat for the whole integrated system. It's so easy. It's also so very wrong, because code that is hard to test (or completely untestable) in isolation is poorly designed. If you're defending the fact that it's tested, you're missing the point.



This isn't just a problem with the one library I'm working on now. I've seen this repeatedly in development environments of every kind at huge shops and tiny shops. You build up a pile-o-software that glues systems together and to test it you build a pile-o-infrastructure that you bring to a pile-o-state so you can validate the right calls and responses.

When you test this way (bringing a whole simulated universe into existence to test your new 'if' statement), invariably something's state gets out of sync, and what do you do? You have to test the test environment to validate that you don't have false positives in your failure report, then you have to retest, and you restart the whole process, which typically grows into hours. This is, frankly, an extremely expensive way to develop software.

And, for the record, I've seen this in JEE, Spring, Grails, Python, Bash, Perl, C, and C++ projects on Solaris, Linux, Irix, BSD, and now ESX-based environments. This is not a problem unique to those crappy developers on that crappy environment. This is an intrinsic integration development problem that crops up when you routinely write code that takes big complex systems and makes them work together. It's far too easy a trap to fall into and far too difficult a pit to climb out of.

Unit Testing?

So the story so far is that we have a library, and maybe that library talks to things "not of this machine". Maybe it speaks over the wire to other things we can't see or directly control. These are things well outside of anything we could define as our unit of code. So if that's our fundamental unit (because what *else* is something like pyVmomi?), how the heck do we test it?


http://youtu.be/G2y8Sx4B2Sk

The term unit is deliberately ambiguous in this context. Did we mean class? Did we mean method? The answer is: it depends. Getting the logical border of the unit right is hard. It's actually human-intelligence hard. It's "why AI does not yet write code" hard. Why is it hard? It's hard in the same way that making beautiful art is hard: it's a fuzzy problem that requires aesthetics.

Defining where a unit is, is hard and simultaneously critical to get right. Define the wrong unit and pay the price. This doesn't mean testing is wrong; it just means testing is a programming-hard problem. Looking for easy answers here just means you don't know what you're doing.
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
        — Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style.

I mock your Mocks!

A simple answer to the problem is to Mock and Stub everything. (Mock objects are not Stubs, and you should know the difference. Edit: this is a whole topic unto itself and I will cover it separately at a later date.) The problem is that when you work with sufficiently complex interactions you will be forced to write sufficiently complex mocks and stubs. You will back your way into the simulator problem I mentioned before. In our specific case this means essentially re-inventing something like VCSIM, except built entirely out of Python mock objects, and that's absurd.

What are you forging, o' library author?

Consider also: where is the unit boundary when it comes to a client library? The library absolutely has a boundary at its public interfaces. If the bit-o-code you just wrote is private, why did you write it? A private method is not part of the interface, and therefore it's not part of the library's unit definition. The unit in this context only makes sense as a test-first tested component if it's going to be exposed. By definition a private method isn't exposed, so it's an implementation detail, and we don't test our programs to make sure implementation details work. We don't test whether the 'if' statement works. So where is the detail and where is the interface?

This means your test-first tests should be written against your surface. To develop a library that you intend to provide for people to interact with, you should model sets of expected interactions. Each interaction should then be codified as a set of calls to the library, as in the sketch below.
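Here is a minimal sketch, assuming a plain unittest layout, of what one such codified interaction might look like. The host name, credentials, and inventory traversal are hypothetical placeholders, and run against a live server this is still an integration test; the point is that the test describes the interaction through the public surface rather than any implementation detail behind it.

# A minimal sketch: one expected interaction (connect, list VMs, disconnect)
# written against the library's public surface. The host, user, password,
# and inventory layout are hypothetical placeholders.
import unittest

from pyVim.connect import SmartConnect, Disconnect


class ListVirtualMachinesInteraction(unittest.TestCase):

    def test_list_virtual_machines(self):
        si = SmartConnect(host='vcsa.example.test',
                          user='my_user', pwd='my_password')
        try:
            content = si.RetrieveContent()
            # Walk the first datacenter's vmFolder and collect VM names.
            datacenter = content.rootFolder.childEntity[0]
            names = [vm.name for vm in datacenter.vmFolder.childEntity]
            self.assertIsInstance(names, list)
        finally:
            Disconnect(si)


if __name__ == '__main__':
    unittest.main()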

Code as little as possible, test as little as possible

Tests are code. The mark of a good programmer is not how much code they write, but how much they can accomplish with how little. If you are following the aesthetic of minimalist code, then this attitude should also carry into your tests. Your tests should cover as many lines as are needed to validate the code, and should do this as effectively as possible. Ideally, your tests should be efficient: no two tests should cover exactly the same unit. Covering a unit multiple times is effectively wasted effort.

This is a much harder philosophy and practice to follow than the lazy 'cover all the lines' strategy. It requires you to understand the functional cases your code can cover. In a rarefied ideal world this might mean you get 100% coverage anyway, but the percentage isn't what we care about. You can have 100% code coverage and still have horrible comprehension of what your project even is.

Breaking down your tests to cover all the methods, even the private ones, is a horrible idea. If you cover your private methods you tie the tests that matter to what you have already decided (by making those methods private) are implementation details. That equals tight coupling; tight coupling is bad.

How do you test something that provides very little function other than basic interactions with a service? How do you exercise a library that is arguably mostly private and hidden code?

Introducing Fixtures

The testing fixture is a very old concept. It even predates software as a thing, and yet I rarely if ever see a shop using fixtures. The truly sad thing is that fixture libraries for most development languages are as old as the hills. So, WHY are so few projects using them?

A Specific Solution: vcrpy

I reviewed several Python fixture libraries this week and was fairly well impressed with vcrpy for our purposes. The description may mention 'mocking', but the function of the library is to provide you with testing fixtures at the socket level. In libraries like pyVmomi we are effectively a skin over a very complex back-end web service. This 'skin' nature of ours means that a simple set of library interactions may hide dozens of network conversations.

Manually creating dozens of HTTP interaction mocks to explore a single high-level test can be so painful that you are likely to just not do it. Fortunately, tools like vcrpy exist and can record your HTTP traffic. Now you can do the lazy thing: toy with your client and server a bit, record the on-the-wire interactions, and then later (and more importantly) edit the recorded conversations to represent the larger API cases you want to cover.
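As a rough sketch of how that can look, assuming vcrpy's decorator API and a hypothetical host, credentials, and cassette path: the first run below records the real conversation into a 'cassette' file, and later runs replay it with no live vCenter or VCSIM at all.

# A minimal sketch, assuming vcrpy is installed: record once, replay after.
# The host, credentials, and cassette location are hypothetical.
import vcr

from pyVim.connect import SmartConnect, Disconnect

my_vcr = vcr.VCR(
    cassette_library_dir='fixtures',  # where recorded conversations are kept
    record_mode='once',               # record on the first run, replay later
)


@my_vcr.use_cassette('smart_connect.yaml')
def fetch_about_info():
    si = SmartConnect(host='vcsa.example.test',
                      user='my_user', pwd='my_password')
    try:
        # On replay, this 'about' info comes out of the recorded fixture.
        return si.RetrieveContent().about.fullName
    finally:
        Disconnect(si)


if __name__ == '__main__':
    print(fetch_about_info())

Once a cassette like smart_connect.yaml exists, it can be checked in alongside the tests and hand-edited to describe the faults and corner cases that are tedious to reproduce against a real server.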

With the recorded HTTP fixtures at our disposal we can now work with the binding in much more predictable and controlled ways.


More on that next week... (or skip to the end)