paulbellamy.com - Testing With Intent: Writing Quality Tests

Testing With Intent: Writing Quality Tests

14 December, 2018

In section 1 we saw that testing (like so many things) is all about getting the most return on investment. If you’re taking the time to write and maintain a comprehensive test suite, you should make sure your time is well spent. Often, the standard of a test suite decides whether it will be maintained, built up, and relied upon.

In this post, we’ll look at ways to make sure your test suite stays top-notch, so that your teammates find it useful instead of a drag. First we’ll define quality; with a clear vision of our aims, we can tell whether we’re winning. Then we’ll look at test structure and naming, dependency management, and when to use mocks. Finally, we’ll wrap up by talking about how to make sure your test suite starts fast and stays fast.

Quantifying Quality

There is a natural analogy between tests and rock-climbing. Like pre-set hooks, tests lay out a path. Well-named and well-structured tests are little messages left by previous developers. Go this way! Look over here! They are also safety-nets on your journey: checkpoints where you can lock on and ensure you never break essential functionality. The dual nature of tests, as both a safety-net and a trail of breadcrumbs from the developer before you, shows us the essential principles we should expect.

Flexible

The safety-net is most essential when you’re refactoring. If the tests are too deeply coupled to your implementation, they will constantly be “in your way”. So, a top-notch test suite is as decoupled as possible from the specific implementation. That way we are not tempted to refactor and move our safety-net at the same time.

Reliable

Imagine that your safety-net drops you 5% of the time, or on Feb 29, or whenever daylight-saving time comes into effect. You would never rely on that for real peace of mind. The same is true in a CI pipeline: the damage done by flaky tests is difficult to overestimate. Or maybe the test suite takes too long to run. This is like a safety-net that is too far away. It’s still there, but you’re less willing to rely on it.

Understandable

Imagine that, while climbing, the path you are on ends without warning. Maybe it veers wildly off to the side, or the hooks are too far apart to be seen. Suddenly you are not on a route. You are off by yourself. Of course, the climb may continue, but you no longer have a guide. A test suite should be an understandable route-map, left behind to guide the next developer through your thought process. What constraints are on the system? What is the required behaviour? What are the things we don’t care about?

Like almost anything, most of these principles come with intentional directed practice. Thoughtful introspection when writing tests (and code) will help us improve over time. But, let’s look at some ways we can improve our tests immediately.

Arrange, Act, Assert

The first, and simplest, way to improve understandability is to have structure. Almost all tests can be broken down into 3 “stages”. First, we arrange our system and test data. Second, we act upon the system. Finally, we assert that the outcome was what we expected. In a real test that would look like:

func TestReverse(t *testing.T) {
  // Arrange
  input := "abc"

  // Act
  result := Reverse(input)

  // Assert
  assert.Equal(t, "cba", result)
}

In this case, the arrange step is just defining our input. The act step is simply calling our pure function. The assert step is an equality check on the output.
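For readers who want something they can run as-is, here is a minimal sketch of the same test using only the standard `testing` package. The body of `Reverse` is an assumption (the post only shows it being called), and the plain `if` check stands in for the assertion helper.

```go
package main

import "testing"

// Reverse returns s with its runes in reverse order.
// Assumed implementation: the post only shows the call site.
func Reverse(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func TestReverse(t *testing.T) {
	// Arrange
	input := "abc"

	// Act
	result := Reverse(input)

	// Assert
	if result != "cba" {
		t.Errorf("Reverse(%q) = %q, want %q", input, result, "cba")
	}
}
```

Saved in a file ending in `_test.go`, this runs with `go test`.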

Unrelated assertions

Importantly, it’s not: arrange, act, assert, act, assert, etc. Each test should (ideally) have a single action. When you have many actions and assertions in a single test it becomes much more difficult to determine the point of the test. For example, this test is actually two separate tests, combined into one.

func TestLoggedInUsersCanTweet(t *testing.T) {
  // Arrange
  server := NewServer()
  user := NewUser()
  
  // Act
  server.LoginUser(user)
  // Act (again)
  server.Tweet(user, "hello, world!")

  // Assert
  assert.Equal(t, []string{"hello, world!"}, user.Tweets())
  // Assert (something unrelated)
  assert.Equal(t, 1*time.Minute, server.Uptime())
}

What is the goal, property, or behaviour we are aiming to describe? This often becomes an issue with end-to-end, or acceptance tests, where some login and user setup is essential. If this is an issue, it is best to extract the login and setup steps into helper functions. That helps keep the test purpose clear.

In this case, we could split this test into three separate tests and a helper, each with a clear purpose:

func TestServerLaunches(t *testing.T) {
  // Arrange
  server := NewServer()
  
  // Act
  server.Launch()

  // Assert
  assert.Equal(t, 1*time.Minute, server.Uptime())
}

func TestUsersCanLogIn(t *testing.T) {
  // Arrange
  server := NewServer()
  user := NewUser()
  
  // Act
  server.LoginUser(user)

  // Assert
  assert.Equal(t, user, server.LoggedInUser())
}

func TestLoggedInUsersCanTweet(t *testing.T) {
  // Arrange
  server := NewServer()
  user := NewLoggedInUser(server) // A setup-helper 
  
  // Act
  server.Tweet(user, "hello, world!")

  // Assert
  assert.Equal(t, []string{"hello, world!"}, user.Tweets())
}

This structure helps keep tests clear, tidy, and focused.
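As a rough sketch of what such a setup helper might look like, the example below uses only the standard `testing` package. The `Server` and `User` types, `IsLoggedIn`, `TweetsBy`, and the lower-case `newLoggedInUser` are all hypothetical stand-ins invented for illustration; they are not the API from the snippets above. The one real library detail is `t.Helper()`, which makes a failure inside the helper get reported at the calling test's line.

```go
package main

import "testing"

// User and Server are hypothetical stand-ins for the types in the post.
type User struct {
	Name string
}

type Server struct {
	loggedIn map[string]bool
	tweets   map[string][]string
}

func NewServer() *Server {
	return &Server{loggedIn: map[string]bool{}, tweets: map[string][]string{}}
}

func (s *Server) LoginUser(u *User) { s.loggedIn[u.Name] = true }

func (s *Server) IsLoggedIn(u *User) bool { return s.loggedIn[u.Name] }

// Tweet records the message only if the user is logged in.
func (s *Server) Tweet(u *User, msg string) {
	if s.loggedIn[u.Name] {
		s.tweets[u.Name] = append(s.tweets[u.Name], msg)
	}
}

func (s *Server) TweetsBy(u *User) []string { return s.tweets[u.Name] }

// newLoggedInUser is the setup helper: it hides the login step so the
// test body contains only the one action we care about.
func newLoggedInUser(t *testing.T, s *Server) *User {
	t.Helper() // report failures at the caller's line, not here
	u := &User{Name: "alice"}
	s.LoginUser(u)
	if !s.IsLoggedIn(u) {
		t.Fatalf("setup: could not log in %q", u.Name)
	}
	return u
}

func TestLoggedInUsersCanTweet(t *testing.T) {
	// Arrange
	server := NewServer()
	user := newLoggedInUser(t, server)

	// Act
	server.Tweet(user, "hello, world!")

	// Assert
	got := server.TweetsBy(user)
	if len(got) != 1 || got[0] != "hello, world!" {
		t.Errorf("tweets = %v, want [hello, world!]", got)
	}
}
```

If the setup itself breaks, the helper fails loudly with a "setup:" message, so a login bug is not mistaken for a tweeting bug.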

Add a clue to the test’s intent in the name

A test’s name should give a clue to its intent. It is best if the test is named specifically for the goal we are testing. For example, TestUserCanPostTweets is a great test name. It is specific about the expected outcome. On the other hand, TestUserTweeterIsCorrect and TestServerWorks are both too vague to be useful. What is “correct”? What does “works” mean in this context?

A helpful rule for naming tests is: subject, action, assertion. What is the subject we are testing? What is the action we are taking? And what is the expected outcome? If we have all three of these in the test name, we are on the right track.
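To make the rule concrete, here is a small sketch with three tests whose names each carry a subject (`Reverse`), an action (the kind of input), and an expected outcome. The test names and the body of `Reverse` are assumptions chosen for illustration.

```go
package main

import "testing"

// Reverse returns s with its runes in reverse order (assumed implementation).
func Reverse(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

// Subject: Reverse. Action: given an empty string. Assertion: result is empty.
func TestReverseOfEmptyStringIsEmpty(t *testing.T) {
	if got := Reverse(""); got != "" {
		t.Errorf("got %q, want empty string", got)
	}
}

// Subject: Reverse. Action: given a palindrome. Assertion: result is unchanged.
func TestReverseOfPalindromeIsUnchanged(t *testing.T) {
	if got := Reverse("level"); got != "level" {
		t.Errorf("got %q, want %q", got, "level")
	}
}

// Subject: Reverse. Action: given multi-byte runes. Assertion: runes stay intact.
func TestReverseKeepsMultiByteRunesIntact(t *testing.T) {
	if got := Reverse("世界"); got != "界世" {
		t.Errorf("got %q, want %q", got, "界世")
	}
}
```

When one of these fails, the name alone in the test output says which behaviour broke.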

Non-determinism

One issue which often causes unreliable tests is non-determinism in the system-under-test.

If we want to apply a discount only on Mondays, we could call time.Now(), look at what day it is, and voila! Suppose I wrote this code on a Monday. Of course, I can say, “Works on my machine!”

func TestApplyMondayDiscount(t *testing.T) {
  input := NewOrder()
  
  result := ApplyMondayDiscount(input)

  assert.Equal(t, 0.10, result.Discount)
}

func ApplyMondayDiscount(order Order) Order {
  if time.Now().Weekday() == time.Monday {
    return order.ApplyDiscount(0.10)
  }
  return order
}

But, these tests will start failing tomorrow. There are two ways to fix this test (and code). Option one would be to use a mock clock. To inject a spy into our test. But that avoids the real issue with the code. The underlying fault is that our code has a hidden dependency. It depends on the global system clock. Non-determinism is a symptom of impure functions, so we can fix it by making our function pure. In this case, that means turning the secret state into an input.

func TestApplyMondayDiscount(t *testing.T) {
  now := time.Date(2018, time.January, 22, 9, 0, 0, 0, time.UTC) // a Monday
  input := NewOrder()
  
  result := ApplyMondayDiscount(now, input)

  assert.Equal(t, 0.10, result.Discount)
}

func ApplyMondayDiscount(now time.Time, order Order) Order {
  if now.Weekday() == time.Monday {
    return order.ApplyDiscount(0.10)
  }
  return order
}

Immediately, we find two more tests we should write: not-Monday, and a different timezone. In this case improving our test has also improved our code.

Non-determinism is really damaging if you leave it unchecked. Spurious failures weaken the perceived usefulness of a test suite. A test suite which is “failing when it shouldn’t” is in dangerous territory. The bonus is, when you fix these weird bugs you look like a hero!

In general, if you have non-deterministic (i.e. flaky) tests, it means you have impure functions. Typically these come up with uses of sleep, randomness, February (short month, and leap years), and with other globals and singletons.

Dependencies

The sources of non-determinism, like sleep and randomness, are just one kind of external dependency. Other dependencies, like databases and external APIs, are easier to identify.

When testing with external dependencies, it’s tempting to reach for a mock. Maybe we could stub an http client with an http response like this:

func FetchAndSortTweets(client http.Client) {}

func TestTweetSorter(t *testing.T) {
  client := http.Client{}
  client.
    Stub(http.Get).
    With("https://api.twitter.com/api/v1/tweets").
    Returning(`{"tweets":[{"id":1}]}`)

  sorted := FetchAndSortTweets(client)
  ...
}

But there’s a rule to follow when testing external dependencies: don’t mock it unless you own it. We don’t “own” the Twitter API. When the API changes, we’ll have to update all our tests.

To “own” the interface, we wrap and abstract the dependency behind an interface.

type TwitterClient interface {
  FetchTweets() []Tweet
}

func FetchAndSortTweets(client TwitterClient) {}

func TestTweetSorter(t *testing.T) {
  client := NewStubTwitterClient()
  client.
    Stub(client.FetchTweets).
    Returning([]Tweet{{ID: 1}})
  sorted := FetchAndSortTweets(client)
  ...
}

Here we’ve added a TwitterClient interface. Then we can safely stub that interface to test against. This decouples our tests from the details of the Twitter API, and means we can write tests specifically for our Twitter client, separately from the rest of our tests.
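The stubbing calls above are illustrative rather than a specific library. In Go, a stub for a small interface can simply be written by hand, as in this sketch. The `Tweet` fields, the `stubTwitterClient` type, and the choice to sort by `ID` are assumptions made for the example.

```go
package main

import (
	"sort"
	"testing"
)

type Tweet struct {
	ID int
}

type TwitterClient interface {
	FetchTweets() []Tweet
}

// FetchAndSortTweets returns the client's tweets ordered by ID.
// Sorting by ID is an assumption; the post leaves the body empty.
func FetchAndSortTweets(client TwitterClient) []Tweet {
	tweets := client.FetchTweets()
	sorted := make([]Tweet, len(tweets))
	copy(sorted, tweets) // leave the client's slice untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	return sorted
}

// stubTwitterClient is a hand-written stub: it returns canned tweets
// and never touches the network.
type stubTwitterClient struct {
	tweets []Tweet
}

func (s stubTwitterClient) FetchTweets() []Tweet { return s.tweets }

func TestFetchAndSortTweetsOrdersByID(t *testing.T) {
	// Arrange
	client := stubTwitterClient{tweets: []Tweet{{ID: 3}, {ID: 1}, {ID: 2}}}

	// Act
	sorted := FetchAndSortTweets(client)

	// Assert
	if len(sorted) != 3 {
		t.Fatalf("got %d tweets, want 3", len(sorted))
	}
	for i, want := range []int{1, 2, 3} {
		if sorted[i].ID != want {
			t.Errorf("sorted[%d].ID = %d, want %d", i, sorted[i].ID, want)
		}
	}
}
```

The real implementation of `TwitterClient` (the one that speaks HTTP) then gets its own, separate tests.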

The same applies to other dependencies. Databases, and things like that. Some libraries provide good interfaces already, so you don’t have to do it yourself, but not all.

Doubles / Stubs / Fakes / Mocks

Speaking of doubles, stubs, fakes, and mocks, what are they for?

Good uses can include:

API Wrappers
Around legacy code
When absolutely necessary
Things which are hard to set up (and you can’t fix)

When used well, they can isolate tests from each other. When used poorly, they can couple the test to the code implementation, making refactoring harder and the test weaker. Worst of all, they can lead to unrealistic tests which don’t break when assumptions change. On the other hand, because mocks are simple, your tests will involve less code overall, and so will run faster.

Because of these limitations, I prefer using as few mocks as possible. Just my opinion.

Why do tests get slow?

Combinatorial Explosion

The number of tests and complexity always grows faster than your code. \(N\) interacting bits of code will have up to \(N^2\) connections. If you want to test all interactions and combinations you need \(N^2\) tests.

\[N\ \text{components} \;\Rightarrow\; \text{up to } N^2\ \text{interactions}\]

Firstly, this means you can’t possibly test all combinations and interactions. Secondly, it means that to test everything you would need far more tests than the underlying code. Even if we could test everything the test suite would be huge, unwieldy, and slow.
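As a rough worked example, treating \(N^2\) as an upper bound: if we only count distinct pairs of components, the number of possible interactions is

\[\binom{N}{2} = \frac{N(N-1)}{2}\]

so \(N = 10\) components give \(45\) pairs, and doubling to \(N = 20\) gives \(190\), roughly four times as many. If instead we count every group of two or more components that might interact, the count is

\[2^N - N - 1\]

which for \(N = 10\) is already \(1024 - 10 - 1 = 1013\). Either way, the number of interactions to test grows much faster than the number of components.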

You’ll know this is the cause if your test suite is slow because you have too many tests. Each individual test might be fast, but there are simply too many of them. If each test involves too many (say 3 or more) “moving parts”, each test will be slower, and the number of test cases will explode.

Tests are too big

As with the first cause, the combinatorial explosion leads us into a familiar trap. Test setup is painful and slow, so we start combining tests. We put a few “related” assertions into one test. We start relying more and more on integration tests.

Because there are too many components our tests become too high-level. But, more important than the number of components is the amount of coupling between the components. Are the interaction points clearly defined? Or are there so many ways for one component to affect another that they are completely entangled?

One way of managing this is with a well-chosen ratio of unit to integration tests. Changing that ratio while maintaining the quality of the tests can push you towards design changes. The general rule that gets quoted is 10 unit tests for every integration test. Because integration tests involve more components, they will run slower. To manage that we have fewer of them, just enough to test the wiring, and rely more heavily on isolated unit tests. Of course, the right ratio depends on the project.

Combinatorial explosion can also be managed with mindful changes to the program architecture. By being explicit about the components and interfaces between them, the number of interaction-points and test cases can be tamed. Of the two approaches, this is much harder, but also yields the biggest long-term gains.