14  Designing your test suite

Important

The code in this chapter attaches testthat (and requests testthat edition 3) so that the examples can be run here. Your test files should not include these library() calls, and in a real package the testthat edition is declared in DESCRIPTION.

14.1 What to test

Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead. — Martin Fowler

There is a fine balance to writing tests. Each test that you write makes your code less likely to change inadvertently; but it also can make it harder to change your code on purpose. It’s hard to give good general advice about writing tests, but you might find these points helpful:

  • Focus on testing the external interface to your functions - if you test the internal interface, then it’s harder to change the implementation in the future because as well as modifying the code, you’ll also need to update all the tests.

  • Strive to test each behaviour in one and only one test. Then if that behaviour later changes you only need to update a single test.

  • Avoid testing simple code that you’re confident will work. Instead focus your time on code that you’re not sure about, is fragile, or has complicated interdependencies. That said, I often find I make the most mistakes when I falsely assume that the problem is simple and doesn’t need any tests.

  • Always write a test when you discover a bug. You may find it helpful to adopt the test-first philosophy. There you always start by writing the tests, and then write the code that makes them pass. This reflects an important problem solving strategy: start by establishing your success criteria, how you know if you’ve solved the problem.
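To make that last point concrete, here is a sketch of what a bug-driven test might look like; squish_words() and the issue number are both hypothetical:

test_that("squish_words() keeps single spaces between words (#123)", {
  # Hypothetical regression test: written when the bug was reported, it fails
  # until the fix lands, and afterwards guards against the bug returning.
  expect_equal(squish_words("  a  b  "), "a b")
})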

14.1.1 Test coverage

Another concrete way to direct your test writing efforts is to examine your test coverage. The covr package (https://covr.r-lib.org) can be used to determine which lines of your package’s source code are (or are not!) executed when the test suite is run. This is most often presented as a percentage. Generally speaking, the higher the better.

In some technical sense, 100% test coverage is the goal, however, this is rarely achieved in practice and that’s often OK. Going from 90% or 99% coverage to 100% is not always the best use of your development time and energy. In many cases, that last 10% or 1% often requires some awkward gymnastics to cover. Sometimes this forces you to introduce mocking or some other new complexity. Don’t sacrifice the maintainability of your test suite in the name of covering some weird edge case that hasn’t yet proven to be a problem. Also remember that not every line of code or every function is equally likely to harbor bugs. Focus your testing energy on code that is tricky, based on your expert opinion and any empirical evidence you’ve accumulated about bug hot spots.

We use covr regularly, in two different ways:

  • Local, interactive use. We mostly use devtools::test_coverage_active_file() and devtools::test_coverage(), for exploring the coverage of an individual file or the whole package, respectively.
  • Automatic, remote use via GitHub Actions (GHA). We cover continuous integration and GHA more thoroughly elsewhere, but we will at least mention here that usethis::use_github_action("test-coverage") configures a GHA workflow that constantly monitors your test coverage. Test coverage can be an especially helpful metric when evaluating a pull request (either your own or from an external contributor). A proposed change that is well-covered by tests is less risky to merge.
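In concrete terms, that workflow looks something like this:

# local, interactive exploration of test coverage
devtools::test_coverage_active_file()   # coverage for the file you're editing
devtools::test_coverage()               # coverage for the whole package

# one-time setup of the coverage workflow on GitHub Actions
usethis::use_github_action("test-coverage")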

14.2 High-level principles for testing

In later sections, we offer concrete strategies for how to handle common testing dilemmas in R. Here we lay out the high-level principles that underpin these recommendations:

  • A test should ideally be self-sufficient and self-contained.
  • The interactive workflow is important, because you will mostly interact with your tests when they are failing.
  • It’s more important that test code be obvious than, e.g., as DRY as possible.
  • However, the interactive workflow shouldn’t “leak” into and undermine the test suite.

Writing good tests for a code base often feels more challenging than writing the code in the first place. This can come as a bit of a shock when you’re new to package development and you might be concerned that you’re doing it wrong. Don’t worry, you’re not! Testing presents many unique challenges and maneuvers, which tend to get much less air time in programming communities than strategies for writing the “main code”, i.e. the stuff below R/. As a result, it requires more deliberate effort to develop your skills and taste around testing.

Many of the packages maintained by our team violate some of the advice you’ll find here. There are (at least) two reasons for that:

  • testthat has been evolving for more than twelve years and this chapter reflects the cumulative lessons learned from that experience. The tests in many packages have been in place for a long time and reflect typical practices from different eras and different maintainers.
  • These aren’t hard and fast rules, but are, rather, guidelines. There will always be specific situations where it makes sense to bend the rule.

This chapter can’t address all possible testing situations, but hopefully these guidelines will help your future decision-making.

14.2.1 Self-sufficient tests

All tests should strive to be hermetic: a test should contain all of the information necessary to set up, execute, and tear down its environment. Tests should assume as little as possible about the outside environment ….

From the book Software Engineering at Google, Chapter 11

Recall this advice found in Section 7.6, which covers your package’s “main code”, i.e. everything below R/:

The .R files below R/ should consist almost entirely of function definitions. Any other top-level code is suspicious and should be carefully reviewed for possible conversion into a function.

We have analogous advice for your test files:

The test-*.R files below tests/testthat/ should consist almost entirely of calls to test_that(). Any other top-level code is suspicious and should be carefully considered for relocation into calls to test_that() or to other files that get special treatment inside an R package or from testthat.

Eliminating (or at least minimizing) top-level code outside of test_that() will have the beneficial effect of making your tests more hermetic. This is basically the testing analogue of the general programming advice that it’s wise to avoid unstructured sharing of state.

Logic at the top-level of a test file has an awkward scope: Objects or functions defined here have what you might call “test file scope”, if the definitions appear before the first call to test_that(). If top-level code is interleaved between test_that() calls, you can even create “partial test file scope”.

While writing tests, it can feel convenient to rely on these file-scoped objects, especially early in the life of a test suite, e.g. when each test file fits on one screen. But we find that implicitly relying on objects in a test’s parent environment tends to make a test suite harder to understand and maintain over time.

Consider a test file with top-level code sprinkled around it, outside of test_that():

dat <- data.frame(x = c("a", "b", "c"), y = c(1, 2, 3))

skip_if(today_is_a_monday())

test_that("foofy() does this", {
  expect_equal(foofy(dat), ...)
})

dat2 <- data.frame(x = c("x", "y", "z"), y = c(4, 5, 6))

skip_on_os("windows")

test_that("foofy2() does that", {
  expect_snapshot(foofy2(dat, dat2))
})

We recommend relocating file-scoped logic to either a narrower scope or to a broader scope. Here’s what it would look like to use a narrow scope, i.e. to inline everything inside test_that() calls:

test_that("foofy() does this", {
  skip_if(today_is_a_monday())
  
  dat <- data.frame(x = c("a", "b", "c"), y = c(1, 2, 3))
  
  expect_equal(foofy(dat), ...)
})

test_that("foofy() does that", {
  skip_if(today_is_a_monday())
  skip_on_os("windows")
  
  dat <- data.frame(x = c("a", "b", "c"), y = c(1, 2, 3))
  dat2 <- data.frame(x = c("x", "y", "z"), y = c(4, 5, 6))
  
  expect_snapshot(foofy2(dat, dat2))
})

Below we will discuss techniques for moving file-scoped logic to a broader scope.

14.2.2 Self-contained tests

Each test_that() test has its own execution environment, which makes it somewhat self-contained. For example, an R object you create inside a test does not exist after the test exits:

exists("thingy")
#> [1] FALSE

test_that("thingy exists", {
  thingy <- "thingy"
  expect_true(exists(thingy))
})
#> Test passed 🥳

exists("thingy")
#> [1] FALSE

The thingy object lives and dies entirely within the confines of test_that(). However, testthat doesn’t know how to clean up after actions that affect other aspects of the R landscape:

  • The filesystem: creating and deleting files, changing the working directory, etc.
  • The search path: library(), attach().
  • Global options, like options() and par(), and environment variables.

Watch how calls like library(), options(), and Sys.setenv() have a persistent effect after a test, even when they are executed inside test_that():

grep("jsonlite", search(), value = TRUE)
#> character(0)
getOption("opt_whatever")
#> NULL
Sys.getenv("envvar_whatever")
#> [1] ""

test_that("landscape changes leak outside the test", {
  library(jsonlite)
  options(opt_whatever = "whatever")
  Sys.setenv(envvar_whatever = "whatever")
  
  expect_match(search(), "jsonlite", all = FALSE)
  expect_equal(getOption("opt_whatever"), "whatever")
  expect_equal(Sys.getenv("envvar_whatever"), "whatever")
})
#> Test passed 🥇

grep("jsonlite", search(), value = TRUE)
#> [1] "package:jsonlite"
getOption("opt_whatever")
#> [1] "whatever"
Sys.getenv("envvar_whatever")
#> [1] "whatever"

These changes to the landscape even persist beyond the current test file, i.e. they carry over into all subsequent test files.

If it’s easy to avoid making such changes in your test code, that is the best strategy! But if it’s unavoidable, then you have to make sure that you clean up after yourself. This mindset is very similar to one we advocated for in Section 7.6, when discussing how to design well-mannered functions.

We like to use the withr package (https://withr.r-lib.org) to make temporary changes in global state, because it automatically captures the initial state and arranges the eventual restoration. You’ve already seen an example of its usage, when we explored snapshot tests:

test_that("side-by-side diffs work", {
  withr::local_options(width = 20)             # <-- (°_°) look here!
  expect_snapshot(
    waldo::compare(c("X", letters), c(letters, "X"))
  )
})

This test requires the display width to be set at 20 columns, which is considerably less than the default width. withr::local_options(width = 20) sets the width option to 20 and, at the end of the test, restores the option to its original value. withr is also pleasant to use during interactive development: deferred actions are still captured on the global environment and can be executed explicitly via withr::deferred_run() or implicitly by restarting R.
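For example, if you execute a test’s body line by line in the console, the cleanup is merely postponed, not lost; a minimal sketch:

# run interactively, i.e. not inside test_that()
withr::local_options(width = 20)  # the cleanup action is deferred on the global environment
getOption("width")                # 20, while you poke around
withr::deferred_run()             # explicitly restore the original width now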

We recommend including withr in Suggests, if you’re only going to use it in your tests, or in Imports, if you also use it below R/. Call withr functions as we do above, e.g. like withr::local_whatever(), in either case. See Section 11.2.1.1 for a full discussion.

Tip

The easiest way to add a package to DESCRIPTION is with, e.g., usethis::use_package("withr", type = "Suggests"). For tidyverse packages, withr is considered a “free dependency”, i.e. the tidyverse uses withr so extensively that we don’t hesitate to use it whenever it would be useful.

withr has a large set of pre-implemented local_*() / with_*() functions that should handle most of your testing needs, so check there before you write your own. If nothing exists that meets your need, withr::defer() is the general way to schedule some action at the end of a test.1
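As a sketch of that pattern, here is what a hand-rolled local_*() helper might look like; local_loud_mode() and the mypkg.loud option are made up for illustration:

local_loud_mode <- function(env = parent.frame()) {
  # capture the original state, change it, and schedule the restoration
  old <- options(mypkg.loud = TRUE)
  withr::defer(options(old), envir = env)
  invisible(old)
}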

Here’s how we would fix the problems in the previous example using withr. (Behind the scenes, we have reversed the earlier landscape changes, so we can start from a clean slate.)

grep("jsonlite", search(), value = TRUE)
#> character(0)
getOption("opt_whatever")
#> NULL
Sys.getenv("envvar_whatever")
#> [1] ""

test_that("withr makes landscape changes local to a test", {
  withr::local_package("jsonlite")
  withr::local_options(opt_whatever = "whatever")
  withr::local_envvar(envvar_whatever = "whatever")
  
  expect_match(search(), "jsonlite", all = FALSE)
  expect_equal(getOption("opt_whatever"), "whatever")
  expect_equal(Sys.getenv("envvar_whatever"), "whatever")
})
#> Test passed 😀

grep("jsonlite", search(), value = TRUE)
#> character(0)
getOption("opt_whatever")
#> NULL
Sys.getenv("envvar_whatever")
#> [1] ""

testthat leans heavily on withr to make test execution environments as reproducible and self-contained as possible. In testthat 3e, testthat::local_reproducible_output() is implicitly part of each test_that() test.

test_that("something specific happens", {
  local_reproducible_output()     # <-- this happens implicitly
  
  # your test code, which might be sensitive to ambient conditions, such as
  # display width or the number of supported colors
})

local_reproducible_output() temporarily sets various options and environment variables to values favorable for testing, e.g. it suppresses colored output, turns off fancy quotes, sets the console width, and sets LC_COLLATE = "C". Usually, you can just passively enjoy the benefits of local_reproducible_output(). But you may want to call it explicitly when replicating test results interactively or if you want to override the default settings in a specific test.
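For example, when reproducing a snapshot failure in the console, you can opt in by hand; a minimal sketch, where the width value is just an illustration:

devtools::load_all()
testthat::local_reproducible_output(width = 40)  # mimic the conditions tests run under
# now re-run the code from the failing test and inspect the output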

14.2.3 Plan for test failure

We regret to inform you that most of the quality time you spend with your tests will be when they are inexplicably failing.

In its purest form, automating testing consists of three activities: writing tests, running tests, and reacting to test failures….

Remember that tests are often revisited only when something breaks. When you are called to fix a broken test that you have never seen before, you will be thankful someone took the time to make it easy to understand. Code is read far more than it is written, so make sure you write the test you’d like to read!

From the book Software Engineering at Google, Chapter 11

Most of us don’t work on a code base the size of Google. But even in a team of one, tests that you wrote six months ago might as well have been written by someone else. Especially when they are failing.

When we do reverse dependency checks, often involving hundreds or thousands of CRAN packages, we have to inspect test failures to determine if changes in our packages are to blame. As a result, we regularly engage with failing tests in other people’s packages, which leaves us with lots of opinions about practices that create unnecessary testing pain.

Test troubleshooting nirvana looks like this: In a fresh R session, you can do devtools::load_all() and immediately run an individual test or walk through it line-by-line. There is no need to hunt around for setup code that has to be run manually first, that is found elsewhere in the test file or perhaps in a different file altogether. Test-related code that lives in an unconventional location causes extra self-inflicted pain when you least need it.

Consider this extreme and abstract example of a test that is difficult to troubleshoot due to implicit dependencies on free-range code:

# dozens or hundreds of lines of top-level code, interspersed with other tests,
# which you must read and selectively execute

test_that("f() works", {
  x <- function_from_some_dependency(object_with_unknown_origin)
  expect_equal(f(x), 2.5)
})

This test is much easier to drop in on if dependencies are invoked in the normal way, i.e. via ::, and test objects are created inline:

# dozens or hundreds of lines of self-sufficient, self-contained tests,
# all of which you can safely ignore!

test_that("f() works", {
  useful_thing <- ...
  x <- somePkg::someFunction(useful_thing)
  expect_equal(f(x), 2.5)
})

This test is self-sufficient. The code inside { ... } explicitly creates any necessary objects or conditions and makes explicit calls to any helper functions. This test doesn’t rely on objects or dependencies that happen to be ambiently available.

Self-sufficient, self-contained tests are a win-win: It is literally safer to design tests this way and it also makes tests much easier for humans to troubleshoot later.

14.2.4 Repetition is OK

One obvious consequence of our suggestion to minimize code with “file scope” is that your tests will probably have some repetition. And that’s OK! We’re going to make the controversial recommendation that you tolerate a fair amount of duplication in test code, i.e. you can relax some of your DRY (“don’t repeat yourself”) tendencies.

Keep the reader in your test function. Good production code is well-factored; good test code is obvious. … think about what will make the problem obvious when a test fails.

From the blog post Why Good Developers Write Bad Unit Tests

Here’s a toy example to make things concrete.

test_that("multiplication works", {
  useful_thing <- 3
  expect_equal(2 * useful_thing, 6)
})
#> Test passed 🎉

test_that("subtraction works", {
  useful_thing <- 3
  expect_equal(5 - useful_thing, 2)
})
#> Test passed 🎉

In real life, useful_thing is usually a more complicated object that somehow feels burdensome to instantiate. Notice how useful_thing <- 3 appears in more than one place. Conventional wisdom says we should DRY this code out. It’s tempting to just move useful_thing’s definition outside of the tests:

useful_thing <- 3

test_that("multiplication works", {
  expect_equal(2 * useful_thing, 6)
})
#> Test passed 🥳

test_that("subtraction works", {
  expect_equal(5 - useful_thing, 2)
})
#> Test passed 🎉

But we really do think the first form, with the repetition, is often the better choice.

At this point, many readers might be thinking “but the code I might have to repeat is much longer than 1 line!”. Below we describe the use of test fixtures. This can often reduce complicated situations back to something that resembles this simple example.
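As a preview, one such fixture is a small factory function, defined in a helper file or below R/, that builds the object on demand; new_useful_thing() is a hypothetical name:

new_useful_thing <- function() {
  # in real life, this constructs something more complicated than a number
  3
}

test_that("multiplication works", {
  useful_thing <- new_useful_thing()
  expect_equal(2 * useful_thing, 6)
})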

14.2.5 Remove tension between interactive and automated testing

Your test code will be executed in two different settings:

  • Interactive test development and maintenance, e.g. while you are writing a new test or debugging a failing one.
  • Automated runs of the whole test suite, e.g. via devtools::test(), devtools::check(), or R CMD check.

Automated testing of your whole package is what takes priority. This is ultimately the whole point of your tests. However, the interactive experience is clearly important for the humans doing this work. Therefore it’s important to find a pleasant workflow, but also to ensure that you don’t rig anything for interactive convenience that actually compromises the health of the test suite.

These two modes of test-running should not be in conflict with each other. If you perceive tension between these two modes, this can indicate that you’re not taking full advantage of some of testthat’s features and the way it’s designed to work with devtools::load_all().

When working on your tests, use load_all(), just like you do when working below R/. By default, load_all() does all of these things:

  • Simulates re-building, re-installing, and re-loading your package.
  • Makes everything in your package’s namespace available, including unexported functions and objects and anything you’ve imported from another package.
  • Attaches testthat, i.e. does library(testthat).
  • Runs test helper files, i.e. executes tests/testthat/helper.R (more on that below).
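In practice, an interactive session built around load_all() often looks something like this:

devtools::load_all()            # simulate package install; also attaches testthat
devtools::test_active_file()    # run the tests in the file you're editing
devtools::test()                # run the whole test suite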

This eliminates the need for any library() calls below tests/testthat/, for the vast majority of R packages. Any instance of library(testthat) is clearly no longer necessary. Likewise, any instance of attaching one of your dependencies via library(somePkg) is unnecessary. In your tests, if you need to call functions from somePkg, do it just as you do below R/. If you have imported the function into your namespace, use fun(). If you have not, use somePkg::fun(). It’s fair to say that library(somePkg) in the tests should be about as rare as taking a dependency via Depends, i.e. there is almost always a better alternative.

Unnecessary calls to library(somePkg) in test files have a real downside, because they actually change the R landscape. library() alters the search path. This means the circumstances under which you are testing may not necessarily reflect the circumstances under which your package will be used. This makes it easier to create subtle test bugs, which you will have to unravel in the future.

One other function that should almost never appear below tests/testthat/ is source(). There are several special files with an official role in testthat workflows (see below), not to mention the entire R package machinery, that provide better ways to make functions, objects, and other logic available in your tests.

14.3 Files relevant to testing

Here we review which package files are especially relevant to testing and, more generally, best practices for interacting with the file system from your tests.

14.3.1 Hiding in plain sight: files below R/

The most important functions you’ll need to access from your tests are clearly those in your package! Here we’re talking about everything that’s defined below R/. The functions and other objects defined by your package are always available when testing, regardless of whether they are exported or not. For interactive work, devtools::load_all() takes care of this. During automated testing, this is taken care of internally by testthat.

This implies that test helpers can absolutely be defined below R/ and used freely in your tests. It might make sense to gather such helpers in a clearly marked file, such as one of these:

.                              
├── ...
└── R
    ├── ...
    ├── test-helpers.R
    ├── test-utils.R
    ├── utils-testing.R
    └── ...
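For example, R/test-helpers.R might hold a small factory used only by the tests; new_fake_dat() is a hypothetical name:

# R/test-helpers.R (sketch)
new_fake_dat <- function(n = 3) {
  # a deterministic stand-in for the more complicated objects the package works with
  data.frame(id = seq_len(n), label = letters[seq_len(n)])
}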

14.3.2 tests/testthat.R

Recall the initial testthat setup described in Section 13.4. The standard tests/testthat.R file, as created by usethis::use_testthat(), typically looks something like this (with your package’s actual name in place of yourpkg):
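library(testthat)
library(yourpkg)

test_check("yourpkg")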

We repeat the advice not to edit tests/testthat.R. It is run during R CMD check (and, therefore, devtools::check()), but is not used in most other test-running scenarios (such as devtools::test(), devtools::test_active_file(), or interactive development). Do not attach your dependencies here with library(); call their functions in your tests in the same manner as you do below R/.

14.3.3 Testthat helper files

Another type of file that is always executed by load_all() and at the beginning of automated testing is a helper file, defined as any file below tests/testthat/ that begins with helper. Helper files are a mighty weapon in the battle to eliminate code floating around at the top-level of test files. Helper files are a prime example of what we mean when we recommend moving such code into a broader scope. Objects or functions defined in a helper file are available to all of your tests.

If you have just one such file, you should probably name it helper.R. If you organize your helpers into multiple files, you could include a suffix with additional info. Here are examples of how such files might look:

.                              
├── ...
└── tests
    ├── testthat
    │   ├── helper.R
    │   ├── helper-blah.R
    │   ├── helper-foo.R    
    │   ├── test-foofy.R
    │   └── (more test files)
    └── testthat.R

Many developers use helper files to define custom test helper functions, which we describe in detail below. Compared to defining helpers below R/, some people find that tests/testthat/helper.R makes it more clear that these utilities are specifically for testing the package. This location also feels more natural if your helpers rely on testthat functions.

A helper file is also a good location for setup code that is needed for its side effects. This is a case where tests/testthat/helper.R is clearly more appropriate than a file below R/. For example, in an API-wrapping package, helper.R is a good place to (attempt to) authenticate with the testing credentials.
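Here is a sketch of what that could look like; pkg_auth() and testing_token() are hypothetical functions:

# tests/testthat/helper.R (sketch)
# Attempt to authenticate with the testing credentials. If this fails (e.g. on
# CRAN, where the token is unavailable), API-dependent tests should detect the
# missing auth and skip themselves.
tryCatch(
  pkg_auth(token = testing_token()),
  error = function(e) message("Test authentication failed; API tests will be skipped")
)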

14.3.4 Testthat setup files

Testthat has one more special file type: setup files, defined as any file below tests/testthat/ that begins with setup. Here’s an example of how that might look:

.                              
├── ...
└── tests
    ├── testthat
    │   ├── helper.R
    │   ├── setup.R
    │   ├── test-foofy.R
    │   └── (more test files)
    └── testthat.R

A setup file is handled almost exactly like a helper file, but with two big differences:

  • Setup files are not executed by devtools::load_all().
  • Setup files often contain the corresponding teardown code.

Setup files are good for global test setup that is tailored for test execution in non-interactive or remote environments. For example, you might turn off behaviour that’s aimed at an interactive user, such as messaging or writing to the clipboard.

If any of your setup should be reversed after test execution, you should also include the necessary teardown code in setup.R.2 We recommend maintaining teardown code alongside the setup code, in setup.R, because this makes it easier to ensure they stay in sync. The artificial environment teardown_env() exists as a magical handle to use in withr::defer() and withr::local_*() / withr::with_*().

Here’s a setup.R example from the reprex package, where we turn off clipboard and HTML preview functionality during testing:

op <- options(reprex.clipboard = FALSE, reprex.html_preview = FALSE)

withr::defer(options(op), teardown_env())

Since we are just modifying options here, we can be even more concise and use the pre-built function withr::local_options() and pass teardown_env() as the .local_envir:

withr::local_options(
  list(reprex.clipboard = FALSE, reprex.html_preview = FALSE),
  .local_envir = teardown_env()
)

14.3.5 Files ignored by testthat

testthat only automatically executes files where these are both true:

  • File is a direct child of tests/testthat/
  • File name starts with one of the specific strings:
    • helper
    • setup
    • test

It is fine to have other files or directories in tests/testthat/, but testthat won’t automatically do anything with them (other than the _snaps directory, which holds snapshots).

14.3.6 Storing test data

Many packages contain files that hold test data. Where should these be stored? The best location is somewhere below tests/testthat/, often in a subdirectory, to keep things neat. Below is an example, where useful_thing1.rds and useful_thing2.rds hold objects used in the test files.

.
├── ...
└── tests
    ├── testthat
    │   ├── fixtures
    │   │   ├── make-useful-things.R
    │   │   ├── useful_thing1.rds
    │   │   └── useful_thing2.rds
    │   ├── helper.R
    │   ├── setup.R
    │   └── (all the test files)
    └── testthat.R
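The make-useful-things.R script presumably holds the code that created the .rds fixtures, so they can be regenerated when needed; a minimal sketch:

# tests/testthat/fixtures/make-useful-things.R (sketch)
# Not run automatically; re-run by hand whenever the fixture objects must change.
useful_thing1 <- list(a = 1:3, b = "a stand-in for the real, complicated object")
saveRDS(
  useful_thing1,
  testthat::test_path("fixtures", "useful_thing1.rds")
)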

Then, in your tests, use testthat::test_path() to build a robust filepath to such files.

test_that("foofy() does this", {
  useful_thing <- readRDS(test_path("fixtures", "useful_thing1.rds"))
  # ...
})

testthat::test_path() is extremely handy, because it produces the correct path in the two important modes of test execution:

  • Interactive test development and maintenance, where working directory is presumably set to the top-level of the package.
  • Automated testing, where working directory is usually set to something below tests/.

14.3.7 Where to write files during testing

If it’s easy to avoid writing files from your tests, that is definitely the best plan. But there are many times when you really must write files.

You should only write files inside the session temp directory. Do not write into your package’s tests/ directory. Do not write into the current working directory. Do not write into the user’s home directory. Even though you are writing into the session temp directory, you should still clean up after yourself, i.e. delete any files you’ve written.

Most package developers don’t want to hear this, because it sounds like a hassle. But it’s not that burdensome once you get familiar with a few techniques and build some new habits. A high level of file system discipline also eliminates various testing bugs and will absolutely make your CRAN life run more smoothly.

This test is from roxygen2 and demonstrates everything we recommend:

test_that("can read from file name with utf-8 path", {
  path <- withr::local_tempfile(
    pattern = "Universit\u00e0-",
    lines = c("#' @include foo.R", NULL)
  )
  expect_equal(find_includes(path), "foo.R")
})

withr::local_tempfile() creates a file within the session temp directory whose lifetime is tied to the “local” environment – in this case, the execution environment of an individual test. It is a wrapper around base::tempfile() and passes, e.g., the pattern argument through, so you have some control over the file name. You can optionally provide lines to populate the file with at creation time or you can write to the file in all the usual ways in subsequent steps. Finally, with no special effort on your part, the temporary file will automatically be deleted at the end of the test.

Sometimes you need even more control over the file name. In that case, you can use withr::local_tempdir() to create a self-deleting temporary directory and write intentionally-named files inside this directory.
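For example, here is a minimal sketch of a test that needs full control over a file’s name while still writing only to the session temp directory and cleaning up automatically:

test_that("data round-trips through a file named data.csv", {
  dir <- withr::local_tempdir()        # deleted automatically when the test ends
  path <- file.path(dir, "data.csv")   # full control over the file name
  write.csv(data.frame(x = 1:3), path, row.names = FALSE)
  expect_equal(read.csv(path)$x, 1:3)
})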


  1. Base R’s on.exit() is another alternative, but it requires more from you. You need to capture the original state and write the restoration code yourself. Also remember to do on.exit(..., add = TRUE) if there’s any chance a second on.exit() call could be added in the test. You probably also want to default to after = FALSE.↩︎

  2. A legacy approach (which still works, but is no longer recommended) is to put teardown code in tests/testthat/teardown.R.↩︎