name,where,temp
Adam,beach,95
Bess,coast,91
Cora,seashore,28
Dale,beach,85
Evan,seaside,31
5 The package within
This part of the book ends the same way it started, with the development of a small toy package. Chapter 1 established the basic mechanics, workflow, and tooling of package development, but said practically nothing about the R code inside the package. This chapter focuses primarily on the package’s R code and how it differs from R code in a script.
Starting with a data analysis script, you learn how to find the package that lurks within. You’ll isolate and then extract reusable data and logic from the script, put this into an R package, and then use that package in a much simplified script. We’ve included a few rookie mistakes along the way, in order to highlight special considerations for the R code inside a package.
Note that the section headers incorporate the NATO phonetic alphabet (alfa, bravo, etc.) and have no specific meaning. They are just a convenient way to mark our progress towards a working package. It is fine to follow along just by reading and this chapter is completely self-contained, i.e. it’s not a prerequisite for material later in the book. But if you wish to see the state of specific files along the way, they can be found in the source files for the book.
5.1 Alfa: a script that works
Let’s consider data-cleaning.R
, a fictional data analysis script for a group that collects reports from people who went for a swim:
Where did you swim and how hot was it outside?
Their data usually comes as a CSV file, such as swim.csv
:
data-cleaning.R
begins by reading swim.csv
into a data frame:
infile <- "swim.csv"
(dat <- read.csv(infile))
#> name where temp
#> 1 Adam beach 95
#> 2 Bess coast 91
#> 3 Cora seashore 28
#> 4 Dale beach 85
#> 5 Evan seaside 31
They then classify each observation as using American (“US”) or British (“UK”) English, based on the word chosen to describe the sandy place where the ocean and land meet. The where
column is used to build the new english
column.
dat$english[dat$where == "beach"] <- "US"
dat$english[dat$where == "coast"] <- "US"
dat$english[dat$where == "seashore"] <- "UK"
dat$english[dat$where == "seaside"] <- "UK"
Sadly, the temperatures are often reported in a mix of Fahrenheit and Celsius. In the absence of better information, they guess that Americans report temperatures in Fahrenheit and therefore those observations are converted to Celsius.
dat$temp[dat$english == "US"] <- (dat$temp[dat$english == "US"] - 32) * 5/9
dat
#> name where temp english
#> 1 Adam beach 35.0 US
#> 2 Bess coast 32.8 US
#> 3 Cora seashore 28.0 UK
#> 4 Dale beach 29.4 US
#> 5 Evan seaside 31.0 UK
Finally, this cleaned (cleaner?) data is written back out to a CSV file. They like to capture a timestamp in the filename when they do this1.
Here is data-cleaning.R
in its entirety:
infile <- "swim.csv"
(dat <- read.csv(infile))
dat$english[dat$where == "beach"] <- "US"
dat$english[dat$where == "coast"] <- "US"
dat$english[dat$where == "seashore"] <- "UK"
dat$english[dat$where == "seaside"] <- "UK"
dat$temp[dat$english == "US"] <- (dat$temp[dat$english == "US"] - 32) * 5/9
dat
now <- Sys.time()
timestamp <- format(now, "%Y-%B-%d_%H-%M-%S")
(outfile <- paste0(timestamp, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile)))
write.csv(dat, file = outfile, quote = FALSE, row.names = FALSE)
Even if your typical analytical tasks are quite different, hopefully you see a few familiar patterns here. It’s easy to imagine that this group does very similar pre-processing of many similar data files over time. Their analyses can be more efficient and consistent if they make these standard data maneuvers available to themselves as functions in a package, instead of inlining the same data and logic into dozens or hundreds of data ingest scripts.
5.2 Bravo: a better script that works
The package that lurks within the original script is actually pretty hard to see! It’s obscured by a few suboptimal coding practices, such as the use of repetitive copy/paste-style code and the mixing of code and data. Therefore a good first step is to refactor this code, isolating as much data and logic as possible in proper objects and functions, respectively.
This is also a good time to introduce the use of some add-on packages, for several reasons. First, we would actually use the tidyverse for this sort of data wrangling. Second, many people use add-on packages in their scripts, so it is good to see how add-on packages are handled inside a package.
Here’s the new and improved version of the script.
library(tidyverse)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
lookup_table <- tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
dat <- dat %>%
left_join(lookup_table)
f_to_c <- function(x) (x - 32) * 5/9
dat <- dat %>%
mutate(temp = if_else(english == "US", f_to_c(temp), temp))
dat
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
write_csv(dat, outfile_path(infile))
The key changes to note are:
- We are using functions from tidyverse packages (specifically from readr and dplyr) and we make them available with
library(tidyverse)
. - The map between different “beach” words and whether they are considered to be US or UK English is now isolated in a lookup table, which lets us create the
english
column in one go with aleft_join()
. This lookup table makes the mapping easier to comprehend and would be much easier to extend in the future with new “beach” words. -
f_to_c()
,timestamp()
, andoutfile_path()
are new helper functions that hold the logic for converting temperatures and forming the timestamped output file name.
It’s getting easier to recognize the reusable bits of this script, i.e. the bits that have nothing to do with a specific input file, like swim.csv
. This sort of refactoring often happens naturally on the way to creating your own package, but if it does not, it’s a good idea to do this intentionally.
5.3 Charlie: a separate file for helper functions
A typical next step is to move reusable data and logic out of the analysis script and into one or more separate files. This is a conventional opening move, if you want to use these same helper files in multiple analyses.
Here is the content of beach-lookup-table.csv
:
where,english
beach,US
coast,US
seashore,UK
seaside,UK
Here is the content of cleaning-helpers.R
:
library(tidyverse)
localize_beach <- function(dat) {
lookup_table <- read_csv(
"beach-lookup-table.csv",
col_types = cols(where = "c", english = "c")
)
left_join(dat, lookup_table)
}
f_to_c <- function(x) (x - 32) * 5/9
celsify_temp <- function(dat) {
mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
We’ve added some high-level helper functions, localize_beach()
and celsify_temp()
, to the pre-existing helpers (f_to_c()
, timestamp()
, and outfile_path()
).
Here is the next version of the data cleaning script, now that we’ve pulled out the helper functions (and lookup table).
Notice that the script is getting shorter and, hopefully, easier to read and modify, because repetitive and fussy clutter has been moved out of sight. Whether the code is actually easier to work with is subjective and depends on how natural the “interface” feels for the people who actually preprocess swimming data. These sorts of design decisions are the subject of a separate project: design.tidyverse.org.
Let’s assume the group agrees that our design decisions are promising, i.e. we seem to be making things better, not worse. Sure, the existing code is not perfect, but this is a typical developmental stage when you’re trying to figure out what the helper functions should be and how they should work.
5.4 Delta: a failed attempt at making a package
While this first attempt to create a package will end in failure, it’s still helpful to go through some common missteps, to illuminate what happens behind the scenes.
Here are the simplest steps that you might take, in an attempt to convert cleaning-helpers.R
into a proper package:
- Use
usethis::create_package("path/to/delta")
to scaffold a new R package, with the name “delta”.- This is a good first step!
- Copy
cleaning-helpers.R
into the new package, specifically, toR/cleaning-helpers.R
.- This is morally correct, but mechanically wrong in several ways, as we will soon see.
- Copy
beach-lookup-table.csv
into the new package. But where? Let’s try the top-level of the source package.- This is not going to end well. Shipping data files in a package is a special topic, which is covered in Chapter 7.
- Install this package, perhaps using
devtools::install()
or via Ctrl + Shift + B (Windows & Linux) or Cmd + Shift + B in RStudio.- Despite all of the problems identified above, this actually works! Which is interesting, because we can (try to) use it and see what happens.
Here is the next version of the data cleaning script that you hope will run after successfully installing this package (which we’re calling “delta”).
The only change from our previous script is that
source("cleaning-helpers.R")
has been replaced by
library(delta)
Here’s what actually happens if you install the delta package and try to run the data cleaning script:
library(tidyverse)
library(delta)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
localize_beach() %>%
celsify_temp()
#> Error in localize_beach(.) : could not find function "localize_beach"
write_csv(dat, outfile_path(infile))
#> Error in outfile_path(infile) : could not find function "outfile_path"
None of the helper functions are actually available for use, even though you call library(delta)
! In contrast to source()
ing a file of helper functions, attaching a package does not dump its functions into the global workspace. By default, functions in a package are only for internal use. You need to export localize_beach()
, celsify_temp()
, and outfile_path()
so your users can call them. In the devtools workflow, we achieve this by putting @export
in the special roxygen comment above each function (namespace management is covered in Section 11.3), like so:
After you add the @export
tag to localize_beach()
, celsify_temp()
, and outfile_path()
, you run devtools::document()
to (re)generate the NAMESPACE
file, and re-install the delta package. Now when you re-execute the data cleaning script, it works!
Correction: it sort of works sometimes. Specifically, it works if and only if the working directory is set to the top-level of the source package. From any other working directory, you still get an error:
The lookup table consulted inside localize_beach()
cannot be found. One does not simply dump CSV files into the source of an R package and expect things to “just work”. We will fix this in our next iteration of the package (Chapter 7 has full coverage of how to include data in a package).
Before we abandon this initial experiment, let’s also marvel at the fact that you were able to install, attach, and, to a certain extent, use a fundamentally broken package. devtools::load_all()
works fine, too! This is a sobering reminder that you should be running R CMD check
, probably via devtools::check()
, very often during development. This will quickly alert you to many problems that simple installation and usage does not reveal.
Indeed, check()
fails for this package and you see this:
* installing *source* package ‘delta’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
Error in library(tidyverse) : there is no package called ‘tidyverse’
Error: unable to load R code in package ‘delta’
Execution halted
ERROR: lazy loading failed for package ‘delta’
* removing ‘/Users/jenny/rrr/delta.Rcheck/delta’
What do you mean “there is no package called ‘tidyverse’”?!? We’re using it, with no problems, in our main script! Also, we’ve already installed and used this package, why can’t R CMD check
find it?
This error is what happens when the strictness of R CMD check
meets the very first line of R/cleaning-helpers.R
:
This is not how you declare that your package depends on another package (the tidyverse, in this case). This is also not how you make functions in another package available for use in yours. Dependencies must be declared in DESCRIPTION
(and that’s not all). Since we declared no dependencies, R CMD check
takes us at our word and tries to install our package with only the base packages available, which means this library(tidyverse)
call fails. A “regular” installation succeeds, simply because the tidyverse is available in your regular library, which hides this particular mistake.
To review, copying cleaning-helpers.R
to R/cleaning-helpers.R
, without further modification, was problematic in (at least) the following ways:
- Does not account for exported vs. non-exported functions.
- The CSV file holding our lookup table cannot be found in the installed package.
- Does not properly declare our dependency on other add-on packages.
5.5 Echo: a working package
We’re ready to make the most minimal version of this package that actually works.
Here is the new version of R/cleaning-helpers.R
2:
lookup_table <- dplyr::tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
#' @export
localize_beach <- function(dat) {
dplyr::left_join(dat, lookup_table)
}
f_to_c <- function(x) (x - 32) * 5/9
#' @export
celsify_temp <- function(dat) {
dplyr::mutate(dat, temp = dplyr::if_else(english == "US", f_to_c(temp), temp))
}
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
#' @export
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
We’ve gone back to defining the lookup_table
with R code, since the initial attempt to read it from CSV created some sort of filepath snafu. This is OK for small, internal, static data, but remember to see Chapter 7 for more general techniques for storing data in a package.
All of the calls to tidyverse functions have now been qualified with the name of the specific package that actually provides the function, e.g. dplyr::mutate()
. There are other ways to access functions in another package, explained in Section 11.4, but this is our recommended default. It is also our strong recommendation that no one depend on the tidyverse meta-package in a package3. Instead, it is better to identify the specific package(s) you actually use. In this case, the package only uses dplyr.
The library(tidyverse)
call is gone and instead we declare the use of dplyr in the Imports
field of DESCRIPTION
:
Package: echo
(... other lines omitted ...)
Imports:
dplyr
This, together with the use of namespace-qualified calls, like dplyr::left_join()
, constitutes a valid way to use another package within yours. The metadata conveyed via DESCRIPTION
is covered in Chapter 9.
All of the user-facing functions have an @export
tag in their roxygen comment, which means that devtools::document()
adds them correctly to the NAMESPACE
file. Note that f_to_c()
is currently only used internally, inside celsify_temp()
, so it is not exported (likewise for timestamp()
).
This version of the package can be installed, used, AND it technically passes R CMD check
, though with 1 warning and 1 note.
* checking for missing documentation entries ... WARNING
Undocumented code objects:
‘celsify_temp’ ‘localize_beach’ ‘outfile_path’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
* checking R code for possible problems ... NOTE
celsify_temp: no visible binding for global variable ‘english’
celsify_temp: no visible binding for global variable ‘temp’
Undefined global functions or variables:
english temp
The “no visible binding” note is a peculiarity of using dplyr and unquoted variable names inside a package, where the use of bare variable names (english
and temp
) looks suspicious. You can add either of these lines to any file below R/
to eliminate this note (such as the package-level documentation file described in Section 16.7):
# option 1 (then you should also put utils in Imports)
utils::globalVariables(c("english", "temp"))
# option 2
english <- temp <- NULL
We’re seeing that it can be tricky to program around a package like dplyr, which makes heavy use of nonstandard evaluation. Behind the scenes, that is the technique that allows dplyr’s end users to use bare (not quoted) variable names. Packages like dplyr prioritize the experience of the typical end user, at the expense of making them trickier to depend on. The two options shown above for suppressing the “no visible binding” note, represent entry-level solutions. For a more sophisticated treatment of these issues, see vignette("in-packages", package = "dplyr")
and vignette("programming", package = "dplyr")
.
The warning about missing documentation is because the exported functions have not been properly documented. This is a valid concern and something you should absolutely address in a real package. You’ve already seen how to create help files with roxygen comments in Section 1.12 and we cover this topic thoroughly in Chapter 16.
5.6 Foxtrot: build time vs. run time
The echo package works, which is great, but group members notice something odd about the timestamps:
Sys.time()
#> [1] "2023-03-26 22:48:48 PDT"
outfile_path("INFILE.csv")
#> [1] "2020-September-03_11-06-33_INFILE_clean.csv"
The datetime in the timestamped filename doesn’t reflect the time reported by the system. In fact, the users claim that the timestamp never seems to change at all! Why is this?
Recall how we form the filepath for output files:
The fact that we capture now <- Sys.time()
outside of the definition of outfile_path()
has probably been vexing some readers for a while. now
reflects the instant in time when we execute now <- Sys.time()
. In the initial approach, now
was assigned when we source()
d cleaning-helpers.R
. That’s not ideal, but it was probably a pretty harmless mistake, because the helper file would be source()
d shortly before we wrote the output file.
But this approach is quite devastating in the context of a package. now <- Sys.time()
is executed when the package is built4. And never again. It is very easy to assume your package code is re-evaluated when the package is attached or used. But it is not. Yes, the code inside your functions is absolutely run whenever they are called. But your functions – and any other objects created in top-level code below R/
– are defined exactly once, at build time.
By defining now
with top-level code below R/
, we’ve doomed our package to timestamp all of its output files with the same (wrong) time. The fix is to make sure the Sys.time()
call happens at run time.
Let’s look again at parts of R/cleaning-helpers.R
:
lookup_table <- dplyr::tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
There are four top-level <-
assignments in this excerpt. The top-level definitions of the data frame lookup_table
and the functions timestamp()
and outfile_path()
are correct. It is appropriate that these be defined exactly once, at build time. The top-level definition of now
, which is then used inside outfile_path()
, is regrettable.
Here are better versions of outfile_path()
:
# always timestamp as "now"
outfile_path <- function(infile) {
ts <- timestamp(Sys.time())
paste0(ts, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
# allow user to provide a time, but default to "now"
outfile_path <- function(infile, time = Sys.time()) {
ts <- timestamp(time)
paste0(ts, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
This illustrates that you need to have a different mindset when defining objects inside a package. The vast majority of those objects should be functions and these functions should generally only use data they create or that is passed via an argument. There are some types of sloppiness that are fairly harmless when a function is defined immediately before its use, but that can be more costly for functions distributed as a package.
5.7 Golf: side effects
The timestamps now reflect the current time, but the group raises a new concern. As it stands, the timestamps reflect who has done the data cleaning and which part of the world they’re in. The heart of the timestamp strategy is this format string5:
This formats Sys.time()
in such a way that it includes the month name (not number) and the local time6.
Table 5.1 shows what happens when such a timestamp is produced by several hypothetical colleagues cleaning some data at exactly the same instant in time.
location | timestamp | LC_TIME | tz |
---|---|---|---|
Rome, Italy | 2020-settembre-05_00-30-00 | it_IT.UTF-8 | Europe/Rome |
Warsaw, Poland | 2020-września-05_00-30-00 | pl_PL.UTF-8 | Europe/Warsaw |
Sao Paulo, Brazil | 2020-setembro-04_19-30-00 | pt_BR.UTF-8 | America/Sao_Paulo |
Greenwich, England | 2020-September-04_23-30-00 | en_GB.UTF-8 | Europe/London |
“Computer World!” | 2020-September-04_22-30-00 | C | UTC |
Note that the month names vary, as does the time, and even the date! The safest choice is to form timestamps with respect to a fixed locale and time zone (presumably the non-geographic choices represented by “Computer World!” above).
You do some research and learn that you can force a certain locale via Sys.setlocale()
and force a certain time zone by setting the TZ environment variable. Specifically, we set the LC_TIME component of the locale to “C” and the time zone to “UTC” (Coordinated Universal Time). Here’s your first attempt to improve timestamp()
:
timestamp <- function(time = Sys.time()) {
Sys.setlocale("LC_TIME", "C")
Sys.setenv(TZ = "UTC")
format(time, "%Y-%B-%d_%H-%M-%S")
}
But your Brazilian colleague notices that datetimes print differently, before and after she uses outfile_path()
from your package:
Before:
#> [1] "2024-novembro-20_04-15-21"
After:
Notice that her month name switched from Portuguese to English and the time is clearly being reported in a different time zone. The calls to Sys.setlocale()
and Sys.setenv()
inside timestamp()
have made persistent (and very surprising) changes to her R session. This sort of side effect is very undesirable and is extremely difficult to track down and debug, especially in more complicated settings.
Here are better versions of timestamp()
:
# use withr::local_*() functions to keep the changes local to timestamp()
timestamp <- function(time = Sys.time()) {
withr::local_locale(c("LC_TIME" = "C"))
withr::local_timezone("UTC")
format(time, "%Y-%B-%d_%H-%M-%S")
}
# use the tz argument to format.POSIXct()
timestamp <- function(time = Sys.time()) {
withr::local_locale(c("LC_TIME" = "C"))
format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
}
# put the format() call inside withr::with_*()
timestamp <- function(time = Sys.time()) {
withr::with_locale(
c("LC_TIME" = "C"),
format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
)
}
These show various methods to limit the scope of our changes to LC_TIME and the timezone. A good rule of thumb is to make the scope of such changes as narrow as possible and practical. The tz
argument of format()
is the most surgical way to deal with the timezone, but nothing similar exists for LC_TIME. We make the temporary locale modification using the withr package, which provides a very flexible toolkit for temporary state changes. This (and base::on.exit()
) are discussed further in Section 6.5. Note that if you use withr as we do above, you would need to list it in DESCRIPTION
in Imports
(Chapter 11, Section 10.1.3).
This underscores a point from the previous section: you need to adopt a different mindset when defining functions inside a package. Try to avoid making any changes to the user’s overall state. If such changes are unavoidable, make sure to reverse them (if possible) or to document them explicitly (if related to the function’s primary purpose).
5.8 Concluding thoughts
Finally, after several iterations, we have successfully extracted the repetitive data cleaning code for the swimming survey into an R package. This example concludes the first part of the book and marks the transition into more detailed reference material on specific package components. Before we move on, let’s review the lessons learned in this chapter.
5.8.1 Script vs. package
When you first hear that expert R users often put their code into packages, you might wonder exactly what that means. Specifically, what happens to your existing R scripts, R Markdown reports, and Shiny apps? Does all of that code somehow get put into a package? The answer is “no”, in most contexts.
Typically, you identify certain recurring operations that occur across multiple projects and this is what you extract into an R package. You will still have R scripts, R Markdown reports, and Shiny apps, but by moving specific pieces of code into a formal package, your data products tend to become more concise and easier to maintain.
5.8.2 Finding the package within
Although the example in this chapter is rather simple, it still captures the typical process of developing an R package for personal or organizational use. You typically start with a collection of idiosyncratic and related R scripts, scattered across different projects. Over time, you begin to notice that certain needs come up over and over again.
Each time you revisit a similar analysis, you might try to elevate your game a bit, compared to the previous iteration. You refactor copy/paste-style code using more robust patterns and start to encapsulate key “moves” in helper functions, which might eventually migrate into their own file. Once you reach this stage, you’re in a great position to take the next step and create a package.
5.8.3 Package code is different
Writing package code is a bit different from writing R scripts and it’s natural to feel some discomfort when making this adjustment. Here are the most common gotchas that trip many of us up at first:
- Package code requires new ways of working with functions in other packages. The
DESCRIPTION
file is the principal way to declare dependencies; we don’t do this vialibrary(somepackage)
. - If you want data or files to be persistently available, there are package-specific methods of storage and retrieval. You can’t just put files in the package and hope for the best.
- It’s necessary to be explicit about which functions are user-facing and which are internal helpers. By default, functions are not exported for use by others.
- A new level of discipline is required to ensure that code runs at the intended time (build time vs. run time) and that there are no unintended side effects.
Sys.time()
returns an object of classPOSIXct
, therefore when we callformat()
on it, we are actually usingformat.POSIXct()
. Read the help for?format.POSIXct
if you’re not familiar with such format strings.↩︎Putting everything in one file, with this name, is not ideal, but it is technically allowed. We discuss organising and naming the files below
R/
in Section 6.1.↩︎The blog post The tidyverse is for EDA, not packages elaborates on this.↩︎
Here we’re referring to when the package code is compiled, which could be either when the binary is made (for macOS or Windows; Section 3.4) or when the package is installed from source (Section 3.5).↩︎
Sys.time()
returns an object of classPOSIXct
, therefore when we callformat()
on it, we are actually usingformat.POSIXct()
. Read the help for?format.POSIXct
if you’re not familiar with such format strings.↩︎It would clearly be better to format according to ISO 8601, which encodes the month by number, but please humor me for the sake of making this example more obvious.↩︎