Some notes about improving base R code

A small collection of tips to make base R code faster.

Etienne Bacher
2022-11-28

Preview image coming from: https://trainingindustry.com/magazine/nov-dec-2018/life-in-the-fast-lane-accelerated-continuous-development-for-fast-paced-organizations/

Lately I’ve spent quite some time on packages that require (almost) only base R.

I’ve used bench::mark() and profvis::profvis() a lot to improve code performance and here are a few things I learnt. By default, bench::mark() checks that all expressions return the same output, so we can be confident that the alternatives I show in this post are truly equivalent.

Before we start, I want to clarify a few things.

First, these performance improvements are mostly relevant to package developers. A regular user shouldn’t really care whether a function takes 200 milliseconds less to run. However, I think a package developer might find these tips interesting.

Second, if you find some ways to speed up my alternatives, feel free to comment. I know that there are a bunch of packages whose reputation is built on being very fast (for example data.table and collapse). I’m only showing some base R code alternatives here.

Finally, here’s a small function that I use to make a classic dataset (like iris or mtcars) much bigger.

make_big <- function(data, nrep = 500000) {
  tmp <- vector("list", length = nrep)
  for (i in 1:nrep) {
    tmp[[i]] <- data
  }
  
  data.table::rbindlist(tmp) |> 
    as.data.frame()
}

Check if a vector has a single value

One easy way to do this is to run length(unique(x)) == 1, which means we first collect all the unique values and then count them. This can be quite inefficient: it would be enough to stop as soon as we find two different values.

What we can do instead is compare all values to the first value of the vector. Below are examples with large vectors (10 and 20 million values). In the first case, the vector only contains 1, and in the second case it contains 1 and 2.

# Should be TRUE
test <- rep(1, 1e7)

bench::mark(
  length(unique(test)) == 1,
  all(test == test[1]),
  iterations = 10
)
# A tibble: 2 × 6
  expression                     min   median itr/se…¹ mem_a…² gc/se…³
  <bch:expr>                <bch:tm> <bch:tm>    <dbl> <bch:b>   <dbl>
1 length(unique(test)) == 1  249.1ms    280ms     3.50 166.1MB    3.50
2 all(test == test[1])        52.3ms     54ms    17.2   38.1MB    3.45
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`
# Should be FALSE
test2 <- rep(c(1, 2), 1e7)

bench::mark(
  length(unique(test2)) == 1,
  all(test2 == test2[1]),
  iterations = 10
)
# A tibble: 2 × 6
  expression                      min   median itr/s…¹ mem_a…² gc/se…³
  <bch:expr>                 <bch:tm> <bch:tm>   <dbl> <bch:b>   <dbl>
1 length(unique(test2)) == 1  483.8ms    512ms    1.93 332.3MB    1.93
2 all(test2 == test2[1])       70.4ms     71ms   12.8   76.3MB    2.57
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`

This is also faster for character vectors:

# Should be FALSE
test3 <- rep(c("a", "b"), 1e7)

bench::mark(
  length(unique(test3)) == 1,
  all(test3 == test3[1]),
  iterations = 10
)
# A tibble: 2 × 6
  expression                      min   median itr/s…¹ mem_a…² gc/se…³
  <bch:expr>                 <bch:tm> <bch:tm>   <dbl> <bch:b>   <dbl>
1 length(unique(test3)) == 1    449ms    474ms    2.10 332.3MB    2.10
2 all(test3 == test3[1])        134ms    138ms    6.88  76.3MB    1.38
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`
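One caveat that the benchmarks above don’t cover: == propagates missing values, so all(x == x[1]) returns NA rather than FALSE when the vector contains NA. A small sketch of how to get a definite answer (my own addition, not part of the original comparison):

```r
x <- c(1, 1, NA, 1)

# all() combined with == returns NA when missing values are present
all(x == x[1])
#> NA

# Wrapping in isTRUE() gives a definite TRUE/FALSE
isTRUE(all(x == x[1]))
#> FALSE

# Or ignore the missing values explicitly
all(x == x[1], na.rm = TRUE)
#> TRUE
```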

Concatenate columns

Sometimes we need to concatenate columns, for example if we want to create a unique id from several grouping columns.

test <- data.frame(
  origin = c("A", "B", "C"),
  destination = c("Z", "Y", "X"),
  value = 1:3
)

test <- make_big(test)

One option is to combine paste() and apply() with MARGIN = 1, so that paste() is applied to each row. However, a faster way is to use do.call() instead of apply(), which calls paste() once on whole columns rather than once per row:

bench::mark(
  apply = apply(test[, c("origin", "destination")], 1, paste, collapse = "_"),
  do.call = do.call(paste, c(test[, c("origin", "destination")], sep = "_"))
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 apply         7.78s    7.78s     0.129    80.1MB     5.14
2 do.call     297.4ms 297.59ms     3.36     11.4MB     0   
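To see why do.call() wins: it expands the selected columns into a single vectorized call, equivalent to paste(test$origin, test$destination, sep = "_"), instead of calling paste() once per row. A small standalone sketch (with a toy data frame, not the benchmark data) showing the equivalence:

```r
df <- data.frame(origin = c("A", "B"), destination = c("Z", "Y"))

# do.call() builds paste(df$origin, df$destination, sep = "_") for us
via_do_call <- do.call(paste, c(df[, c("origin", "destination")], sep = "_"))
direct      <- paste(df$origin, df$destination, sep = "_")

identical(via_do_call, direct)
#> TRUE  (both give c("A_Z", "B_Y"))
```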

Giving attributes to large dataframes

This one comes from this StackOverflow question and its answer. Manipulating a dataframe can remove some attributes. For example, if I give an attribute foo to a large dataframe:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
attr(orig, "foo")
[1] TRUE

If I reorder the columns, this attribute disappears:

new <- orig[, c(2, 1)]
attr(new, "foo")
NULL

We can put it back with:

attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attr(new, "foo")
[1] TRUE

But this takes some time because we also copy the 10 million row names of the dataset. Therefore, one option is to create a custom function that only copies the attributes that are in orig but not in new (in this case, only the foo attribute):

replace_attrs <- function(obj, new_attrs) {
  # Skip the default data.frame attributes (names, class, row.names)
  for (nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
    attr(obj, which = nm) <- new_attrs[[nm]]
  }
  return(obj)
}

bench::mark(
  old = {
    attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
    head(new)
  },
  new = {
    new <- replace_attrs(new, attributes(orig))
    head(new)
  }
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 old            87ms   92.3ms      10.8    38.2MB     2.70
2 new          88.5µs   95.3µs    9422.     24.4KB     6.80

Find empty rows

It can be useful to remove empty rows, meaning rows containing only NA or "". We could once again use apply() with MARGIN = 1, but a faster way is to use rowSums(). First, we build a logical matrix with is.na(test) | test == "", and then we count the number of TRUE values in each row. If this number is equal to the number of columns, the row only contains NA or "".

test <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c("", NA, "", NA, ""),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5),
  e = c("", "", "", "", ""),
  f = factor(c("", "", "", "", "")),
  g = factor(c("", NA, "", NA, "")),
  stringsAsFactors = FALSE
)

test <- make_big(test, 100000)

bench::mark(
  apply = which(apply(test, 1, function(i) all(is.na(i) | i == ""))),
  rowSums = which(rowSums((is.na(test) | test == "")) == ncol(test))
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 apply          2.8s     2.8s     0.357   112.9MB     3.22
2 rowSums     739.3ms  739.3ms     1.35     99.7MB     0   
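Once we have that logical vector of empty rows, dropping them is a simple subset. A small standalone sketch (with toy data, not the benchmark data frame above):

```r
df <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "", "y"),
  stringsAsFactors = FALSE
)

# TRUE for rows where every cell is NA or ""
empty <- rowSums(is.na(df) | df == "") == ncol(df)

# Keep only the non-empty rows (here, rows 1 and 3)
df[!empty, , drop = FALSE]
```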

Conclusion

These were just a few tips I discovered. Maybe there are ways to make them even faster in base R? Or maybe you know some weird/hidden tips? If so, feel free to comment below!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/etiennebacher/personal_website_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bacher (2022, Nov. 28). Etienne Bacher: Some notes about improving base R code. Retrieved from https://www.etiennebacher.com/posts/2022-11-28-some-notes-about-improving-base-r-code/

BibTeX citation

@misc{bacher2022some,
  author = {Bacher, Etienne},
  title = {Etienne Bacher: Some notes about improving base R code},
  url = {https://www.etiennebacher.com/posts/2022-11-28-some-notes-about-improving-base-r-code/},
  year = {2022}
}