Some notes about improving base R code

A small collection of tips to make base R code faster.

Etienne Bacher
2022-11-28

Preview image coming from: https://trainingindustry.com/magazine/nov-dec-2018/life-in-the-fast-lane-accelerated-continuous-development-for-fast-paced-organizations/

Lately I’ve spent quite some time on packages that require (almost) only base R.

I’ve used bench::mark() and profvis::profvis() a lot to improve code performance and here are a few things I learnt. By default, bench::mark() checks that all expressions return the same output, so we can be confident that the alternatives I show in this post are truly equivalent.

Before we start, I want to clarify a few things.

First, these performance improvements are mostly relevant to package developers. A regular user shouldn’t really care whether a function takes 200 milliseconds less to run. However, I think a package developer might find these tips interesting.

Second, if you find some ways to speed up my alternatives, feel free to comment. I know that there are a bunch of packages whose reputation is built on being very fast (for example data.table and collapse). I’m only showing some base R code alternatives here.

Finally, here’s a small function that I use to make a classic dataset (like iris or mtcars) much bigger.

make_big <- function(data, nrep = 500000) {
  tmp <- vector("list", length = nrep)
  for (i in 1:nrep) {
    tmp[[i]] <- data
  }
  
  data.table::rbindlist(tmp) |> 
    as.data.frame()
}

Check if a vector has a single value

One easy way to do this is to run length(unique(x)) == 1, which means we first collect all the unique values and then count them. This can be quite inefficient: it would be enough to stop as soon as we find two different values.

What we can do instead is compare all values to the first value of the vector. Below are examples with large vectors (10 and 20 million values). In the first case, the vector only contains 1, and in the second case it contains 1 and 2.

# Should be TRUE
test <- rep(1, 1e7)

bench::mark(
  length(unique(test)) == 1,
  all(test == test[1]),
  iterations = 10
)
# A tibble: 2 × 6
  expression                     min   median itr/se…¹ mem_a…² gc/se…³
  <bch:expr>                <bch:tm> <bch:tm>    <dbl> <bch:b>   <dbl>
1 length(unique(test)) == 1  249.1ms    280ms     3.50 166.1MB    3.50
2 all(test == test[1])        52.3ms     54ms    17.2   38.1MB    3.45
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`
# Should be FALSE
test2 <- rep(c(1, 2), 1e7)

bench::mark(
  length(unique(test2)) == 1,
  all(test2 == test2[1]),
  iterations = 10
)
# A tibble: 2 × 6
  expression                      min   median itr/s…¹ mem_a…² gc/se…³
  <bch:expr>                 <bch:tm> <bch:tm>   <dbl> <bch:b>   <dbl>
1 length(unique(test2)) == 1  483.8ms    512ms    1.93 332.3MB    1.93
2 all(test2 == test2[1])       70.4ms     71ms   12.8   76.3MB    2.57
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`

This is also faster for character vectors:

# Should be FALSE
test3 <- rep(c("a", "b"), 1e7)

bench::mark(
  length(unique(test3)) == 1,
  all(test3 == test3[1]),
  iterations = 10
)
# A tibble: 2 × 6
  expression                      min   median itr/s…¹ mem_a…² gc/se…³
  <bch:expr>                 <bch:tm> <bch:tm>   <dbl> <bch:b>   <dbl>
1 length(unique(test3)) == 1    449ms    474ms    2.10 332.3MB    2.10
2 all(test3 == test3[1])        134ms    138ms    6.88  76.3MB    1.38
# … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`
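One caveat that the benchmarks above don’t cover: == propagates missing values, so all(x == x[1]) returns NA rather than FALSE when the vector contains NA. A small sketch of how to get a definite answer (my own addition, not part of the original comparison):

```r
x <- c(1, 1, NA, 1)

# all() combined with == returns NA when missing values are present
all(x == x[1])
#> NA

# Wrapping in isTRUE() gives a definite TRUE/FALSE
isTRUE(all(x == x[1]))
#> FALSE

# Or ignore the missing values explicitly
all(x == x[1], na.rm = TRUE)
#> TRUE
```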

Concatenate columns

Sometimes we need to concatenate columns, for example if we want to create a unique id from several grouping columns.

test <- data.frame(
  origin = c("A", "B", "C"),
  destination = c("Z", "Y", "X"),
  value = 1:3
)

test <- make_big(test)

One option is to combine paste() and apply() with MARGIN = 1, so that paste() is applied to each row. However, a faster way is to use do.call() instead of apply(), which calls paste() once on whole columns rather than once per row:

bench::mark(
  apply = apply(test[, c("origin", "destination")], 1, paste, collapse = "_"),
  do.call = do.call(paste, c(test[, c("origin", "destination")], sep = "_"))
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 apply         7.78s    7.78s     0.129    80.1MB     5.14
2 do.call     297.4ms 297.59ms     3.36     11.4MB     0   
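To see why do.call() wins: it expands the selected columns into a single vectorized call, equivalent to paste(test$origin, test$destination, sep = "_"), instead of calling paste() once per row. A small standalone sketch (with a toy data frame, not the benchmark data) showing the equivalence:

```r
df <- data.frame(origin = c("A", "B"), destination = c("Z", "Y"))

# do.call() builds paste(df$origin, df$destination, sep = "_") for us
via_do_call <- do.call(paste, c(df[, c("origin", "destination")], sep = "_"))
direct      <- paste(df$origin, df$destination, sep = "_")

identical(via_do_call, direct)
#> TRUE  (both give c("A_Z", "B_Y"))
```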

Giving attributes to large dataframes

This one comes from this StackOverflow question and its answer. Manipulating a dataframe can remove some attributes. For example, if I give an attribute foo to a large dataframe:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
attr(orig, "foo")
[1] TRUE

If I reorder the columns, this attribute disappears:

new <- orig[, c(2, 1)]
attr(new, "foo")
NULL

We can put it back with:

attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attr(new, "foo")
[1] TRUE

But this takes some time because we also copy the 10 million row names of the dataset. Therefore, one option is to create a custom function that only copies the attributes that are in orig but not in new (in this case, only the foo attribute):

replace_attrs <- function(obj, new_attrs) {
  # Skip the default data.frame attributes (names, class, row.names)
  for (nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
    attr(obj, which = nm) <- new_attrs[[nm]]
  }
  return(obj)
}

bench::mark(
  old = {
    attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
    head(new)
  },
  new = {
    new <- replace_attrs(new, attributes(orig))
    head(new)
  }
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 old            87ms   92.3ms      10.8    38.2MB     2.70
2 new          88.5µs   95.3µs    9422.     24.4KB     6.80

Find empty rows

It can be useful to remove empty rows, meaning rows containing only NA or "". We could once again use apply() with MARGIN = 1, but a faster way is to use rowSums(). First, we build a logical matrix with is.na(test) | test == "", and then we count the number of TRUE values in each row. If this number is equal to the number of columns, the row only contains NA or "".

test <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c("", NA, "", NA, ""),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5),
  e = c("", "", "", "", ""),
  f = factor(c("", "", "", "", "")),
  g = factor(c("", NA, "", NA, "")),
  stringsAsFactors = FALSE
)

test <- make_big(test, 100000)

bench::mark(
  apply = which(apply(test, 1, function(i) all(is.na(i) | i == ""))),
  rowSums = which(rowSums((is.na(test) | test == "")) == ncol(test))
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 apply          2.8s     2.8s     0.357   112.9MB     3.22
2 rowSums     739.3ms  739.3ms     1.35     99.7MB     0   
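Once we have that logical vector of empty rows, dropping them is a simple subset. A small standalone sketch (with toy data, not the benchmark data frame above):

```r
df <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "", "y"),
  stringsAsFactors = FALSE
)

# TRUE for rows where every cell is NA or ""
empty <- rowSums(is.na(df) | df == "") == ncol(df)

# Keep only the non-empty rows (here, rows 1 and 3)
df[!empty, , drop = FALSE]
```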

Conclusion

These were just a few tips I discovered. Maybe there are ways to make them even faster in base R? Or maybe you know some weird/hidden tips? If so, feel free to comment below!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/etiennebacher/personal_website_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bacher (2022, Nov. 28). Etienne Bacher: Some notes about improving base R code. Retrieved from https://www.etiennebacher.com/posts/2022-11-28-some-notes-about-improving-base-r-code/

BibTeX citation

@misc{bacher2022some,
  author = {Bacher, Etienne},
  title = {Etienne Bacher: Some notes about improving base R code},
  url = {https://www.etiennebacher.com/posts/2022-11-28-some-notes-about-improving-base-r-code/},
  year = {2022}
}