Using `tidylog` adds a small overhead to each function call. For instance, because tidylog needs to figure out how many rows were dropped when you use `tidylog::filter`, this call will be a bit slower than using `dplyr::filter` directly. The overhead is usually not noticeable, but can be for larger datasets, especially when using joins. The benchmarks below give some impression of how large the overhead is.
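
To make the source of this overhead concrete, here is a minimal sketch of a logging wrapper around `dplyr::filter`. The function `logged_filter` is hypothetical and is not tidylog's actual code; it only illustrates the kind of bookkeeping (counting rows before and after, then formatting a message) that happens on top of the underlying dplyr call.

```r
library(dplyr)

# Hypothetical wrapper, NOT tidylog's implementation: run the dplyr verb,
# then compare row counts before and after to report what was dropped.
logged_filter <- function(.data, ...) {
  n_before <- nrow(.data)
  result <- dplyr::filter(.data, ...)
  n_dropped <- n_before - nrow(result)
  message(sprintf(
    "filter: removed %d rows (%.0f%%), %d rows remaining",
    n_dropped, 100 * n_dropped / n_before, nrow(result)
  ))
  result
}

small <- logged_filter(mtcars, cyl == 4)
```

The wrapper returns exactly what `dplyr::filter` returns, so any extra time is spent on the counting and the message.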

`filter` on a small dataset:

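```
# dplyr provides %>%, tibble() and band_members; knitr provides kable()
library(dplyr)
library(knitr)

bench::mark(
  dplyr::filter(mtcars, cyl == 4),
  tidylog::filter(mtcars, cyl == 4),
  iterations = 100
) %>%
  dplyr::select(expression, min, median, n_itr) %>%
  kable()
```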

expression | min | median | n_itr |
---|---|---|---|
dplyr::filter(mtcars, cyl == 4) | 1.81ms | 3.72ms | 98 |
tidylog::filter(mtcars, cyl == 4) | 3.85ms | 6.49ms | 97 |

On a larger dataset (100,000 rows):

```
df <- tibble(x = rnorm(100000))
bench::mark(
  dplyr::filter(df, x > 0),
  tidylog::filter(df, x > 0),
  iterations = 100
) %>%
  dplyr::select(expression, min, median, n_itr) %>%
  kable()
```

expression | min | median | n_itr |
---|---|---|---|
dplyr::filter(df, x > 0) | 7.47ms | 13.4ms | 95 |
tidylog::filter(df, x > 0) | 8.8ms | 12.9ms | 96 |

`mutate` on a small dataset:

```
bench::mark(
  dplyr::mutate(mtcars, cyl = as.factor(cyl)),
  tidylog::mutate(mtcars, cyl = as.factor(cyl)),
  iterations = 100
) %>%
  dplyr::select(expression, min, median, n_itr) %>%
  kable()
```

expression | min | median | n_itr |
---|---|---|---|
dplyr::mutate(mtcars, cyl = as.factor(cyl)) | 3.11ms | 5.63ms | 97 |
tidylog::mutate(mtcars, cyl = as.factor(cyl)) | 5.31ms | 8.2ms | 94 |

On a larger dataset (10,000 rows):

```
df <- tibble(x = round(runif(10000) * 10))
bench::mark(
  dplyr::mutate(df, x = as.factor(x)),
  tidylog::mutate(df, x = as.factor(x)),
  iterations = 100
) %>%
  dplyr::select(expression, min, median, n_itr) %>%
  kable()
```

expression | min | median | n_itr |
---|---|---|---|
dplyr::mutate(df, x = as.factor(x)) | 15.4ms | 21.1ms | 95 |
tidylog::mutate(df, x = as.factor(x)) | 19.7ms | 26.7ms | 93 |

Joins are the most expensive operation, as tidylog has to do two additional joins behind the scenes.
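
As an illustration of why extra joins are needed, the sketch below reports how many rows of each input found no match. The helper `logged_inner_join` is hypothetical, and the use of two `anti_join` calls is an assumption made for this example; it is not taken from tidylog's source.

```r
library(dplyr)

# Hypothetical helper, NOT tidylog's implementation.
logged_inner_join <- function(x, y, by) {
  result <- dplyr::inner_join(x, y, by = by)
  # Two extra joins (assumed here to be anti joins): rows of each input
  # that have no match in the other input.
  only_in_x <- dplyr::anti_join(x, y, by = by)
  only_in_y <- dplyr::anti_join(y, x, by = by)
  message(sprintf(
    "inner_join: %d rows only in x, %d rows only in y, %d rows in result",
    nrow(only_in_x), nrow(only_in_y), nrow(result)
  ))
  result
}

joined <- logged_inner_join(band_members, band_instruments, by = "name")
```

Each reporting join scans both inputs again, which is consistent with joins showing the largest relative slowdown in the benchmarks.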

On a small dataset:

```
bench::mark(
  dplyr::inner_join(band_members, band_instruments, by = "name"),
  tidylog::inner_join(band_members, band_instruments, by = "name"),
  iterations = 100
) %>%
  dplyr::select(expression, min, median, n_itr) %>%
  kable()
```

expression | min | median | n_itr |
---|---|---|---|
dplyr::inner_join(band_members, band_instruments, by = "name") | 6.12ms | 8.9ms | 95 |
tidylog::inner_join(band_members, band_instruments, by = "name") | 23.86ms | 34.9ms | 82 |

On a larger dataset (1,000 rows per table, with many duplicated keys):

```
N <- 1000
df1 <- tibble(x1 = rnorm(N), key = round(runif(N) * 10))
df2 <- tibble(x2 = rnorm(N), key = round(runif(N) * 10))
bench::mark(
  dplyr::inner_join(df1, df2, by = "key"),
  tidylog::inner_join(df1, df2, by = "key"),
  iterations = 100
) %>%
  dplyr::select(expression, min, median, n_itr) %>%
  kable()
```

expression | min | median | n_itr |
---|---|---|---|
dplyr::inner_join(df1, df2, by = "key") | 11.8ms | 16.1ms | 91 |
tidylog::inner_join(df1, df2, by = "key") | 31ms | 38.1ms | 83 |
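
To compare the results at a glance, the median timings reported above can be turned into tidylog-to-dplyr ratios. This is a small base R sketch; the data frame `medians` and its column names are made up for this example, and the numbers are copied by hand from the tables.

```r
# Median timings in milliseconds, copied from the benchmark tables above.
medians <- data.frame(
  benchmark = c(
    "filter, small", "filter, large",
    "mutate, small", "mutate, large",
    "inner_join, small", "inner_join, large"
  ),
  dplyr_ms = c(3.72, 13.4, 5.63, 21.1, 8.9, 16.1),
  tidylog_ms = c(6.49, 12.9, 8.2, 26.7, 34.9, 38.1)
)

# Ratio above 1 means the tidylog call had the slower median.
medians$ratio <- round(medians$tidylog_ms / medians$dplyr_ms, 2)
print(medians)
```

In these runs the joins have the highest ratios. The ratio for `filter` on the larger dataset falls slightly below 1, which suggests the overhead there is within measurement noise.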