Databases in R and Tidyverse
If you’re using R to do data analysis, most of the data you need probably already lives in a relational database. This blog post discusses how to use R to load data in to a local DataBase, interact with the DB from Rstudio IDE, and to capitalize on dplyr’s database tools. This tutorial will use the nycflights13::planes and nycflights13::flights data to feed our database. Typically, data would already exist in the database of interest.
First you have to install and load the following packages:
library(nycflights13)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DBI)
library(RSQLite)
library(tidypredict)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
DBI
Thereafter, we would connect to our database using the DBI package. For the sake of example, I simply connect to an “in-memory” database, but a wide range of database connectors are available depending on where your data lives. It is noteworthy that just like working with local in-memory data stored in data frames, dplyr also works with remote on-disk data stored in databases. This is particularly useful in scenarios where data is already in a database, or data exceeds local memory and you need to use some external storage engine.
con <- dbConnect(SQLite(), ":memory:")
The copy_to() function is used to get data to DB, this function has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, carrier, plane, and destination. Creating the right indexes is key to good database performance, but is unfortunately beyond the scope of this article.
copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
))
It also possible to copy to a DB without indexing
copy_to(con, nycflights13::planes, "planes")
To access the DB it is either possible to run queries either on a connection using dbGetQuery
dbGetQuery(con, '
SELECT flight, type
FROM flights
JOIN planes
ON flights.tailnum = planes.tailnum
ORDER BY planes.tailnum
limit 10')%>%kable()
flight | type |
---|---|
4560 | Fixed wing multi engine |
4269 | Fixed wing multi engine |
4667 | Fixed wing multi engine |
4334 | Fixed wing multi engine |
4298 | Fixed wing multi engine |
4520 | Fixed wing multi engine |
4297 | Fixed wing multi engine |
4370 | Fixed wing multi engine |
4352 | Fixed wing multi engine |
4695 | Fixed wing multi engine |
this can be done with different types of queries as follows
dbGetQuery(con, '
SELECT COUNT (dest)
FROM flights
WHERE (dest="SEA" AND year=2013)')%>%kable()
COUNT (dest) |
---|
3923 |
dbGetQuery(con, '
SELECT COUNT (distinct carrier) AS "Number of unique airlines"
FROM flights
WHERE (dest="SEA" AND year=2013)')%>%kable()
Number of unique airlines |
---|
5 |
dbGetQuery(con, '
SELECT tailnum, COUNT(tailnum)
FROM flights
GROUP BY tailnum
ORDER BY COUNT(tailnum) DESC
limit 10
')%>%kable()
tailnum | COUNT(tailnum) |
---|---|
N725MQ | 575 |
N722MQ | 513 |
N723MQ | 507 |
N711MQ | 486 |
N713MQ | 483 |
N258JB | 427 |
N298JB | 407 |
N353JB | 404 |
N351JB | 402 |
N735MQ | 396 |
the other opetion would be to output the tables from Rmarkdown using {sql, connection = con, output.var = “df”}
SELECT flight, type
FROM flights
JOIN planes
ON flights.tailnum = planes.tailnum
ORDER BY planes.tailnum
limit 10
here the output is the df Dataframe
df%>%kable()
flight | type |
---|---|
4560 | Fixed wing multi engine |
4269 | Fixed wing multi engine |
4667 | Fixed wing multi engine |
4334 | Fixed wing multi engine |
4298 | Fixed wing multi engine |
4520 | Fixed wing multi engine |
4297 | Fixed wing multi engine |
4370 | Fixed wing multi engine |
4352 | Fixed wing multi engine |
4695 | Fixed wing multi engine |
dbplyr
As you can see above the Structured Query Language SQL is usually used to interact with a database. The dbplyr package in R aims to automatically generate SQL queries for R users. However, SQL is a very large language and dbplyr doesn’t do everything therefore it focusses exclusively on SELECT statements, which is the the SQL you write most often as an analyst.
Most of the time you don’t need to know anything about SQL, and you can continue to use the dplyr verbs that you’re already familiar with. The most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, dplyr tries to be as lazy as possible: It never pulls data into R unless you explicitly ask for it. It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
Surprisingly, this sequence of operations never touches the database. It’s not until you ask for the data (e.g. by printing tailnum_delay) that dplyr generates the SQL and requests the results from the database. Even then it tries to do as little work as possible and only pulls down a few rows.
we can use tbl() to take a reference to it:
flights_db <- tbl(con, "flights")
Behind the scenes, dplyr is translating your R code into SQL. You can see the SQL it’s generating with show_query(), also it is possible use capture.output() to convert that query (otherwise printed to the R console) to a character vector.
tailnum_delay_db <- flights_db %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay),
n = n()
) %>%
arrange(desc(delay)) %>%
filter(n > 100)
tailnum_delay_db %>% show_query()
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## Warning: ORDER BY is ignored in subqueries without LIMIT
## ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
## <SQL>
## SELECT *
## FROM (SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n`
## FROM `flights`
## GROUP BY `tailnum`)
## WHERE (`n` > 100.0)
tailnum_query <- capture.output(show_query(tailnum_delay_db))
## Warning: ORDER BY is ignored in subqueries without LIMIT
## ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
If you’re familiar with SQL, this probably isn’t exactly the query you’d write but it is a very close approximate. You can learn more about the SQL translation in vignette(“translation-verb”) and vignette(“translation-function”) from the dbplyr package.
Typically, you’ll iterate a few times before you figure out what data you need from the database. Once you’ve figured it out, use collect() to pull all the data down into a local tibble. The collect() function requires that database does some work, so it may take a long time to complete. Otherwise, dplyr tries to prevent you from accidentally performing expensive query operations.
tailnum_delay <- tailnum_delay_db %>% collect()%>%kable()
## Warning: ORDER BY is ignored in subqueries without LIMIT
## ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
tidypredict
It is also possible to run prediction models inside the databases using R. The tidypredict package parses a R model object and then creates the SQL statement needed to calculate the fitted prediction. In other words, it takes a fitted R model object and returns a formula in ‘Tidy Eval’ code that calculates the predictions.
model <- lm(arr_delay ~ month + distance + air_time, data = flights)
tidypredict_sql(model, dbplyr::simulate_mssql())
## <SQL> -1.17390925699898 + (`month` * -0.0414672658738873) + (`distance` * -0.0875558911189957) + (`air_time` * 0.664509571024122)
The tidypredict_sql() function returns a SQL query that contains the coefficients (model$coefficients) operated against the correct variable or categorical variable value. In most cases the resulting SQL is one short CASE WHEN statement per coefficient. It appends the offset field or value, if one is provided.
dbGetQuery(con, 'SELECT -1.17390925699898 + (month * -0.0414672658738873) + (distance * -0.0875558911189957) + (air_time * 0.664509571024122) AS estimated_Delay, arr_delay
FROM flights
limit 10')%>%kable()
estimated_Delay | arr_delay |
---|---|
27.050048 | 11 |
25.649154 | 20 |
9.757789 | 33 |
-17.598209 | -18 |
9.150145 | -25 |
35.508373 | 12 |
10.530112 | 19 |
13.953332 | -14 |
9.163202 | -8 |
26.308476 | 8 |