Explore UK Biobank data

Ken B. Hanscombe


The UK Biobank is a resource that includes detailed health-related and genetic data on about 500,000 individuals and is available to the research community. ukbtools removes all the upfront data wrangling required to get a single dataset for statistical analysis, and provides tools to assist in quality control, query of disease diagnoses, and retrieval of genetic metadata.

Getting started

Download and decrypt your data with the supplied helper programs. To use ukbtools, you need to create a UKB fileset (.tab, .r, and .html):

ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs

ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file ukbxxxx.tab and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).

Note. Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation. Updated versions of these helper programs exist. Other than small name changes (underscores removed), they appear to function similarly.

Installing the package

In R,

# Install from CRAN
install.packages("ukbtools")

# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE)

Making a dataset

The function ukb_df() takes the stem of your fileset and returns a dataframe with usable column names.


my_ukb_data <- ukb_df("ukbxxxx")

You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a subdirectory of the working directory called data:

my_ukb_data <- ukb_df("ukbxxxx", path = "/full/path/to/my/ukb/fileset/data")
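
Before calling ukb_df(), it can help to confirm that all three files of the fileset sit in the directory you pass as path. The sketch below uses only base R. The helper name check_fileset and the data subdirectory are illustrative assumptions, not part of ukbtools.

```r
# Hypothetical helper (not part of ukbtools): report which fileset files exist.
check_fileset <- function(stem, path = ".") {
  files <- file.path(path, paste0(stem, c(".tab", ".r", ".html")))
  # Named logical vector: TRUE where the file is present
  stats::setNames(file.exists(files), basename(files))
}

# Build the full path to a "data" subdirectory of the working directory
fileset_dir <- file.path(getwd(), "data")
found <- check_fileset("ukbxxxx", path = fileset_dir)
print(found)

if (all(found)) {
  message("Fileset complete in ", fileset_dir)
} else {
  message("Missing: ", paste(names(found)[!found], collapse = ", "))
}
```

If any file is reported missing, move the three files back into one directory before reading the data.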

Making a key

Use ukb_df_field to create a field code-to-descriptive name key, as a dataframe or a named lookup vector.

my_ukb_key <- ukb_df_field("ukbxxxx", path = "/full/path/to/my/ukb/fileset/data")
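
To illustrate the difference between a key held as a dataframe and as a named lookup vector, here is a toy example in base R. The column names field_code and col_name and the descriptive names are invented for illustration. The actual columns returned by ukb_df_field may differ.

```r
# Toy key with hypothetical columns (not the real ukb_df_field output)
toy_key <- data.frame(
  field_code = c("31", "21001", "189"),
  col_name = c("sex_0_0", "body_mass_index_bmi_0_0", "townsend_deprivation_index_0_0"),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are field codes, values are descriptive names
toy_lookup <- setNames(toy_key$col_name, toy_key$field_code)

# Look up one or several field codes by name
toy_lookup["21001"]
unname(toy_lookup[c("31", "189")])
```

The dataframe form is convenient for filtering and joining, while the lookup vector is convenient for renaming or translating a handful of codes.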

Note. You can move the three files in your fileset after creating them with ukb_conv, but they should be kept together. ukb_df() automatically updates the read call in the R source file to point to the correct directory (the current directory by default, or the directory specified by path).

Memory and efficiency

To reduce your memory usage, you could save your new UKB dataset with save(my_ukb_data, file = "my_ukb_data.rda"). Load the dataset with load("my_ukb_data.rda"). A UKB dataset from my largest UKB fileset, which included a 2.6 GB .tab file, took a little under 2 minutes to create with ukb_df. The associated .rda file was 138 MB and loaded in a little under 1.5 minutes.
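
The save and load round trip can be sketched with a toy data frame. The example also shows saveRDS() and readRDS(), a base R alternative that stores a single object and lets you choose its name on reading. The object toy_ukb_data and the temporary file paths are placeholders.

```r
# Toy stand-in for a UKB dataset
toy_ukb_data <- data.frame(eid = 1:3, bmi = c(22.1, 27.4, NA))

# save()/load(): the object is restored under its original name
rda_file <- tempfile(fileext = ".rda")
save(toy_ukb_data, file = rda_file)
rm(toy_ukb_data)
load(rda_file)  # recreates toy_ukb_data in the workspace

# saveRDS()/readRDS(): one object per file, assigned to any name on reading
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy_ukb_data, file = rds_file)
restored <- readRDS(rds_file)

# Check that the round trip preserved the data
all.equal(restored, toy_ukb_data)
```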

Multiple downloads

If you have multiple UKB downloads, first read each one in, then merge them with your preferred method. You could use ukb_df_full_join, which is a thin wrapper around dplyr::full_join applied recursively with purrr::reduce.

ukbxxxx_data <- ukb_df("ukbxxxx")
ukbyyyy_data <- ukb_df("ukbyyyy")
ukbzzzz_data <- ukb_df("ukbzzzz")

ukb_df_full_join(ukbxxxx_data, ukbyyyy_data, ukbzzzz_data)
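
The recursive join can be pictured with base R. The sketch below uses Reduce() and merge(all = TRUE) as stand-ins for purrr::reduce and dplyr::full_join. It is an analogue under that assumption, not the source of ukb_df_full_join, and the toy tables are invented.

```r
# Toy stand-ins for three downloads sharing only the eid column
ukb_a <- data.frame(eid = c(1, 2, 3), var_a = c(10, 20, 30))
ukb_b <- data.frame(eid = c(2, 3, 4), var_b = c("x", "y", "z"),
                    stringsAsFactors = FALSE)
ukb_c <- data.frame(eid = c(4, 5), var_c = c(TRUE, FALSE))

# Full join of two tables on eid: keep rows from both sides
full_join_eid <- function(x, y) merge(x, y, by = "eid", all = TRUE)

# Apply the pairwise join recursively across the list of tables
toy_joined <- Reduce(full_join_eid, list(ukb_a, ukb_b, ukb_c))

toy_joined$eid    # union of eids across all three tables
nrow(toy_joined)
```

Samples absent from a given table receive NA for that table's variables.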

Repeated variables.

The join key is set to “eid” only (default value of the by parameter). Any additional variables common to any two tables will have “.x” and “.y” appended to their names. If you are satisfied the additional variables are identical to the original, the copies can be safely deleted. For example, if setequal(my_ukb_data$var, my_ukb_data$var.x) is TRUE, then my_ukb_data$var.x can be dropped. A dplyr::full_join is like the set operation union in that all observations from all tables are included, i.e., all samples are included even if they are not included in all datasets.
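
As a hedged illustration of checking repeated variables before dropping a copy, the base R sketch below uses merge(), which also appends “.x” and “.y” to shared non-key columns. It compares the copies element-wise on rows where both are present, a stricter check than setequal(), which ignores order and repetition. The toy tables and column names are invented.

```r
a <- data.frame(eid = c(1, 2, 3), sex = c("F", "M", "F"),
                bmi = c(22.1, 27.4, 30.2), stringsAsFactors = FALSE)
b <- data.frame(eid = c(2, 3, 4), sex = c("M", "F", "M"),
                height = c(180, 165, 172), stringsAsFactors = FALSE)

# Full join on eid only; the shared column sex becomes sex.x and sex.y
joined <- merge(a, b, by = "eid", all = TRUE)
names(joined)

# Compare the copies only where both tables supplied a value
both <- !is.na(joined$sex.x) & !is.na(joined$sex.y)
copies_agree <- all(joined$sex.x[both] == joined$sex.y[both])

if (copies_agree) {
  # Fill gaps in one copy from the other, then drop the redundant columns
  joined$sex <- ifelse(is.na(joined$sex.x), joined$sex.y, joined$sex.x)
  joined$sex.x <- NULL
  joined$sex.y <- NULL
}
joined
```

If the copies disagree on any shared row, keep both columns and investigate before deleting anything.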

Repeated variable names within UKB datasets are unlikely to occur. ukb_df creates variable names by combining a snake_case descriptor with the variable’s field code, index and array. This should be sufficient to uniquely identify the variable. However, if an index_array combination is incorrectly repeated in the original UKB data, this will result in a duplicated variable name. We observed two instances. The variables were encoded -0.0, -1.0, -1.0, and ukb_df created variables named var_0_0, var_1_0, var_1_0. This is probably a typo that should have been -0.0, -1.0, -2.0, consistent with UKB official documentation describing the field as having 3 values for index. We have provided ukb_df_duplicated_names to identify duplicated names within a dataset, so that the user can make changes as appropriate. We expect such duplicates to be rare.
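
The idea of detecting duplicated names can be sketched in base R on a toy data frame. This is not the implementation of ukb_df_duplicated_names, only an illustration of the check. The renaming to var_2_0 assumes you have confirmed the intended index against the UKB documentation.

```r
# Toy data frame with a repeated column name, as in the case described above
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(toy) <- c("var_0_0", "var_1_0", "var_1_0")

# Names that occur more than once, and the column positions involved
dup_names <- unique(names(toy)[duplicated(names(toy))])
dup_positions <- which(names(toy) %in% dup_names)
dup_names
dup_positions

# One possible fix: rename by position, since the name itself is ambiguous
names(toy)[3] <- "var_2_0"
anyDuplicated(names(toy)) == 0
```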

An example fileset

ukbxxxx.tab, ukbxxxx.r, ukbxxxx.html

A minimal example fileset is included with the package, in the subdirectory inst/extdata. This fileset will allow the user to test the read (ukb_df, ukb_df_field) and summarise (ukb_context) functionality.

# To load the example data
path_to_example_data <- system.file("extdata", package = "ukbtools")

df <- ukb_df("ukbxxxx", path = path_to_example_data)

# To create a field code to name key
df_field <- ukb_df_field("ukbxxxx", path = path_to_example_data)

The full path to the raw test data can be retrieved with system.file("extdata", "ukbxxxx.tab", package = "ukbtools").
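
Because system.file() returns an empty string rather than an error when nothing is found, a small guard can make the lookup more robust. This is a minimal sketch. The message text is illustrative.

```r
# system.file() returns "" when the package or directory cannot be found
example_dir <- system.file("extdata", package = "ukbtools")

if (nzchar(example_dir)) {
  # List the example fileset shipped with the package
  print(list.files(example_dir))
} else {
  message("ukbtools example data not found; is the package installed?")
}
```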

Exploring primary demographics of a UKB subset

As an exploratory step, you might want to look at the demographics of a particular subset of the UKB sample relative to a reference sample. For example, using the nonmiss.var argument of ukb_context will produce a plot of the primary demographics (sex, age, ethnicity, and Townsend deprivation score), together with employment status and assessment centre, for the subsample with data on your variable of interest compared to those without data (i.e. NA).

ukb_context(my_ukb_data, nonmiss.var = "my_variable_of_interest")

It is also possible to supply a logical vector with subset.var to define the subset and reference sample. This is particularly useful for understanding a subgroup within the UKB study, e.g., overweight individuals.

subgroup_of_interest <- my_ukb_data$body_mass_index_bmi_0_0 >= 25
ukb_context(my_ukb_data, subset.var = subgroup_of_interest)
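
A logical vector built from a comparison contains NA wherever the underlying variable is missing. The toy example below shows how to inspect this before passing the vector on. How ukb_context treats NA in subset.var is not stated here, so recoding NA to FALSE is only one possible choice, and the data values are invented.

```r
# Toy stand-in for a UKB dataset with one missing BMI value
toy_ukb_data <- data.frame(
  eid = 1:5,
  body_mass_index_bmi_0_0 = c(22.1, 27.4, NA, 31.0, 24.9)
)

subgroup_of_interest <- toy_ukb_data$body_mass_index_bmi_0_0 >= 25

# Missing BMI yields NA: neither in the subset nor in the reference sample
table(subgroup_of_interest, useNA = "ifany")

# One option (an assumption about intent): treat missing as not in the subset
subgroup_no_na <- !is.na(subgroup_of_interest) & subgroup_of_interest
table(subgroup_no_na)
```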