Global Biobank Engine

How should I cite discoveries made using Global Biobank Engine?

We request that any use of data obtained from the Global Biobank Engine be cited in publications using the following format:

Global Biobank Engine, Stanford, CA (URL: http://gbe.stanford.edu) [date (month, year) accessed].
G. McInnes, Y. Tanigawa, C. DeBoever, A. Lavertu, J. E. Olivieri, M. Aguirre, M. A. Rivas, Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics (2019). doi:10.1093/bioinformatics/bty999

We also request that any use of polygenic scores from the Global Biobank Engine be cited in publications using the following format:

Y. Tanigawa, J. Qian, G. R. Venkataraman, J. M. Justesen, R. Li, R. Tibshirani, T. Hastie, M. A. Rivas, Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank. PLOS Genet. 18(3), e1010105 (2022). doi:10.1371/journal.pgen.1010105

We also ask that the developers of the engine be acknowledged as follows:

The authors would like to thank the Rivas lab for making the resource available.

Can I download the data?

Summary statistics of the data presented in GBE can be downloaded from the Rivas Lab GitHub page within the summary stats section.

Coefficients of polygenic scores presented in GBE can be obtained from the sparse polygenic scores analysis page. The significant PRS models are also available in the PGS Catalog (PGP000244 and PGP000128). Score IDs are listed in S1 Table in Tanigawa et al., 2022.

Additionally, the Neale Lab has made the summary statistics from their heritability analysis available here.

How was the data processed for analysis?

Data presented in GBE is from the UK Biobank dataset release version 2. To minimize the impact of confounders and unreliable observations, we used a subset of individuals that satisfied all of the following criteria, as described in Tanigawa et al., 2019: (1) self-reported white British ancestry, (2) used to compute principal components, (3) not marked as outliers for heterozygosity and missing rates, (4) do not show putative sex chromosome aneuploidy, and (5) have at most 10 putative third-degree relatives. These criteria are reported by the UK Biobank in the file “ukb_sqc_v2.txt” in the following columns, respectively: (1) “in_white_British_ancestry_subset”, (2) “used_in_pca_calculation”, (3) “het_missing_outliers”, (4) “putative_sex_chromosome_aneuploidy”, and (5) “excess_relatives”. We removed 151,169 individuals that did not meet these criteria. Similar criteria were applied to the exome sequencing data from UK Biobank.
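The five sample-level criteria can be sketched as a simple row filter. This is a minimal illustration, not the pipeline used for GBE: the column names come from the text above, but the 0/1 coding, the values treated as "keep", and the toy rows are assumptions made for the example.

```python
# Column names are taken from the FAQ text. The "keep" values below are
# ASSUMED for illustration (1 = flag set, 0 = flag not set).
KEEP_VALUES = {
    "in_white_British_ancestry_subset": "1",
    "used_in_pca_calculation": "1",
    "het_missing_outliers": "0",
    "putative_sex_chromosome_aneuploidy": "0",
    "excess_relatives": "0",
}


def passes_sample_qc(row):
    """Return True only if the row satisfies all five criteria."""
    return all(row[column] == keep for column, keep in KEEP_VALUES.items())


# Hypothetical rows; "sample_id" is an illustrative column, not from the source.
rows = [
    dict(KEEP_VALUES, sample_id="A"),                               # passes
    dict(KEEP_VALUES, sample_id="B", excess_relatives="1"),         # fails (5)
    dict(KEEP_VALUES, sample_id="C", used_in_pca_calculation="0"),  # fails (2)
]

kept = [row["sample_id"] for row in rows if passes_sample_qc(row)]
print(kept)
```

Requiring every criterion at once (rather than any of them) mirrors the "satisfied all of the following criteria" wording above.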

We processed summary statistics from Biobank Japan.

We also processed summary statistics from the United States' Million Veteran Program.

How were the Genome-wide Association analyses performed?

Genome-wide association analysis was performed with Firth-fallback using PLINK v2.00a (17 July 2017). We used the following covariates in our analysis: age, sex, array type, and the first four principal components, where array type is a binary variable that represents whether an individual was genotyped with the UK Biobank Axiom Array or the UK BiLEVE Axiom Array. For variants that were specific to one array, we did not use array type as a covariate.

What p-value cutoff should I use for GWAS and PheWAS significance?

Current best practice in genetic association studies is to adjust the p-value significance threshold for the number of associations tested, a method known as the Bonferroni correction. For GWAS, 820,897 tests are performed, one for each variant on the array. For PheWAS, 1,766 tests are performed for each variant, one for each phenotype. Thus the appropriate p-value cutoffs are 6.0x10^-8 for GWAS and 2.8x10^-5 for PheWAS.
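The arithmetic behind these cutoffs can be reproduced in a few lines. The sketch below assumes a family-wise significance level of 0.05, which is not stated in the text but is the value implied by the quoted cutoffs; the function name is illustrative.

```python
ALPHA = 0.05  # ASSUMED family-wise significance level, implied by the cutoffs


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test p-value cutoff after Bonferroni correction."""
    return alpha / n_tests


gwas_cutoff = bonferroni_threshold(820_897)  # one test per variant on the array
phewas_cutoff = bonferroni_threshold(1_766)  # one test per phenotype

print(f"GWAS cutoff:   {gwas_cutoff:.2e}")
print(f"PheWAS cutoff: {phewas_cutoff:.2e}")
```

Under that assumption the results agree with the cutoffs quoted above up to rounding in the last digit.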

How was the rare variant aggregate analysis performed?

The method used for aggregate analysis shown on GBE is described in detail in our manuscript, “Bayesian Model Comparison for rare variant association studies of multiple phenotypes”. Briefly, we run a model called MRP, which considers the correlation, scale, and location of genetic effects across a group of genetic variants, phenotypes, and studies. By sharing information across rare variants and phenotypes, we improve our ability to identify rare variants associated with the disease compared to considering a single rare variant and a single phenotype.

Variants were filtered using the variant filter table.tsv file available on GitHub (commit 6f9f726) to filter variants on the UK Biobank array for use with MRP. We first chose variants with the minor allele frequency less than 1%. We then filtered out all variants with all filters less than one. This removes variants with missingness greater than 1% (calculated on an array-specific basis for array-specific variants) or Hardy-Weinberg equilibrium p < 10^-7. This also removes some PTVs for which manual inspection revealed irregular cluster plots. We LD pruned the variants by only using variants with ld equal to one. We included missense variants and PTVs indicated by the following annotations: missense variant, stop gained, frameshift variant, splice acceptor variant, splice donor variant, splice region variant, start lost, and stop lost.

How should I interpret the Bayes Factor?

The Bayes Factor (BF) is a scoring method used to convey the confidence of one hypothesis over another, i.e., the alternative hypothesis over the null hypothesis. We present a log BF as a measure of support for the results of the rare variant aggregate analyses. In practice, there is no threshold that indicates significance for Bayes Factors, unlike p-values. However, a log BF greater than 3 indicates moderate evidence for the alternative hypothesis. See Kass & Raftery (1995) for a thorough discussion on Bayes Factors.

What do the different variables mean in the Genetic Correlation App?

The purpose of the Genetic Correlation App is to display genetic correlation estimates from the multivariate polygenic mixture model (MVPMM). Users can select phenotypes that are available in GBE from the search box at the bottom of the page.

The following is a description of each of the relevant variables within the application.

gcorr is the genetic correlation estimate for a pair of phenotypes.
gcorr SE is the standard error of the genetic correlation estimate.
z provides the z-score for the genetic correlation between the two phenotypes.
pi2 indicates the fraction of variants included in the model that are predicted to have non-zero effects on the pair of phenotypes.
tau1 and tau2 are estimates of the scale of the genetic effects. They provide information about the relative size of the genetic effects between the two phenotypes.

Users can filter by z-score, pi2, genetic correlation, and phenotype category.

For a video walkthrough of the application, please see this YouTube video.

How are the phenotypes defined?

Cancer phenotypes (cancer phenotype group)

We combined cancer diagnoses from the UK Cancer Register with self-reported diagnoses from the UK Biobank questionnaire to define cases and controls for cancer GWAS. Individual-level ICD-10 codes from the UK Cancer Register, Data-Field 40006, and the National Health Service, Data-Field 41202, in the UK Biobank were mapped to the self-reported cancer codes, Data-Field 20001. The mapping was performed via manual curation of ICD-10 codes for each of the self-reported cancer codes. UKB field codes for self-reported cancer were created with a tree structure such that more specific cancer subtypes (e.g., “malignant melanoma”) are nested under more general categories (“skin cancer”). This tree structure was preserved in the field code to ICD-10 mapping. For example, the self-reported phenotype of “lip cancer” was mapped to its field code, 1010, and the ICD-10 codes for “malignant neoplasm of lip”, C00 and C000-C009. After this mapping, individuals with an affirmative entry in one or more of the phenotype collections (self-reported cancer, cancer registry, and the NHS) were included in the case cohort for the GWAS. No secondary neoplasms were included in the cancer phenotype mappings.
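As a rough illustration of the case definition, the sketch below marks an individual as a case when any of the three sources is affirmative. Only the lip cancer mapping from the example above is filled in; the function name, argument layout, and toy individuals are hypothetical.

```python
# Self-reported cancer code (Data-Field 20001) -> mapped ICD-10 codes.
# Only the lip cancer example from the text is included here.
FIELD_CODE_TO_ICD10 = {
    1010: {"C00"} | {f"C00{digit}" for digit in range(10)},  # C00, C000-C009
}


def is_case(field_code, self_reported, registry_icd10, hospital_icd10):
    """True if any source (self-report, cancer registry, NHS) is affirmative."""
    mapped = FIELD_CODE_TO_ICD10[field_code]
    return (
        field_code in self_reported
        or bool(mapped & set(registry_icd10))
        or bool(mapped & set(hospital_icd10))
    )


# Hypothetical individuals, for illustration only.
print(is_case(1010, self_reported={1010}, registry_icd10=[], hospital_icd10=[]))
print(is_case(1010, self_reported=set(), registry_icd10=["C001"], hospital_icd10=[]))
print(is_case(1010, self_reported=set(), registry_icd10=[], hospital_icd10=["I10"]))
```

The first two individuals are cases (one by self-report, one by registry code); the third has no lip cancer entry in any source.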

High confidence phenotype definitions (Disease_outcome phenotype group)

We combined disease diagnoses from the UK National Health Service Hospital Episode Statistics with self-reported diagnoses from the UK Biobank questionnaire to define cases and controls for non-cancer phenotypes (referred to as “high confidence” phenotypes), using the following procedure. ICD-10 codes (Data-Field 41202) were grouped with closely related self-reported non-cancer illness codes (Data-Field 20002). This was done by first creating a computationally generated candidate list of closely related ICD-10 codes and self-reported non-cancer illness codes, then manually curating the matches. The computational mapping was performed by calculating the token set ratio between the ICD-10 code description and the self-reported illness code description using the FuzzyWuzzy Python package. The high-scoring ICD-10 matches for each self-reported illness were then manually curated to ensure high confidence mappings. Manual curation was required because fuzzy string matching may return words that are similar in spelling but not in meaning. For example, to create a hypertension cohort, the code description from Data-Field 20002 (“Hypertension”) was compared with all ICD-10 code descriptions, and all closely matching codes were returned (“I10: Essential (primary) hypertension” and “I95: Hypotension”). After manual curation, code I10 would be kept and code I95 would be discarded. The following paper describes the disease outcome phenotypes in more detail.

C. DeBoever, Y. Tanigawa, M. Aguirre, G. McInnes, A. Lavertu, M. A. Rivas, Assessing Digital Phenotyping to Enhance Genetic Studies of Human Diseases. The American Journal of Human Genetics. 106, 611-622 (2020). doi:10.1016/j.ajhg.2020.03.007
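The hypertension example can be illustrated with a small standard-library approximation of a token set ratio. This is a sketch built on `difflib`, not the FuzzyWuzzy implementation, so its scores may differ from that package; the score threshold of 80 and the extra "J45: Asthma" candidate are assumptions added for the example.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio(s1, s2):
    """Approximate token set ratio: compare shared tokens with each full set."""
    t1, t2 = _tokens(s1), _tokens(s2)
    shared = " ".join(sorted(t1 & t2))
    full1 = (shared + " " + " ".join(sorted(t1 - t2))).strip()
    full2 = (shared + " " + " ".join(sorted(t2 - t1))).strip()
    return max(_ratio(shared, full1), _ratio(shared, full2), _ratio(full1, full2))


query = "Hypertension"  # self-reported illness description
candidates = {
    "I10": "Essential (primary) hypertension",
    "I95": "Hypotension",
    "J45": "Asthma",  # unrelated code added for contrast
}
scores = {code: token_set_ratio(query, text) for code, text in candidates.items()}
shortlist = [code for code, score in scores.items() if score >= 80]  # assumed cutoff
print(scores)
print(shortlist)
```

Both I10 and I95 clear the cutoff here even though only I10 is clinically correct, which is why the candidate list still needs manual curation.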

Family history phenotype definitions (Family_history phenotype group)

We used data from Category 100034 (Family history–Touchscreen–UK Biobank Assessment Centre) to define cases and controls for family history phenotypes. This category contains data from the touchscreen questionnaire on questions related to family size, sibling order, family medical history (of parents and siblings), and age of parents (age at death if deceased). We focused on Data-Field 20107 (Illness of father) and Data-Field 20110 (Illness of mother).

Blood and urine Biomarkers (Biomarkers phenotype group)

We applied technical covariate correction for 35 blood and urine biomarker phenotypes. Those phenotypes are listed under the "Biomarkers" group, and the phenotyping procedure is described in detail in the following manuscript:

N. Sinnott-Armstrong*, Y. Tanigawa*, et al., Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat Genet. 53(2), 185-194 (2021). doi:10.1038/s41588-020-00757-z (full text via ReadCube)

Other quantitative and binary phenotypes

We defined quantitative and binary phenotypes using data fields in UK Biobank. Typically, we used one data field to derive one phenotype in Global Biobank Engine. Sometimes, we defined multiple phenotypes in Global Biobank Engine from a single data field in UK Biobank. This is the case, for example, when the source UK Biobank field contains categorical answers.

For example, field 50 in UK Biobank represents standing height. As you can see on the field description page, each individual may have up to 4 measurements across multiple time points. Using the information in this field, we defined Standing height (GBE phenotype code: INI50) phenotype by taking the median of non-NA values as described in the following publication:

Y. Tanigawa*, J. Li*, J. M. Justesen, H. Horn, M. Aguirre, C. DeBoever, C. Chang, B. Narasimhan, K. Lage, T. Hastie, C. Y. Park, G. Bejerano, E. Ingelsson, M. A. Rivas, Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nat Commun. 10, 4064 (2019). doi:10.1038/s41467-019-11953-9

We processed other quantitative phenotypes in the same way (i.e. the non-NA median was taken).
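The non-NA median can be sketched as follows. Representing missing measurements as `None` or NaN, and the toy height values, are assumptions made for this illustration.

```python
import math
from statistics import median


def non_na_median(values):
    """Median of the observed values, ignoring None and NaN; None if all missing."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    return median(observed) if observed else None


# Hypothetical repeated height measurements (cm) for one individual,
# with up to four slots and two of them missing.
heights_cm = [172.0, None, 171.0, float("nan")]
print(non_na_median(heights_cm))
```

With two observed values the median is their midpoint, and an individual with no observed values gets no phenotype value.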

For non-quantitative fields in UK Biobank, we performed manual curation to define cases and controls. Sometimes, we split categorical results in a single field in UK Biobank into a series of binary traits. For cancer, family history, and disease outcome phenotypes, please see the descriptions above for how we defined cases and controls. For other phenotypes, we will update the description of those phenotypes in the future.

How are the phenotypes grouped?

We provide phenotype groupings in Global Biobank Engine. These groupings are shown, for example, in the PheWAS plot on the variant page. The disease outcome, cancer, family history, and biomarker phenotypes are each grouped into a single category. For other phenotypes, we used a modified version of the "Primary Category of Origin" information in the UK Biobank field browser. Specifically, we removed several of the most specific bottom-level groupings so that we have a moderate number of groups.

For standing height, for example, the source UK Biobank Field has the following "Primary Category of Origin" information: UK Biobank Assessment Centre / Physical measures / Anthropometry / Body size measures. Using this information, we classified our Standing height (GBE phenotype code: INI50) phenotype in the Anthropometry group.

Frequently Asked Questions
