The UK Biobank resource with deep phenotyping and genomic data

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

Understanding the role that genetics has in phenotypic and disease variation, and its potential interactions with other factors, is crucial for a better understanding of human biology. It is hoped that this will lead to more successful drug development [1], and potentially to more efficient and personalized treatments. As such, a key component of the UK Biobank resource has been the collection of genome-wide genetic data on every participant using a purpose-designed genotyping array [2]. An interim release of genotype data on approximately 150,000 UK Biobank participants in May 2015 [3] has already facilitated numerous studies [4-6].

In this paper, we summarize the existing and planned content of the phenotype resource and describe the genetic dataset on the full 500,000 participants. To facilitate its wider use, we applied a range of quality control procedures and conducted a set of analyses that reveal properties of the genetic data, such as population structure and relatedness, that can be important for downstream analyses. In addition, we estimated haplotypes and imputed genotypes into the dataset, which increased the number of testable variants by more than 100-fold to approximately 96 million. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and replicated signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies, which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.
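
To make the idea of a genome-wide association scan a little more concrete, here is a minimal sketch in Python. It is not the pipeline or the tools used in the paper. Everything in it is an assumption made for illustration: the data are simulated, the function names `simulate_cohort` and `association_scan` are hypothetical, and the per-variant test is a plain simple linear regression with no covariates and no correction for population structure or relatedness, which the text above flags as important in real analyses.

```python
import numpy as np


def simulate_cohort(n_samples=2000, n_variants=200, causal_index=17,
                    effect=0.5, seed=0):
    """Simulate additive genotypes (0/1/2) and a trait with one causal variant."""
    rng = np.random.default_rng(seed)
    # Hypothetical allele frequencies; real data would come from genotype files.
    allele_freq = rng.uniform(0.05, 0.5, size=n_variants)
    # Each genotype is the count of alternate alleles across two chromosomes.
    genotypes = rng.binomial(2, allele_freq,
                             size=(n_samples, n_variants)).astype(float)
    noise = rng.normal(size=n_samples)
    phenotype = effect * genotypes[:, causal_index] + noise
    return genotypes, phenotype


def association_scan(genotypes, phenotype):
    """Regress the trait on each variant separately (simple linear regression)."""
    n_samples = phenotype.shape[0]
    # Centering both sides is equivalent to fitting an intercept.
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    sum_sq_g = (g ** 2).sum(axis=0)
    beta = g.T @ y / sum_sq_g
    # Residual sum of squares of each one-variant regression.
    residual_ss = (y ** 2).sum() - beta ** 2 * sum_sq_g
    std_err = np.sqrt(residual_ss / (n_samples - 2) / sum_sq_g)
    return beta, beta / std_err


genotypes, phenotype = simulate_cohort()
beta, t_stat = association_scan(genotypes, phenotype)
top_variant = int(np.argmax(np.abs(t_stat)))
print("strongest association at variant index", top_variant)
```

Each variant is tested on its own, so the scan is just the same small regression repeated across columns. At the scale of roughly 96 million variants and 500,000 people, the same loop needs purpose-built software and file formats, which is why the paper describes dedicated tools.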


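UK Biobank collected samples from about 500,000 people across the United Kingdom, a little under 1% of the UK population, aged 40 to 69 at recruitment. A wide range of phenotypic and health-related information is available for each participant, including biological measurements, lifestyle indicators, blood and urine biomarkers, and imaging of the body and brain. Follow-up information comes from linked health and medical records. Genome-wide genotype data have been collected on all participants, laying the groundwork for discovering new genetic associations and the genetic basis of complex traits. The paper describes the centralized analysis of the genetic data, covering genotype quality, population structure and relatedness, and efficient phasing and genotype imputation, which raise the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was also imputed, which recovered known association signals between human leukocyte antigen alleles and many diseases.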

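Understanding the role of genetic factors in phenotypic and disease variation, and their potential interactions with other factors, is essential to a better understanding of human biology. The hope is that this leads to more successful drug development and possibly to more effective, personalized treatments. A key part of the UK Biobank resource is therefore the genome-wide genetic data collected on every participant with a purpose-designed genotyping array. An interim release of genotype data for about 150,000 UK Biobank participants in May 2015 has already supported numerous studies.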

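In this paper, the authors summarize the existing and planned content of the phenotype resource and describe the genetic dataset for all 500,000 participants. To encourage wide use, they applied a series of quality control procedures and ran analyses that reveal properties of the genetic data, such as population structure and relatedness, that matter for downstream analyses. They also estimated haplotypes and imputed genotypes into the dataset, increasing the number of testable variants more than 100-fold to about 96 million. They imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes and replicated known association signals between HLA alleles and many common diseases. The paper also describes tools for efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies. As a further check of the genotyped and imputed datasets, the authors ran a test-case genome-wide association scan on a well-studied human trait, standing height.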


See you tomorrow