GWAS原理和流程全基因组关联分析
Linkagedisequilibrium(LD)连锁不。。。
GWAS⼊门必看教程:
名词解释和基本问题:
关联分析:就是AS的中⽂,全称是GWAS。应⽤基因组中数以百万计的单核苷酸多态;SNP为分⼦遗传标记,进⾏全基因组⽔平上的对照分析或相关性分析,通过⽐较发现影响复杂性状的基因变异的⼀种新策略。在全基因组范围内选择遗传变异进⾏基因分析,⽐较异常和对照组之间每个遗传变异及其频率的差异,统计分析每个变异与⽬标性状之间的关联性⼤⼩,选出最相关的遗传变异进⾏验证,并根据验证结果最终确认其与⽬标性状之间的相关性。
连锁不平衡:LD,P(AB)= P(A)*P(B)。不连锁就独⽴,如果不存在连锁不平衡——相互独⽴,随机组合,实际观察到的体中单倍体基因型 A和B 同时出现的概率。P (AB) = D + P (A) * P (B) 。D是表⽰两位点间LD程度值。
曼哈顿图:在⽣物和统计学上,做频率统计、突变分布、GWAS关联分析的时候,我们经常会看到⼀些⾮常漂亮的manhattan plot,能够对候选位点的分布和数值⼀⽬了然。位点坐标和pvalue。map⽂件⾄少包
含三列——染⾊体号,SNP名字,SNP物理位置。assoc⽂件包含SNP名字和pvalue。haploview即可画出。
SNP的本质属性是什么?⼴义上讲是变异:most common type of genetic variation,平级的还有indel、CNV、SV。Each SNP represents a difference in a single DNA building block, called a nucleotide. 狭义上讲是标记:biological markers,因为SNP是单碱基的,所以SNP⼜是⼀个位点,标记了染⾊体上的⼀个位置。⼤部分⼈的基因组,99%都是⼀模⼀样的,还有些SNP的位点,就是⼀些可变的位点,在⼈中有差异。这些差异/标记可以⽤于疾病的分析,根据统计学原理,出与疾病最相关的位点,从⽽确定某个疾病的risk allele。
SNP array是如何⼯作的?SNP array测得不是单个碱基,⽽是allele。所以GWAS的结果是三种:(1 - AA; 2 - AB; 3 - BB),也可能是0、1、2.
linkage disequilibrium (LD)和 pairwise correlation的区别?
如何鉴定Somatic vs Germline Mutations?In multicellular organisms, mutations can be classed as either somatic or germ-line。必须做通常需要trios或healthy tissue的测序才能确定。最显然的是cancer⾥⼤部分都是somatic的variations。
SNP、variant和mutation有什么区别?SNP是中性的,mutation显然和疾病相关;其次就是频率,频率很⾼的是SNP,mutation则很低。variant和variation是同义词,因此和SNP是等价的。
为什么还需要haplotype?HapMap计划的动机是什么?The HapMap is valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag SNPs.
common variant和rare variant是根据什么来区别的?怎么理解这⾥的common和rare?variant就是SNP,”常见的变异“,SNP就是位点,⼀个位点怎么能说常见和不常见呢?这⾥是有点反直觉的。这⾥的common说的是minor allele,就是the second most common allele。⽐如⼀个,它的位置可知,在不同⼈中的allele frequency可知,总体的MAF是0.39 (T)。⼀个SNP的MAF<1%,那就是rare variant。直觉理解就是这个位点的碱基在⼈中很少发⽣变化。rare variants (MAF < 0.05) appeared more frequently in coding regions than common variants (MAF > 0.05) in this population
Genetic variants that are outside the reach of the most statistically powered association studies [13] are thought to contribute to the missing heritability of many human traits, including common variants (here denoted by minor allele frequency [MAF] >5%) of very weak effect, low-frequency (MAF 1–5%) and rare variants (MAF <1%) of small to modest effect, or a combination of both, with several possible scenarios all deemed plausible in simulation studies [14].
为什么genetic这么执着于MAF?
fine是什么意思因为从进化⾓度,risk allele更有可能是minor allele,⾃然选择。不绝对,但可以说是富集。看⽂章:
common variants together account for a small proportion of heritability estimated from family studies,common variants通常都在⾮编码区,占总variants的很⼩⼀部分,同时effect size也⽐较低。
SNP的small effect和large effect是什么意思?effect size
极其容易搞混的术语:SNP、mutation、variant、allele、genotype。Allele frequency、Genotype frequency,alternative allele frequency、MAF。⼀定要能快速区分这些术语的差异,否则你做的就是假的统计遗传学。
gene-based rare-variant burden tests是⽤来⼲什么的?Increased Burden of Rare Variants Among S-HSCR。
epistatic effects是什么?
为什么说L-HSCR是autosomal dominant?很难说是完全的线性,显隐性的关系是⾮常复杂的,存在不完全和剂量效应。
DNA序列⾓度如何看待等位基因,显隐性的关系?,allele在基因上的组合,传统的等位基因是⾮常抽象的概念。我们是两倍体,对每个基因来说,我们都有两个等位基因,杂合的话,这两个基因序列就不同,表达出来的蛋⽩也就不同,⽽且两个等位基因有复杂的显隐性关系。所以说我们传统的基因表达分析其实是很粗糙的,最好要做到isoform层次的表达,毕竟基因离蛋⽩还是有⼀段距离。现在之所以还没做到isoform⽔平,⼤部分原因是我们对蛋⽩的研究还不够。
⼀个新的课题,全球范围内,⼈种是如何逐步分化到今天,哪些核⼼的遗传因素决定了⼈种的表型差异;其次,不同的⼈种在某些疾病上为什么会出现显著的频率差异,为什么亚洲⼈的HSCR发病率会更⾼?遗传因素在其中发挥了什么作⽤?
遗传效应:Additive genetic effects occur when two or more genes source a single contribution to the final phenotype, or when alleles of a single gene (in heterozygotes) combine so that their combined effects equal the sum of their individual effects.[1][2] Non-additive genetic effects involve dominance (of alleles at a single locus) or epistasis (of alleles at different loci). 就是risk allele的数量和患病率之间成正⽐。⼈类基因组⾥有多少个variant/SNP?⾥的数据是84.4 million,这是保守数据,因为只包括了2504个⼈,相当于每个population只测了100个⼈,虽然具有⼀定的代表,性,但实际肯定更多,那就保守估计⼀下300 million吧,那就真是百分之⼀了,也就是100个碱基⾥就有⼀个variant。算到个体,就是3 million左右,也就是万分之⼀。
先从直觉上理解⼀下GWAS的原理:
核⼼就是SNP与表型的关联,对于每⼀个genome位点,如果某个SNP总是与某疾病同时出现 SNP与phenotype这两个维度协同变化,那我们就可以推测这个SNP极有可能与此phenotype(疾病)相关。
规范点讲就是看某个SNP在case和control两个population间是否有allel frequency的显著差异。
⽽现实情况是,我们样本数有限,⽽且有时候control和case样本不平衡,样本还分男⼥、⼈,⽽且我们需要对3亿个碱基位点都做统计检验。
我们应该设计哪些指标来评价⼀个snp与表型的关联呢?
思考:如果⼀个位点有多个SNP,⽽只有其中的⼀个SNP与疾病相关怎么办?错误认知,⼀个基因组位点只能有⼀个SNP,可以有很多种allele。
牢记:曼哈顿图中的点代表的不是样品,⽽是SNP。
思考:曼哈顿图中,显著的SNP并不是鹤⽴鸡的冒出来,⽽是似乎被捧出来的,就像⾼楼⼤厦⼀样,从底下逐步冒出来的。这⼀座⼤厦其实就是连锁在⼀起的SNP,具有很⾼的LD score。
思考:虽然曼哈顿图⾥每个点是SNP,但是通常都会把最显著的SNP指向某个基因,因为⼤家最关注的还是SNP的致病根源,但这样出来的只有编码区的SNP。
注意:最突出的SNP极有可能不是causal SNP,它只是near the causal SNP。问题就来了,怎么causal SNP呢?fine mapping
基本背景
什么是SNP?进化过程中随机产⽣的单点突变,并能稳定的在体中遗传。
什么是allele frequency in population?每⼀个genome位点都有两个或多个allele,不同allel之间有明显的频率上的差异,简单点理解就是A 和a两个性质的频率,但这⾥是碱基位点,⽽不是性状基因。
GWAS分析的前提
sample size⾜够,学过统计的都知道sample size会影响power,没有⾜够的power是得不出正确结论的,GWAS通常需要⼤量的样本,⼏千是标配,⼏百就太少,现在有的都达到了⼏万⼏⼗万级别;
⼀个⼤误区就是GWAS会测全基因组WGS,其实不是的,那太贵了,⼤部分是做DNA chip DNA芯⽚(专业的叫SNP array),只包含了常见的10^6个SNP。稍微有钱的就会上WES,就会得到所有编码区的SNP;最有钱的就是WGS了,全部检测,编码⾮编码,常见罕
见,1000genome就是靠这个才NB的。
⼤致原理已经讲了,其实还有统计原理,暂时略过,先看实操。
怎么⽤PLINK来做GWAS?油管视频:⾥⾯有paper、⽰例数据、代码下载,可以跑跑熟悉⼀下。
参考:
Genotype Calling () and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays 发表了paper的,GWAS pipeline:。
⼀下着重讲解⼀下这个流程的操作细节:
主要是四⽅⾯的分析:
1. All essential GWAS QC steps along with scripts for data visualization.
2. Dealing with population stratification, using 1000 genomes as a reference.
3. Association analyses of GWAS data.
4. Polygenic risk score (PRS) analyses.
先看下PLINK的⽂本⽂件格式:
ped:⾏是个体,列是表型和SNP的基因型数据;
map:snp的特征数据;
⼆进制有三个格式:
主要就是把ped拆成了fam和bed,map变成了bim。
通常要做covariate分析,所以还有个covariate⽂件。
QC:
Step Command Function
1: Missingness of SNPs and individuals ‐‐geno Excludes SNPs that are missing in a large proportion of the subjects. In this step, SNPs with low genotype calls are removed.
‐‐mind Excludes individuals who have high rates of genotype missingness. In this step, individual with low genotype calls are removed.
2: Sex discrepancy‐‐check‐sex Checks for discrepancies between sex of the individuals recorded in the dataset and their sex based on X chromosome heterozygosity/homozygosity rates.
3: Minor allele
frequency (MAF)
‐‐maf Includes only SNPs above the set MAF threshold.
4: Hardy–Weinberg
equilibrium (HWE)
‐‐hwe Excludes markers which deviate from Hardy–Weinberg equilibrium.
5: Heterozygosity For an
example
script see
Excludes individuals with high or low heterozygosity rates
6: Relatedness‐‐genome Calculates identity by descent (IBD) of all sample pairs.
‐‐min Sets threshold and creates a list of individuals with relatedness above the chosen threshold. Meaning
that subjects who are related at, for example, pi‐hat >0.2 (i.e., second degree relatives) can be detected.
7: Population stratification ‐‐genome Calculates identity by descent (IBD) of all sample pairs.
‐‐cluster ‐‐
mds‐plot k
Produces a k‐dimensional representation of any substructure in the data, based on IBS.
fine mapping
⼀个常识就是GWAS是2007年才出现得,所以2017年才出了篇有名的综述ten years of GWAS,fine mapping是GWAS后才出现得。
实验室很早就开始研究fine mapping了:2009 -
看⼀下introduction,什么是fine mapping?
⽬的很简单:GWAS到的⼤多不是causal variants,fine mapping就是就fill这个gap。
GWAS得到⼤体的SNP后,必须做两⽅⾯的深⼊分析:
第⼀步就是对SNP给⼀个概率上的causality,这就是fine-mapping;第⼆步就是根据功能注释来确定该SNP确实能导致某个基因。The first is to assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping. The second step is to try to connect these variants to likely genes whose perturbation leads to altered disease risk by functional annotation.
基本原理:
-
Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited nu
mber of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity. eQTL最⼤的问题还是celltype的特异性不够,关键还是要celltype的定义⾜够精准!
现在GWAS已经属于⽐较古⽼的技术了,主要是碰到严重的瓶颈了,单纯的snp与表现的关联已经不够,需要具体的⽣物学解释,这些snp 是如何具体导致疾病的发⽣的。
⽽且,⼤多数病到的都不是个别显著的snp,⼤多数都到了很多的snp,⽽且snp都落在⾮编码区了,这就导致对这些snp的解读⾮常的困难。
经典解读看这篇新英格兰杂志上的⽂章:
GWAS的核⼼结果就两个,曼哈顿图和QQ-plot,看懂就够了。
单纯会跑GWAS pipeline已经没什么价值了,现在重在下游的分析,有⼏个热点:
Polygenic risk score (PRS) analyses
meta-analysis
common和rare就是根据allele frequency来界定的,但是似乎没有明确界限。
HapMap⽤的是array,所有测得都是⼀些⼈为挑的点,所以就是common snps;⽽1000 genomes是WGS,所以包含了所有的点,所以有common和rare⼀起。
GWAS和核⼼就是LD,⽬前⼤部分的GWAS都是测得array,因为便宜。
GWAS会漏掉很多点,所以才会有fine-mapping,根据haplotype来做⼀些imputation。
Linkage disequilibrium (LD)连锁不平衡:不同基因座位的各等位基因在⼈中以⼀定的频率出现。在某⼀体中,不同座位某两个等位基因出现在同⼀条染⾊体上的频率⾼于预期的随机频率的现象。(就是
孟德尔的分离不是随机的,在染⾊体上越靠近的allele越倾向于绑在⼀起,属于物质性的限制。)
例如两个相邻的基因A B, 他们各⾃的等位基因为a b. 假设A B相互独⽴遗传,则后代体中观察得到的单倍体基因型 AB 中出现的P(AB)的概率为 P(A) * P(B).实际观察得到体中单倍体基因型 AB 同时出现的概率为P(AB)。计算这种不平衡的⽅法为: D = P(AB)- P(A) * P(B).
事实上,可以检测遍布基因组中的⼤量遗传标记位点snp,或者候选基因附近的遗传标记来寻到因为与致病位点距离⾜够近⽽表现出与疾病相关的位点,这就是等位基因关联分析或连锁不平衡定位基因的基本思想。
待看的paper:
assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping.
还有⼀些⾮常重要的概念:
effect size:效应量
power:功效,power analyses
Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes
在语境⾥理解:One explanation of the missing heritability is that complex diseases are caused by a large number of causal variants with small effect sizes.
PRS combines the effect sizes of multiple SNPs into a single aggregated score that can be used to predict disease risk
haplotype phasing单倍体分型
Positions with 00 and 11 are called homozygous positions. Positions with 10 or 01 are called heterozygous positions. We note that the reference genome is neither the paternal nor the maternal genome but the genome of an un-related human (or more precisely the mixture of genomes of a few individuals). An individual’s haplotype is the set of variations in that individual’s chromosomes. We note that as any two human haplotypes are 99.9% similar, the mapping problem can be solved quite easily.
Haplotype phasing is the problem of inferring information about an individual’s haplotype. To solve this problem, there are many methods.
参考: |
vcftools
plink的主要功能:数据处理,质量控制的基本统计,体分层分析,单位点的基本关联分析,家系数据的传递不平衡检验,多点连锁分析,单倍体关联分析,拷贝数变异分析,Meta分析等等。
⾸先必须了解plink的三种格式:bed、fam和bim。(注意:这⾥的bed和我们genome⾥的区域⽂件bed完全不同)
plink需要的格式⼀般可以从vcf⽂件转化⽽来 (顺便了解⼀下ped和map两种格式):
PED: Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. 谱系信息和基因型信息。每⼀⾏是⼀个⼈。
MAP: Variant information file accompanying a .ped text pedigree + genotype table. 变异信息。每⼀⾏是⼀个变异 | snp。
# PED
1 1 0 0 1  0    G G
2 2    C C
1 2 0 0 1  0    A A    0 0    A C
1 3 1
2 1  2    0 0    1 2    A C
2 1 0 0 1  0    A A    2 2    0 0
2 2 0 0 1  2    A A    2 2    0 0
2 3 1 2 1  2    A A    2 2    A A
# MAP
1 snp1 0 1
1 snp
2 0 2
1 snp3 0 3
# vcf转ped和map
plink --vcf file.vcf --recode --out file
# ped和map转bed、bim和fam
plink --file test --make-bed --out test
bed⽂件(真实的bed⽂件是⼆进制的,⽐较难读)
bed:Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. 基因型信息。所以转换后就是⼀个matrix,每⼀⾏是⼀个个体,每⼀列就是⼀个变异。其中0、1、2分别对应了aa、Aa或aA和AA。不考虑碱基型,因为我们不关注ATGC的变化。
fam:Sample information file accompanying a .bed binary genotype table. 样本信息。每⼀⾏就是⼀个样本。
bim:Extended variant information file accompanying a .bed binary genotype table. 每⼀⾏是⼀个变异,及其注释信息。
rs4970383 rs3748592 rs9442373 rs1571150 rs6687029