⼀篇⽂章学会miRNA-seq分析
第⼀讲:⽂献选择与解读
前阵⼦逛BioStar论坛的时候看到了⼀个关于miRNA分析的问题,提问者从NCBI的SRA中下载⽂献提供的原始数据,然后处理的时候出现了问题。我看到他列出的数据来⾃iron torrent测序仪,⽽且我以前也没有做过miRNA-seq的数据分析,就⾃学了⼀下。因为我有RNA-seq的基础,所以理解学习起来⽐较简单。
在这⾥记录⾃⼰的学习过程,希望对需要的朋友有帮助。
这⾥选择的⽂章是2014年发表的,作者⽤ET-1刺激human iPSCs (hiPSC-CMs) 细胞前后,观察miRNA和mRNA表达量的变化,我并没有细看⽂章的⽣物学意义,仅仅从数据分析的⾓度解读⼀下这篇⽂章,mRNA表达量⽤的是Affymetrix Human Genome U133 Plus 2.0 Array,分析起来特别容易。得到表达矩阵,然后⽤limma这个包差异表达基因即可。
但是miRNA分析起来就有点⿇烦了,作者⽤的是iron torrent测序仪。不过本次从SRA数据中⼼下载的是已经去掉接头的fastq格式测序数据,所以这⾥其实并不需要考虑测序仪的特异性。
关于该⽂章的⼏个资料
paper : /plosone/article?id=10.1371/journal.pone.0108051
Aggarwal P, Turner A, Matter A, Kattman SJ et al. RNA expression profiling of human iPSC-derived cardiomyocytes in a cardiac hypertrophy model. PLoS One 2014;9(9):e108051. PMID: 25255322
The accession numbers are 1. SuperSeries (mRNA+miRNA) - GSE60293
mRNA expression array - GSE60291 (Affymetrix Human Genome U133 Plus 2.0 Array)
miRNA-Seq - GSE60292 (Ion Torrent)
GEO : bi.v/geo/i?acc=GSE60292
FTP : ftp://bi.v/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045420
接下来我们要知道⽂章做了哪些分析,然后才能⾃⼰模仿看是否可以得到同样的分析结果。
⽂章数据处理流程
Ion Torrent's Torrent Suite version 3.6 was used for basecalling
Raw sequencing reads were aligned using the SHRiMP2 aligner and were aligned against the human reference genome (hg19) for novel miRNA prediction and then against a custom reference sequence fi
le containing miRBase v.20 known human miRNA hairpins, tRNA, rRNA, adapter sequences and predicted novel miRNA sequences.(Genome_build: hg19, miRBase v.20 human miRNA hairpins)
The miRDeep2 package (default parameters) was used to predict novel (as yet undescribed) miRNAs
Alignments with less than 17 bp matches and a custom 3′ end phred q-score threshold of 17 were filtered out.
miRNA quanitification was done using HTSeq v0.5.3p3 using the default union parameter. Differential miRNA expression was analyzed using the DESeq (v.1.12.1) R/Bioconductor package
In this study, differentially expressed genes that had a false discovery rate cutoff at 10% (FDR<=0.1), a log2 fold change greater than 1.5 and less than −1.5 were considered significant.
Target gene prediction was performed using the TargetScan (version 6.2) database
We also used miRTarBase (version 4.3), to identify targets that have been experimentally validated
miR-Deep2 and miReap predict exact precursor sequence according from mature sequence
参考基因组(两条⽐对线路),获得
获得miRNA表达量,miRNA预测,miRNA靶基因预测。
质量控制标准,数据⽐对⼯具,⽐对的
⽂章提到了fastq数据质量控制标准,
数据⽐对⼯具,⽐对的参考基因组(两条⽐对线路),
这也是我们学习miRNA-seq的数据分析的标准套路,⽽且作者给出了所有的分析结果,我们完全可以通过⾃⼰的学习来重现他的分析过程。
Supplementary files format and content: tab-delimited text files containing raw read counts for known mature human miRNAs.(表达矩阵)
We detected 836 known human mature miRNAs in the control-CMs and 769 in the ET1-CMs
Based on our miRNA-Seq data, we predicted 506 sequences to be potentially novel, as yet undescribed miRNAs.
In order to validate the expression profiles of the miRNAs detected, we performed RT-qPCR on a subset of five known human mature and five of our predicted novel miRNAs.
we obtained a total of 1,922 predicted miRNA-mRNA pairs represented by 309 genes and 174 known mature human miRNAs.
当然仅仅是套路分析还不够,所以他进⾏了 miRNA和mRNA 进⾏⽹络分析并做了少量湿实验来验证,最后还扯了⼀些⽣物学意义。
第⼆讲:搜集学习资料
因为我是完全从零开始⼊门miRNA-seq分析,所以收集的资料⽐较齐全。
了解miRNA测序是怎么回事,该分析什么,然后主要围绕着第⼀讲⽂献⾥的分析步骤来搜索资料。
⾸先看了部分中⽂资料,了解
miRNA定义
我⾸先到了miRNA定义:/content/34/suppl_1/D135.full
MicroRNAs (miRNAs) are small RNA molecules, which are ∼22 nt sequences that have an important role in the translational regulation and degradation of mRNA by the
base's pairing to the 3′-untranslated regions (3′-UTR) of the mRNAs. The miRNAs are derived from the precursor transcripts of ∼70–120 nt sequences, which fold to form as stem–loop structures, which are thought to be highly conserved in the evolution of genomes. Previous analyses have suggested that ∼1% of all human genes are miRNA genes, which regulate the production of protein for 10% or more of all human coding genes。
选择参考序列
然后我⽐较纠结的问题是参考序列如何选择,因为miRNA序列很少,把它map到3G⼤⼩的⼈类基因组有点浪费计算资源,正好我的服务器⼜坏了,不想太⿇烦,想⽤⾃⼰的个⼈电脑搞定这个学习过程。
我看到很多帖⼦提到的都是⽤bowtie⽐对到参考
参考miRNA数据库(miRNA count: 28645 entries)/,从这个数据库,我明⽩了前体miRNA和成熟的miRNA的区别,前体miRNA长度⼀般是70–120 碱基,⼀般是茎环结果,也就是发夹结构,所以叫做hairpin。成熟之后,⼀般22 个碱基,在miRNA数据库很容易下载到这些数据,⽬前⼈类这个物种已知的成熟miRNA共有2588条序列,⽽前体miRNA共有1881条序列,我下载(下载时间2016年6⽉)的代码如下所⽰:
wget ftp:///pub/mirbase/CURRENT/  ## 28645 readswget ftp:///pub/mirbase/CURRENT/mature.fa.zip  ##  35828 reads wget
perl怎么下载
ftp:///pub/mirbase/CURRENT/hairpin.fa.zipwget ftp:///pub/mirbase/CURRENT/genomes/hsa.gff3 ##wget ftp:///pub/mirbase/CURRENT/miFam.dat.zipgrep sapiens mature.fa |wc   # 2588 grep sapiens hairpin.fa |wc      # 1881 ## Homo sapiensperl -alne '{if(/^>/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;s/U/T/g if !/>/;print }' hairpin.fa >hairpin.human.faperl -alne '{if(/^>/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;s/U/T/g if !/>/;print }' mature.fa >mature.human.fa# 这⾥值得⼀提的是miRBase数据库下载的序列,居然都是⽤U表⽰的,也就是说就是miRNA序列,⽽不是转录成该miRNA的基因序列,⽽我们测序的都是基因序列。
通过这个代码制作的hairpin.human.fa 和 mature.human.fa 就是本次数据分析的参考基因组。
在搜集资料的过程中,我看到了⼀篇⽂献讲挖掘1000genomes的数据到位于miRNA的snp位点
genomemedicine.biomedcentral/articles/10.1186/gm363
看起来⽐较新奇,不过跟本次学习过程没有关系,我就是记录⼀下,有空回来学习学习。
博客讲解如何分析miRNA数据
genomespot.blogspot/2013/08/quick-alignment-of-microrna-seq-data-to.html
公司数据分析流程:
bioinfo5.ugr.es/miRanalyzer/miRanalyzer_tutorial.html
www.partek/sites/default/files/Assets/UserGuideMicroRNAPipeline.pdf
partek/Tutorials/microarray/microRNA/miRNA_tutorial.pdf
www.arraystar/reviews/microrna-sequencing-data-analysis-guideline/
bioinfo5.ugr.es/sRNAbench/sRNAbench_tutorial.pdf
adthedocs.io/mirna_annotation.html
耶鲁⼤学
www.yale.edu/giraldezlab/miRNA.html
南⽅基因
www.southgene/newsshow.php?cid=55&id=73
miRNA研究整套⽅案
wenku.baidu/view/5f38577a31b765ce05081429.html?re=view
Biostar 讨论帖⼦
/p/3344/
/p/98486/
miRNA-seq数据处理实战指南
/content/early/2015/04/17/bib.bbv019.full
直接⽤⼀个包搞定
/packages/release/bioc/html/easyRNASeq.html
github流程:miRNA Analysis Pipeline v0.2.7
github/bcgsc/mirna/tree/master/v0.2.7
tools.thermofisher/content/sfs/manuals/CO25176_0512.pdf
miRNA annotation
adthedocs.io/mirna_annotation.html
⽹页版分析⼯具
/projects/clsi/images/2/2f/HTS2014miRNA analysis Lifeportal14final.pdf
aining.prace-ri.eu/uploads/tx_pracetmo/NGSdataAnalysisWithChipster.pdf
可视化IGV User Guide
/igv/book/export/html/6
⽐较特殊的是新的miRNA预测,miRNA靶基因预测,这块软件太多并没有成型的流程和标准。
第三讲:下载公共数据
前⾯已经讲到了该⽂章的数据已经上传到NCBI的SRA数据中⼼,所以直接根据索引号下载,然后⽤SRAtoolkit转出我们想要的fastq测序数据即可。
下载的数据⼀般要进⾏质量控制,可视化展现⼀下质量如何,然后根据⼤题测序质量进⾏简单过滤。所以需要提前安装⼀些软件来完成这些任务,包括:sratoolkit /fastx_toolkit
/fastqc/bowtie2/hg19/miRBase/SHRiMP
下⾯是我⽤新服务器下载安装软件的⼀些代码记录,因为fastx_toolkit /fastqc我已经安装过,就不列代码了
## pre-step: download sratoolkit /fastx_toolkit_0.0.13/fastqc/bowtie2/hg19/miRBase/SHRiMP## bi.v/Traces/i?view=software##
bi.v/books/NBK158900/## 我这⾥特意挑选的⼆进制版本程序下载的,这样直接解压就可以⽤,但是需要挑选适合⾃⼰的操作系统的程序。cd ~/biosoftmkdir sratoolkit &&  cd sratoolkitwget bi.v/sra/sdk/2.6.3/sratoolkit.2.6.3-centos_####  Length: 63453761 (61M) [application/x-gzip]##  Saving to: "sratoolkit.2.6.3-
centos_"tar zxvf sratoolkit.2.6.3-centos_d ~/biosoftmkdir bowtie &&  cd bowtiewget sourceforge/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-linux-
x86_64.zip/download#Length: 27073243 (26M) [application/octet-stream]#Saving to: "download"mv download  bowtie2-2.2.9-linux-x86_64.zipunzip bowtie2-2.2.9-linux-x86_64.zip##
x86_64.zip/download#Length: 27073243 (26M) [application/octet-stream]#Saving to: "download"mv download  bowtie2-2.2.9-linux-x86_64.zipunzip bowtie2-2.2.9-linux-x86_64.zip##
o.edu/shrimp/mkdir SHRiMP &&  cd SHRiMPwget o.edu/shrimp/releases/SHRiMP_2_2_3.lx26.x86_ar zxvf SHRiMP_2_2_3.lx26.x86_ cd SHRiMP_2_2_3export SHRIMP_FOLDER=$PWD  ## 这个软件使⽤的时候⽐较奇葩,需要设置到环境变量,不能简单的调⽤全路径
SHRiMP这个软件⽐较⼩众,我也是第⼀次听说过。
本来我计划是能⽤bowtie搞定,但是第⼀次⽐对出了⼀个bug,就是下载的miRNA序列⾥⾯的U没有转换成T,所以导致⽐对率⾮常之低。于是我不得不根据⽂章⾥⾯记录的软件SHRiMP 来做⽐对,最后发现⽐对率完全没有改善,搞得我都在怀疑是不是作者乱来了。
下⾯是下载数据,质量控制的代码,希望⼤家可以照着运⾏⼀下。
## step1 : download raw datamkdir miRNA_test && cd miRNA_testecho {14..19} |sed 's/ /\n/g' |while read id; \do  wget "ftp://bi.v/sra/sra-
instant/reads/ByStudy/sra/SRP/SRP045/SRP045420/SRR15427$id/SRR15427$id.sra"  ;\done## step2 :  change sra data to fastq files.## 主要是⽤shell脚本来批量下载ls *sra |while read id; do
~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump $id;donerm *sra##  33M --> 247M#Read 1866654 spots for SRR1542714.sra#Written 1866654 spots for SRR1542714.sra## step3 : download the results from paper## www.bio-info-trainee/1571.html## ftp://bi.v/geo/series/GSE1nnn/GSE1009/suppl/GSE1009_RAW.tarmkdir paper_results && cd
paper_resultswget ftp://bi.v/geo/series/GSE60nnn/GSE60292/suppl/GSE60292_RAW.tar## tar xvf GSE60292_RAW.tarls *gz |while read id ; do (echo $id;zcat $id | cut -f 2 |perl -alne
'{$t+=$_;}END{print $t}');donels *gz |xargs gunzip## step4 : quality assessmentls *fastq | while read id ;
do ~/biosoft/fastqc/FastQC/fastqc $id;done## Sequence length 8-109## %GC 52## Adapter Content passed## write a script : :: cat >filter.shls *fastq |while read iddoecho $id~/biosoft/fastx_toolkit_0.0.13/bin/fastq_quality_filter -v -q 20 -p 80 -Q33  -i $id -o tmp
;~/biosoft/fastx_toolkit_0.0.13/bin/fastx_trimmer -v -f 1 -l 27 -i tmp  -Q33 -z -o ${id%%.*}_ ;donerm tmp## discarded 12%~~49%%ls *_ | while read id ; do
~/biosoft/fastqc/FastQC/fastqc $id;donemkdir QC_resultsmv *zip *html QC_results
这个代码是我⾃⼰根据⽂章的理解写出的,因为我本⾝不擅长miRNA数据分析,所以在进⾏QC的时候参数选择可能并不是那么友好.
~/biosoft/fastx_toolkit_0.0.13/bin/fastq_quality_filter -v -q 20 -p 80 -Q33  -i $id -o tmp ;~/biosoft/fastx_toolkit_0.0.13/bin/fastx_trimmer -v -f 1 -l 27 -i tmp  -Q33 -z -o ${id%%.*}_ ;
最后得到的系列⽂件,就是我需要进⾏⽐对的序列。
第四讲:测序数据⽐对
序列⽐对是⼤多数类型数据分析的核⼼,如果要利⽤好测序数据,⽐对细节⾮常重要,我这⾥只是研读
⼀篇⽂章也就没有对⽐对细节过多考虑,只是列出⾃⼰的代码和⾃⼰的⼏点思考,⼒求重现⽂章作者的分析结果。
对miRNA-seq数据有两条⽐对策略
1. 下载miRBase数据库⾥⾯的已知miRNA序列来进⾏⽐对
2. 直接⽐对到参考基因组(⽐如⼈类的是hg19/hg38)
前⾯的⽐对⾮常简单,⽽且很容易就可以数出已经的所以miRNA序列的表达量;后⾯的⽐对有点耗时,⽽且算表达量的时候也不是很⽅便,但是它的优点是可以来预测新的miRNA,所以⼤多数⽂章都会把这两条路给⾛⼀下。
本⽂选择的是SHRiMP这个⼩众软件,起初我并没有在意,就⽤的bowtie2⽽已,参考基因组就⽤了miRBase数据库下载的⼈类的参考序列。
## step5 : alignment to miRBase v21 (hairpin.human.fa/mature.human.fa )#### step5.1 using bowtie2 to do alignmentmkdir  bowtie2_index &&  cd bowtie2_index~/biosoft/bowtie/bowtie2-
2.2.9/bowtie2-build ../hairpin.human.fa hairpin_human~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ../mature.human.fa  mature_humanls *_ | while read id ; do  ~/biosoft/bowtie/bowtie2-
2.2.9/bowtie2 -x miRBase/bowtie2_index/hairpin_human -U $id  -S ${id%%.*}.hairpin.sam ; done## overall alignment rate:  10.20% / 5.71%/ 10.18%/ 4.36% / 10.02% / 4.95%  (before convert U to T )## overall alignment rate:  51.77% / 70.38%/51.45% /61.14%/ 52.20% / 65.85% (after convert U to T )ls *_ | while read id ; do  ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -x
miRBase/bowtie2_index/mature_human  -U $id  -S ${id%%.*}.mature.sam ; done## overall alignment rate:  6.67% / 3.78% / 6.70% / 2.80%/ 6.55% / 3.23%    (before convert U to T )## overall alignment rate:  34.94% / 46.16%/ 35.00%/ 38.50% / 35.46% /42.41%(after convert U to T )#### step5.2 using SHRiMP to do alignment##    o.edu/shrimp/README##    3.5 Mapping cDNA reads against a miRNA databasecd ~/biosoft/SHRiMP/SHRiMP_2_2_3export SHRIMP_FOLDER=$PWDcd -##  We project the database with:$SHRIMP_FOLDER/utils/project-db.py --seed 00111111001111111100,00111111110011111100,00111111111100111100,00111111111111001100,00111111111111110000 \ --h-flag --shrimp-mode ls
miRBase/hairpin.human.fa##$SHRIMP_FOLDER/bin/gmapper-ls -L  hairpin.human-ls SRR1542716.fastq  --qv-offset 33  \-o 1 -H -E -a -1 -q -30 -g -30 --qv-offset 33 --strata -N 8  >map.out 2>map.log
因为⼀个前体miRNA可以形成多个成熟的miRNA,⽽并不是所有的成熟的miRNA形式都被记录在数据
⼤家可以看到我们把测序reads⽐对到前体miRNA和成熟的miRNA结果是有略微区别的,因为⼀个前体
库,所以⼀般推荐⽐对到前体miRNA数据库,这样还可以预测新的成熟miRNA,也是⾮常有意义的。
另外⾮常重要的⼀点是,把U变成T前后⽐对率差异⾮常⼤,这其实是⼀个⾮常蠢的错误,我就不多说了。但是做到这⼀步,其实可以跟⽂章来做验证,⽂章有提到⽐对率,⽐对的序列。
我也是在博客⾥⾯看到这个信息的:
Thank you so much!. Yes I contacted the lab-guy and he just said that trimmed the first 4 bp and last 4bp. ( as you found) So I firstly trimmed the adapter
sequences(TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC) And then, trimmed the first 4bp and last 4bp from reads, which leads to the 22bp peak of read-length
distribution(instead of 24bp) Anyhow, I tried to map with bowtie2 again.
> bowtie2 --local -N 1 -L 16> -x ../miRNA_reference/hairpin_UtoT.fa> -U first4bptrimmed_A1-SmallRNA_S1_L001_R1_001_Illuminaadpatertrim.fastq> -S f4_trimmed.sam
I also changed hairpin.fa file (U to T) Oh.. thank you David, Finallly, I got
>  2565353 reads; of these:>  2565353 (100.00%) were unpaired; of these:>  479292 (18.68%) aligned 0 times>  11959 (0.47%) aligned exactly 1 time>  2074102 (80.85%) aligned >1 times>  81.32% overall alignment rate
第五讲:获取miRNA表达量
得到⽐对后的sam/bam⽂件只能算是level2的数据,⼀般我们给他⼈share的结果也是直接给表达矩阵的, miRNA分析跟mRNA分析类似,但是它的表达矩阵更好获取⼀点。
如果是mRNA,我们⼀般会跟基因组来⽐较,⽽基因组就是24条参考染⾊体,想知道具体⽐对到了哪个基因,需要根据基因组注释⽂件来写程序提取表达量信息,现在⽐较流⾏的是htseq这个软件,我前⾯也写过教程如何安装和使⽤,这⾥就不啰嗦了。但是对于miRNA,因为我⽐对的就是那1881条前体miRNA序列,所以直接分析⽐对的sam/bam⽂件就可以知道每条参考miRNA序列的表达量了。## step6: counts the reads which mapping to each miRNA reference.## we need to exclude unmapped as well as multiple-mapped  reads## XS:i:<n> Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode## NM:i:1  ## NM i Edit distance to the reference, including ambiguous bases but excluding clipping#The following command exclude unmapped (-F 4) a
s well as multiple-mapped (grep -v “XS:”) reads#samtools view -F 4 input.bam | grep -v "XS:" | wc -l## 180466//1520320##cat >count.hairpin.shls *hairpin.sam  | while read iddosamtools view  -SF 4 $id |perl -alne '{$h{$F[2]}++}END{print "$_\t$h{$_}" foreach sort keys %h }'  > ${id%%_*}.untsdone## bash count.hairpin.sh##cat >count.mature.shls *mature.sam  | while read iddosamtools view  -SF 4 $id |perl -alne '{$h{$F[2]}++}END{print "$_\t$h{$_}" foreach sort keys %h }'  > ${id%%_*}.untsdone## bash count.mature.sh
上⾯的代码,是我⾃⼰写的脚本来算表达量,⾮常简单,因为我没有考虑细节,直接想得到各个样本测序数据的表达量⽽已。如果是⽐对到了参考基因组,就要根据miRNA的gff注释⽂件⽤htseq等软件来计算表达量。
得到了表达量,就可以跟⽂献来做⽐较
### step7: compare the results with paper'sGSM1470353: control-CM, experiment1; Homo sapiens; miRNA-Seq  SRR1542714GSM1470354: ET1-CM, experiment1; Homo sapiens; miRNA-Seq
SRR1542715GSM1470355: control-CM, experiment2; Homo sapiens; miRNA-SeqSRR1542716GSM1470356: ET1-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542717GSM1470357: control-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542718GSM
1470358: ET1-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542719### 下⾯我⽤R语⾔来检验⼀下,我得到的分析结果跟⽂章发表的结果的区别。
a=read.table("bowtie_bam/unts")b=read.table("paper_results/GSM1470353_iPS_010313_Unstim_known_")plot(log(tmp[,2]),log(tmp[,3]))cor(tmp[,2],tmp[,3])## [1] 0.8413439
表达量相关性还不错,⾄少说明我跟作者的分析结果⾮常⼀致,应该是没有分析错。
这个代码是我⾃⼰根据⽂章的理解写出的,因为我本⾝不擅长miRNA数据分析,所以在进⾏alignment的时候参数选择可能并不是那么友好。
第六讲:miRNA表达量差异分析
这⼀讲是miRNA-seq数据分析的分⽔岭,前⾯的5讲说的是读⽂献下载数据⽐对,然后计算表达量,属于常规的流程分析,⼀般在公司测序之后都可以拿到分析结果,或者⽂献也会给出下载结果。
但是单纯的分析⼀个样本意义不⼤,⼀般来说,我们做研究都是针对于不同状态下的miRNA表达量差异分析,然后做注释,功能分析,⽹络分析,这才是重点和难点。
我这⾥就直接拿⽂献处理好的miRNA表达量来展⽰如何做下游分析,⾸先就是差异分析。
根据⽂献,我们可以知道样本的分类情况是
GSM1470353: control-CM, experiment1; Homo sapiens; miRNA-Seq SRR1542714
GSM1470354: ET1-CM, experiment1; Homo sapiens; miRNA-Seq SRR1542715
GSM1470355: control-CM, experiment2; Homo sapiens; miRNA-SeqSRR1542716
GSM1470356: ET1-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542717
GSM1470357: control-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542718
GSM1470358: ET1-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542719
可以看到是6个样本的测序数据,分成两组,就是ET1刺激了CM细胞系前后对⽐⽽已!
同时,我们也拿到了这6个样本的表达矩阵,计量单位是counts的reads数,所以我们⼀般会选⽤DESeq2,edgeR这样的常⽤包来做差异分析,当然,做差异分析的⼯具还有⼗⼏个,我这⾥只是拿⼀根最顺⼿的举例⼦,就是DESeq2。
下⾯的代码有点长,因为我在bioconductor系列教程⾥⾯多次提到了DESeq2使⽤⽅法,这⾥就只贴出代
码,反正我要说的重点是,我们通过差异分析得到了差异miRNA列表
### step8: differential expression analysis by R package for miRNA expression patterns:## ⽂章⾥⾯提到的结果是:MicroRNA sequencing revealed over 250 known and 34 predicted novel miRNAs to be differentially expressed between ET-1 stimulated and unstimulated control hiPSC-CMs.## (FDR < 0.1 and 1.5 fold change)rm(list=ls())setwd('J:\\miRNA_test\\paper_results')  ##把从GEO⾥⾯下载的⽂献结果放在这⾥sampleIDs=c()groupList=c()allFiles=list.files(pattern = '.txt')i=allFiles[1]sampleID=strsplit(i,"_")[[1]][1]treat=strsplit(i,"_")[[1]][4]dat=read.table(i,stringsAsFactors =
F)colnames(dat)=c('miRNA',sampleID)groupList=c(groupList,treat)for (i in allFiles[-1]){sampleID=strsplit(i,"_")[[1]][1]treat=strsplit(i,"_")[[1]][4]a=read.table(i,stringsAsFactors =
F)colnames(a)=c('miRNA',sampleID)dat=merge(dat,a,by='miRNA')groupList=c(groupList,treat)}### 上⾯的代码只是为了把6个独⽴的表达⽂件给合并成⼀个表达矩阵## we need to filter the low expression level miRNAexprSet=dat[,-1]rownames(exprSet)=dat[,1]suppressMessages(library(DESeq2))exprSet=ceiling(exprSet)(colData <- data.frame(row.names=colnames(exprSet),
groupList=groupList))## DESeq2就是这么简单的⽤dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,design = ~ groupList)dds <- DESeq(dds)png("qc_dispersions.png", 1000, 1000, pointsize=20)plotDispEsts(dds, main="Dispersion plot")dev.off()res <- results(dds)## 画⼀些图,相当于做QC吧png("RAWvsNORM.png")rld <-
rlogTransformation(dds)exprSet_new=assay(rld)par(cex = 0.7)n.sample=ncol(exprSet)if(n.sample>40) par(cex = 0.5)cols <- rainbow(n.sample*1.2)par(mfrow=c(2,2))boxplot(exprSet,  col =
cols,main="expression value",las=2)boxplot(exprSet_new, col = cols,main="expression value",las=2)hist(exprSet[,1])hist(exprSet_new[,1])dev.off()library(RColorBrewer)(mycols <- brewer.pal(8, "Dark2")[1:length(unique(groupList))])# Sample distance heatmapsampleDists <- as.matrix(dist(t(exprSet_new)))#install.packages("gplots",repos = "cran.")library(gplots)png("qc-heatmap-samples.png", w=1000, h=1000, pointsize=20)heatmap.2(as.matrix(sampleDists), key=F, trace="none",col=colorpanel(100, "black", "white"),ColSideColors=mycols[groupList], RowSideColors=mycols[groupList],margin=c(10, 10), main="Sample Distance Matrix")dev.off()png("MA.png")DESeq2::plotMA(res, main="DESeq2", ylim=c(-2,2))dev.off()## 重点就是这⾥啦,得到了差异分析的结果resOrdered <- res[order(res$padj),]resOrdered=as.data.frame(resOrdered)write.csv(resOrdered,"sults.csv",quote
= F)##下⾯也是⼀些图,主要是看看样本之间的差异情况library(limma)plotMDS(log(counts(dds, normalized=TRUE) + 1))plotMDS(log(counts(dds, normalized=TRUE) + 1) - log(t( t(assays(dds)[["mu"]]) / sizeFactors(dds) ) + 1))plotMDS( assays(dds)[["counts"]] )  ## raw countplotMDS( assays(dds)[["mu"]] ) ##- fitted values.
最后我们得到的差异分析结果:sults.csv,就可以根据FDR和fold change来挑选符合要求的差异miRNA。
第七讲:miRNA样本配对mRNA表达量获取
这⼀讲其实算不上⾃学miRNA-seq分析,本质是affymetrix的mRNA表达芯⽚数据分析,⽽且还是最常⽤的那种GPL570 HG-U133Plus2,但因为是跟miRNA样本配对检测⽽且后⾯会利⽤到这两个数据分析结果来做共表达⽹络分析等等,所以就贴出对该芯⽚数据的分析结果。
⽂章⾥⾯也提到了 Messenger RNA expression analysis identified 731 probe sets with significant differential expression,作者挑选的差异分析结果的显著基因列表如下
/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0108051.s002 mRNA expression array - GSE60291 (Affymetrix Human Genome U133 Plus 2.0 Array)
hgu133plus2 芯⽚数据很常见,可以从GEO⾥⾯下载该study的原始测序数据,然后⽤affy,limma包来分析,也可以直接⽤GEOquery包来下载作者分析好的表达矩阵,然后直接做差异分析。我这⾥选择的是后者,⽽且我跟作者分析⽅法有⼀点区别是,我先把探针都注释好了基因,然后只挑最⼤表达量的基因。⽽作者是直接对探针为单位的的表达矩阵进⾏差异分析,对分析结果⾥⾯的探针进⾏基因注释。我这⾥⽆法给出哪种⽅法好的绝对评价。代码如下
m(list=ls())library(GEOquery)library(limma)GSE60291 <- getGEO('GSE60291', destdir=".",getGPL = F)#下⾯是表达矩阵exprSet=exprs(GSE60291[[1]])library("annotate")GSE60291[[1]]## 下⾯是分组信息pdata=pData(GSE60291[[1]])treatment=factor(unlist(lapply(pdata$title,function(x) strsplit(as.character(x),"-")[[1]][1])))#treatment=relevel(treatment,'control')## 下⾯做基因注释
platformDB='hgu133plus2.db'library(platformDB, ly=TRUE)probeset <- featureNames(GSE60291[[1]])#EGID <- as.numeric(lookUp(probeset, platformDB, "ENTREZID"))SYMBOL <-
lookUp(probeset, platformDB, "SYMBOL")## 下⾯对每个基因挑选最⼤表达量探针a=cbind(SYMBOL,exprSet)## remove the duplicated probesetrmDupID <-
function(a=matrix(c(1,1:5,2,2:6,2,3:7),ncol=6)){exprSet=a[,-1]rowMeans=apply(exprSet,1,function(x)
mean(as.numeric(x),na.rm=T))a=a[order(rowMeans,decreasing=T),]exprSet=a[!duplicated(a[,1]),]#exprSet=exprSet[!is.na(exprSet[,1]),]rownames(exprSet)=exprSet[,1]exprSet=exprSet[,-
1]return(exprSet)}exprSet=rmDupID(a)rn=rownames(exprSet)exprSet=apply(exprSet,2,as.numeric)rownames(exprSet)=rnexprSet[1:4,1:4]#exprSet=log(exprSet) ## based on
eboxplot(exprSet,las=2)## 下⾯⽤limma包来进⾏芯⽚数据差异分析design=model.matrix(~
treatment)fit=lmFit(exprSet,design)fit=eBayes(fit)#vennDiagram(decideTests(fit))DEG=topTable(fit,coef=2,n=Inf,adjust='BH')dim(DEG[abs(DEG[,1])>1.2 & DEG[,5]<0.05,])  ## 806
geneswrite.csv(DEG,"ET1-normal.DEG.csv")
得到的ET1-normal.DEG.csv ⽂件就是我们的差异分析结果,可以跟⽂章提供的差异结果做⽐较,⼏乎⼀模⼀样。
如果根据logFC:1.2 和pValue:0.05来挑选,可以拿到806个基因。
第⼋讲:miRNA-mRNA表达相关下游分析
通过前⾯的分析,我们已经量化了ET1刺激前后的细胞的miRNA和mRNA表达⽔平,也通过成熟的统计
学分析分别得到了差异miRNA和mRNA,这时候我们就需要换⼀个参考⽂献了,因为前⾯提到的那篇⽂章分析的不够细致,我这⾥选择了浙江⼤学的⼀篇TCGA数据挖掘分析⽂章:
Identifying miRNA/mRNA negative regulation pairs in colorectal cancer
⾥⾯⾸先查miRNA-mRNA基因对,因为miRNA主要还是负向调控mRNA表达,所以根据我们得到的两个表达矩阵做相关性分析,很容易得到符合统计学意义的miRNA-mRNA基因对,具体分析内容如下
把得到的差异miRNA的表达量画⼀个热图,看看它是否能显著的分类
⽤miRWalk2.0等数据库或者根据来获取这些差异miRNA的validated target genes
然后看看这些pairs of miRNA- target genes的表达量相关系数,选取显著正相关或者负相关的pairs
富集分析
这些被选取的pairs of miRNA- target genes拿去做富集分析
最后这些pairs of miRNA- target genes做PPI⽹络分析