主流的motif数据库
dbcorrdb - SCENIC使用的
motif格式问题
我关注的这个motif (ENCSR000ARI) 是Ezh2的,并没有被收录在常规的TF数据库里,所以Homer里面没有。
dbcorrdb这个数据库里有,但是很老的JASPAR格式,可以转成meme格式。[dbcorrdb__EZH2__ENCSR000ARI_1__m5.png]
jaspar2meme -pfm -bundle ENCSR000ARI.all.sites > ENCSR000ARI.all.meme
可以用meme的Ceqlogo画出logo
ceqlogo -i1 ENCSR000ARI.m5.meme -o logo.png -f PNG
然后meme的FIMO可以直接把motif比对到指定的fasta
fimo --alpha 1 --max-strand -oc target ENCSR000ARI.all.meme target.promt.TSS.fasta fimo --alpha 1 --max-strand -oc random ENCSR000ARI.all.meme target.promt.TSS.random.fasta
这里就有个问题了,怎么从genome中提取出核心区域
因为promoter的定义不是很完善,这里我用的区域是TSS区 [-400, 100]
其实Homer里面都写好了,但因为里面的perl过于复杂,最后还是决定用R来做(核心就是防止位置溢出)
library(GenomicFeatures) # library(TxDb.Hsapiens.UCSC.hg19.knownGene) # library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Mmusculus.UCSC.mm10.knownGene) library(BSgenome.Mmusculus.UCSC.mm10) # PR <- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, upstream=2000, downstream=0) # entrez geneID for a cell cycle control transcription # factor, chr6 on the plus strand # tmp.gene <- "429" # transcriptCoordsByGene.GRangesList <- transcriptsBy (TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene") [tmp.gene] # a GrangesList of length one, describing three transcripts # promoter.seqs <- getPromoterSeq (head(PR), Hsapiens, upstream=10, downstream=0) trs <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene") # trs <- keepStandardChromosomes(trs) prom <- getPromoterSeq(trs, Mmusculus, upstream = 500, downstream=100) head(prom)
gene ID转换问题
这里的list的name就是NCBI的genename,即Entrez ID(长见识了,同一个基因在不同物种中的entrez ID是不同的,所以你找的genecard上肯定是跟mouse的对不上)
以下是常见的geneID转换的脚本,ensemble ID转Entrez ID
mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl")) att <- listAttributes(mart) transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", "entrezgene_id", "ensembl_gene_id","ensembl_transcript_id"), filters = "biotype", values = c("protein_coding"), mart = mart)
tmp.names <- select(TxDb.Mmusculus.UCSC.mm10.knownGene, keys = names(prom), columns=c("TXNAME"), keytype="GENEID") tmp.names$transcriptName <- sapply(strsplit(tmp.names$TXNAME,"\\."), function(x) x[1]) tmp.names$entrezID <- transcripts[tmp.names$transcriptName,]$entrezgene_id
最终画出了这个图:
成图代码
options(repr.plot.width=6, repr.plot.height=6) g <- ggplot(motif.bind, aes(x=pos, y=-log10(p.value))) + facet_wrap( ~ gene, ncol=1) + geom_point() + geom_vline(xintercept=0, linetype="dashed", color = "blue") + geom_linerange(aes(x=pos, ymax=-log10(p.value), ymin=0), position = position_jitter(height = 0L, seed = 1L), size=0.1) + theme_bw() + theme(strip.background = element_blank(), strip.text = element_text(face="italic", size = 15, color = "black")) + theme(axis.text.x = element_text(face="plain", angle=0, size = 8, color = "black", vjust=-1), axis.text.y = element_text(face="plain", size = 10, color = "black"), axis.title =element_text(size = 15)) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank()) + theme(legend.title = element_blank()) + labs(x = "\nRelative distance to TSS (bp)",y = "- log10(motif binding P-value)\n", title = " ") + scale_y_continuous(limits = c(0, 7), expand = c(0, 0)) #+ #scale_fill_manual(values = brewer.pal(8,"Set1")) g
其实要是Homer有对应的TF motif数据,可以做得更加高效。不用提fasta,不用id转换,一行代码搞定。
findMotifs.pl genelist.txt mouse motifResults/ -start -400 -end 100 -len 8,10,12 -find test.motif > result.txt
展望:
这里只考虑了TSS flanking region
有ChIP-seq和ATAC-seq peak的数据可以confirm这个denovo的结果
再加入Hi-C等三维基因组的数据就可以整合promoter和enhancer了。