根据motif binding来确定target gene | HOMER | FIMO | MEME

主流的motif数据库

JASPAR

dbcorrdb - SCENIC使用的

  

motif格式问题

我关注的这个motif (ENCSR000ARI) 是Ezh2的,并没有被收录在常规的TF数据库里,所以Homer里面没有。

dbcorrdb这个数据库里有,但是很老的JASPAR格式,可以转成meme格式。[dbcorrdb__EZH2__ENCSR000ARI_1__m5.png]

jaspar2meme -pfm -bundle ENCSR000ARI.all.sites > ENCSR000ARI.all.meme

 

可以用meme的Ceqlogo画出logo

ceqlogo -i1 ENCSR000ARI.m5.meme -o logo.png -f PNG

  

然后meme的FIMO可以直接把motif比对到指定的fasta

fimo --alpha 1 --max-strand -oc target ENCSR000ARI.all.meme target.promt.TSS.fasta
fimo --alpha 1 --max-strand -oc random ENCSR000ARI.all.meme target.promt.TSS.random.fasta

  

这里就有个问题了,怎么从genome中提取出核心区域
因为promoter的定义不是很完善,这里我用的区域是TSS区 [-400, 100]

其实Homer里面都写好了,但因为里面的perl过于复杂,最后还是决定用R来做(核心就是防止位置溢出)

library(GenomicFeatures)

# library(TxDb.Hsapiens.UCSC.hg19.knownGene)
# library(BSgenome.Hsapiens.UCSC.hg19)

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(BSgenome.Mmusculus.UCSC.mm10)

# PR <- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, upstream=2000, downstream=0)

# entrez geneID for a cell cycle control transcription
# factor, chr6 on the plus strand
# tmp.gene <- "429"  

# transcriptCoordsByGene.GRangesList <- transcriptsBy (TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene") [tmp.gene]
# a GrangesList of length one, describing three transcripts

# promoter.seqs <- getPromoterSeq (head(PR), Hsapiens, upstream=10, downstream=0)

trs <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")

# trs <- keepStandardChromosomes(trs)

prom <- getPromoterSeq(trs, Mmusculus, upstream = 500, downstream=100)

head(prom)

  

gene ID转换问题

这里的list的name就是NCBI的genename,即Entrez ID(长见识了,同一个基因在不同物种中的entrez ID是不同的,所以你找的genecard上肯定是跟mouse的对不上)

以下是常见的geneID转换的脚本,ensemble ID转Entrez ID

mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", 
                                  "entrezgene_id", "ensembl_gene_id","ensembl_transcript_id"),
                      filters = "biotype",
                      values  = c("protein_coding"),
                      mart    = mart)

  

tmp.names <- select(TxDb.Mmusculus.UCSC.mm10.knownGene, keys = names(prom), 
                    columns=c("TXNAME"), keytype="GENEID")

tmp.names$transcriptName <- sapply(strsplit(tmp.names$TXNAME,"\\."), function(x) x[1])

tmp.names$entrezID <- transcripts[tmp.names$transcriptName,]$entrezgene_id

  

最终画出了这个图:

根据motif binding来确定target gene | HOMER | FIMO | MEME

成图代码

options(repr.plot.width=6, repr.plot.height=6)
g <- ggplot(motif.bind, aes(x=pos, y=-log10(p.value))) + 
    facet_wrap( ~ gene, ncol=1) +
    geom_point() +
    geom_vline(xintercept=0, linetype="dashed", color = "blue") +
    geom_linerange(aes(x=pos, ymax=-log10(p.value), ymin=0),
                 position = position_jitter(height = 0L, seed = 1L), size=0.1) +
    theme_bw() +
    theme(strip.background = element_blank(), strip.text = element_text(face="italic", size = 15, color = "black")) +
    theme(axis.text.x  = element_text(face="plain", angle=0, size = 8, color = "black", vjust=-1),
            axis.text.y  = element_text(face="plain", size = 10, color = "black"),
            axis.title =element_text(size = 15)) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank()) +
    theme(legend.title = element_blank()) +
    labs(x = "\nRelative distance to TSS (bp)",y = "- log10(motif binding P-value)\n", title = " ") +
    scale_y_continuous(limits = c(0, 7), expand = c(0, 0)) #+
    #scale_fill_manual(values = brewer.pal(8,"Set1")) 
g

  

其实要是Homer有对应的TF motif数据,可以做得更加高效。不用提fasta,不用id转换,一行代码搞定。

findMotifs.pl  genelist.txt mouse motifResults/ -start -400 -end 100 -len 8,10,12 -find test.motif > result.txt

  

 

展望:

这里只考虑了TSS flanking region

有ChIP-seq和ATAC-seq peak的数据可以confirm这个denovo的结果

再加入Hi-C等三维基因组的数据就可以整合promoter和enhancer了。  

 

上一篇:操作系统-进程的初步实现


下一篇:技术支持垃圾邮件使用iframe“冻结”浏览器