可变剪切的预测已经很流行了,目前主要有两个流派:
- 用DNA序列以及variant来预测可变剪切;GeneSplicer、MaxEntScan、dbscSNV、S-CAP、MMSplice、clinVar、spliceAI
- 用RNA来预测可变剪切;MISO、rMATS、DARTS
前言废话
科研圈的热点扎堆现象是永远存在的,且一波接一波,大部分不屑于追热点且不出成果也基本都被圈子给淘汰了。
做纯方法开发的其实是很心累的,费时费力费脑,特别是自己的研究领域已经过时的时候,另外还得承受外行的歧视:“你们搞这个有什么用吗?文章也发不了,最后也没人用。”
最近这些年最大的一个热点就是“单细胞”,很多人都趁着这股东风捞了一些文章,最早一批开发方法的也发了不少nature method和NBT,bioinformatics和NAR更多。但大部分后面就销声匿迹了,因为门槛越来越低,进入者越来越多,经过几年的发展,现在已经成了三足鼎立之势,强者愈强,弱者退场。
写方法类的文章也有个潜规则,千万不要写得过于通俗易懂,大部分审稿人如果一眼就能看懂,就会自然认为你做的研究过于简单,没有发表的必要。最好要写得有理有据,且90%的审稿人没法一眼看懂,但细细琢磨后有那么点意思。哈哈,当笑话听就好。
跳到另外一篇用深度学习来预测可变剪切的。
Deep-learning augmented RNA-seq analysis of transcript splicing
文章里面需要重点了解的基础知识:
Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples7–10,看看前人是如何根据cis sequence特征来预测exon的剪切模式的
涉及到的文献:
The human splicing code reveals new insights into the genetic determinants of disease - 2015
Deciphering the splicing code - 2010
Deep learning of the tissue-regulated splicing code - 2014
BRIE: transcriptome-wide splicing quantification in single cells - 2017
哈哈,深度学习在可变剪切上的应用的风2014年就开始刮了,你不可能是第一个吃螃蟹的了。
想了解什么是AS,可以直接看现在开发的工具,里面肯定有图详细介绍,同时介绍其算法,一图胜千言。
MISO (Mixture of Isoforms) software documentation 目前只支持python2版本,用conda的话还需要从文档中copy一下miso_settings.txt文件。
生物和信息之间存在一个巨大的gap,优秀的人能很快察觉到这个gap,并填补这个gap。
问题:
为什么AS的鉴定依赖测序深度?得了解现在主流的AS检测算法
怎么理解样本间的差异可变剪切这个概念?
如何理解cis sequence features,这个文件里都包含了哪些信息?
怎么predict exon-inclusion/skipping levels in bulk tissues or single cells
怎么理解we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing.
两部分:
a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features
a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing
During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.
During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.
This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deeplearning-augmented splicing analysis.
差不多懂了,第一个BHT就是常规差异剪切分析工具(如MISO 和 rMATS)的升级版,用于制造有lable的训练数据。BHT的结果用于训练DNN模型;新的数据可以放进DNN模型里,得到的结果可以作为后面贝叶斯模型的先验,我们的RNA-seq数据则是用于更新先验形成后验,如果先验足够准确,则更新时对数据的依赖不搞,这也就是为什么该方法可以弥补RNA-seq测序深度不足的情形。
To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.
cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions
这个DNN把可变剪切转换成了一个regression的问题,特征就是上面两种,因为它们决定了最终的一个特征是否发生了可变剪切。
最终用到的特征:2,926 cis sequence features and 1,498 annotated RBPs
DNN用到的训练数据具体是什么?
large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium
We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons
The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN
From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). 这样就把每个exon给的特征给提取出来了,lable也有了,就可以用于训练了。
比较了三个模型:
We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.
关于贝叶斯模型的部分:
incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).
Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
如果文章看得迷迷糊糊的,就直接跑代码吧!
第一个功能BHT:
Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/
test_norep_data.txt文件是这样的:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen
82439 ENSG00000169045.17_1 HNRNPH1 chr5 - 179046269 179046408 179045145 179045324 179047892 179048036 82439 15236 319 6774 834 180 90
21374 ENSG00000131876.16_3 SNRPA1 chr15 - 101826418 101826498 101825930 101826006 101827112 101827215 21374 4105 118 292 54 169 90
32815 ENSG00000141027.20_3 NCOR1 chr17 - 15990485 15990659 15989712 15989756 15995176 15995232 32815 624 564 549 1261 180 90
43143 ENSG00000133731.9_2 IMPA1 chr8 - 82597997 82598198 82593732 82593819 82598486 82598518 43143 155 332 22 341 180 90
111671 ENSG00000100320.22_3 RBFOX2 chr22 - 36232366 36232486 36205826 36206051 36236238 36236460 111671 93 193 35 534 180 90
每一行是一个基因,无冗余,然后就是一些属性.
跑出来的结果是这样的:
1 ID I1 S1 I2 S2 inc_len skp_len psi1 psi2 delta.mle post_pr
2 1225 160 0 169 6 180 90 1 0.934 -0.0663 0.4367
3 15829 52 58 12 41 180 90 0.31 0.128 -0.1819 0.8867
4 20347 1084 930 371 615 180 90 0.368 0.232 -0.1365 1
5 21374 4105 118 292 54 169 90 0.949 0.742 -0.2065 1
6 24817 177 275 263 741 143 90 0.288 0.183 -0.1057 0.974
7 32815 624 564 549 1261 180 90 0.356 0.179 -0.1774 1
8 43143 155 332 22 341 180 90 0.189 0.031 -0.158 1
9 46548 1685 4040 216 1752 180 90 0.173 0.058 -0.1145 1
每一行是对之前条目的预测。
第二个功能DNN:
下载model
Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS
预测
Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS
其中的第一个文件是Input feature file (*.h5) or Darts_BHT output (*.txt)
ID I1 S1 I2 S2 inc_len skp_len mu.mle delta.mle post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685 581 0 462 0 155 99 1 0 0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815 28 0 49 2 126 99 1 -0.0493827160493827 0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340 2 37 0 81 119 99 0.0430341230167355
-0.0430341230167355 0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901 11 2 49 4 176 99 0.755725190839695 0.117542135892979 0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005 80 750 64 738 133 99 0.0735580941766509 -0.0129207126090368 0
第二个文件是Kallisto expression files
thymus adipose
RPS11 2678.83013 2531.887535
ERAL1 14.350975 13.709394
DDX27 18.2573 14.02368
DEK 32.463558 14.520312
PSMA6 102.332592 77.089475
TRIM56 4.519675 6.14762566667
TRIM71 0.082009 0.0153936666667
UPF2 7.150812 5.23628033333
FARS2 6.332831 7.291382
ALKBH8 3.056208 1.27043633333
ZNF579 5.13265 8.248575
结果文件,第一列是ID,第二列是真实的标签,第三列是预测的标签:
ID Y_true Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437 1.000000 0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620 1.000000 0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140 0.947333 0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993 1.000000 0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213 0.937333 0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750 1.000000 0.450762
参考:
The Expanding Landscape of Alternative Splicing Variation in Human Populations.
这篇是比较纯粹的DL应用:
Gene expression inference with deep learning | 基于深度学习的基因表达推测
案例文章:Gene expression inference with deep learning
uci-cbcl/D-GEX - github
深度学习的风已经过了几年了,目前在医疗影像处理上已经公认非常有效,所以后面想发文章必须数据足够大足够靓,方法上想创新太难。
LINCS L1000 data
核心的意思就是这个项目只测了不到一千个基因的表达,却要通过LR和DL来推测出其他全部的3万个基因的表达,所以称那978个基因叫landmark genes。