上一期代码实现1中,我们利用kermer的思路,通过条件语句得到两个数据集,一个集合为包含单一毒株所有基因序列的20bp片段,另一个集合为同时出现同一株所有(不同基因组)的基因组中,本期内容我们将得到能满足基本要求候选引物。
一、计算每个毒株kermer出现的百分比
上一期我们的想法比较理想化,使用“全”和“无”的思路去解决问题,事实上,我初步的结果无法满足我的要求。因此今天的文章我们做了更正,首先我们统计相应毒株kermer的序列在总kermer中出现的比例。
代码如下:
f = open("Alpha_kermer.txt","a+")
f.write("kermer" + "\t" + "genome" + "\t" + "used_genome" + "\t" + "kermer_number" + "\t" + "frequency" + "\t" + "freq_percent" + "\n")
k_number = 0
for key,value in bb:
f.write(key + "\t" + str(i) + "\t" + str(ii) + "\t" + str(len(bb)) + "\t" + str(value) + "\t" + str('%.2f'%(value * 100 / ii)) + "%" + "\n")
f.close()
结果如下展示,从左到右分别是kermer序列,基因组数量,本次使用的基因组数量,kermer的总数,该kermer出现的次数,以及kermer出现的比例。
上图展示了Alpha毒株kermer的部分数据,其他毒株的结果使用查找替换上面的“Alpha”为相应的毒株名称即可。
二、计算单一毒株kermer在其他毒株kermer中出现的比例
f2 = open("Alpha_kermer.txt","r")
f4 = open("Beta_kermer.txt","r")
f6 = open("Gamma_kermer.txt","r")
f8 = open("Delta_kermer.txt","r")
Alpha_kermer = []
i = 0
for line in f2.readlines():
i += 1
if i > 1:
Alpha_kermer.append(line.replace("\n",""))
f2.close()
Beta_kermer = []
i = 0
for line in f4.readlines():
i += 1
if i > 1:
Beta_kermer.append(line.replace("\n",""))
f4.close()
Gamma_kermer = []
i = 0
for line in f6.readlines():
i += 1
if i > 1:
Gamma_kermer.append(line.replace("\n",""))
f6.close()
Delta_kermer = []
i = 0
for line in f8.readlines():
i += 1
if i > 1:
Delta_kermer.append(line.replace("\n",""))
f8.close()
f9 = open("Alpha_in_Beta_Gamma_Delta.txt","a+")
f9.write("A_kermer" + "\t" + "per_A_kermer" + "\t" + "per_B_kermer" + "\t" + "per_G_kermer" + "\t" + "per_D_kermer" + "\n")
for i in range(len(Alpha_kermer) - 1):
Alpha = Alpha_kermer[i].split("\t")
m = 0
for ii in range(len(Beta_kermer) - 1):
Beta = Beta_kermer[ii].split("\t")
if Alpha[0] == Beta[0]:
f9.write(Alpha[0] + "\t" + Alpha[5] + "\t" + Beta[5] + "\t")
else:
m += 1
if m == (len(Beta_kermer) - 1):
f9.write(Alpha[0] + "\t" + Alpha[5] + "\t" + "0.00%" + "\t")
m = 0
for ii in range(len(Gamma_kermer) - 1):
Gamma = Gamma_kermer[ii].split("\t")
if Alpha[0] == Gamma[0]:
f9.write(Gamma[5] + "\t")
else:
m += 1
if m == (len(Gamma_kermer) - 1):
f9.write("0.00%" + "\t")
m = 0
for ii in range(len(Delta_kermer) - 1):
Delta = Delta_kermer[ii].split("\t")
if Alpha[0] == Delta[0]:
f9.write(Delta[5] + "\n")
else:
m += 1
if m == (len(Delta_kermer) - 1):
f9.write("0.00%" + "\n")
f9.close()
结果如下图所示,从左到右分别是Alpha的kermer序列,以及其kermer在Alpha、Beta、Gamma、Delta中出现的比例。
上图展示了Alpha毒株kermer的部分数据,其他毒株的结果方法类似。
三、筛选高特异性的kermer
筛选方法为高频率出现在特定毒株而低频率出现在其他毒株的kermer。
f9 = open("Alpha_in_Beta_Gamma_Delta.txt","r")
Alpha = []
i = 0
for line in f9.readlines():
i += 1
if i > 1:
Alpha.append(line.replace("\n",""))
f9.close()
f9_1 = open("Alpha_little_in_Beta_Gamma_Delta.txt","a+")
f9_1.write("A_kermer" + "\t" + "per_A_kermer" + "\t" + "per_B_kermer" + "\t" + "per_G_kermer" + "\t" + "per_D_kermer" + "\n")
for i in range(len(Alpha) - 1):
Alpha_little = Alpha[i].split("\t")
if (float(Alpha_little[1].replace("%","")) > 95) & (float(Alpha_little[2].replace("%","")) < 5) & (float(Alpha_little[3].replace("%","")) < 5) & (float(Alpha_little[4].replace("%","")) < 5):
f9_1.write(Alpha[i] + "\n")
f9_1.close()
结果如下图所示:
四、筛选候选引物
以每个毒株的第一条序列作为索引的模板,将其分成20bp的kermer片段纳入数组中,那么其中第一个元素索引为0,第二个元素索引为1,第n个元素索引为n-1,那么,引物所扩增的产物大小应该在75-250bp之间,所以Forward Primer和Reverse Primer的索引数值的关系为:75 < Fm - Rn + 20 < 250,其中m、n分别为前引物和后引物索引值。
引物筛选另外一个要考虑的是序列的Tm值,PCR扩增时的退火温度跟Tm值息息相关,Tm值叫做熔解温度(melting temperature,Tm)是指吸光值增加到最大值的一半时的温度称为DNA的解链温度或熔点。通俗讲Tm值是DNA的熔解温度,是DNA变性过程中紫外吸收达到最大值一半时的温度。不同序列的DNA,Tm值不同,DNA中C-G含量越高,Tm值就越大,当核苷酸长度小于等于20bp时,其计算公式为Tm = 4℃(G + C)+ 2℃(A + T),公式中A、G、C、T表示碱基的数量,一般qpcr扩增程序如下所示:
防污染阶段:50度2min
预变性阶段:95度10min
变性和延伸阶段:95度10s,60度40s,记录荧光信号,40个循环
Tm值通常比退火温度高2度左右,也就是在62度左右,一般我们设置一个比较宽泛的范围如50度到70度之间。
除此之外,引物要求5‘端以“A”或“T”开始,3’端以“C”或“G结尾,并且序列中不能有连续”A“或”G“或”C“或”T“,完成上述要求的python程序语言如下:
Alpha_only = []
f9_1 = open("Alpha_little_in_Beta_Gamma_Delta.txt","r")
n = 0
for line in f9_1.readlines():
n += 1
if n > 1:
Alpha_element = line.replace("\n","").split("\t")
Alpha_only.append(Alpha_element[0])
f9_1.close()
AA = len(Alpha_only)
Alpha0 = []
i = 0
a = ""
f = open("/home/lxh/Alpha.fasta","r")
for line in f.readlines():
if ">" in line:
i += 1
Alpha0.append(a)
a = ""
if ">" not in line:
a += line.replace("\n","")
del Alpha0[0]
f.close()
Alpha0_kermer = []
for mm in range(len(Alpha0[0]) - 20):
Alpha0_kermer.append(Alpha0[0][mm:(20 + mm)])
index_Alpha = {}
for i in range(AA - 1):
index = Alpha0_kermer.index(Alpha_only[i])
index_Alpha[Alpha_only[i]] = index
index_sort_A = sorted(index_Alpha.items(),key = lambda x:x[1],reverse = False)
Alpha_num = []
for x,y in index_sort_A:
Alpha_num.append(y)
f_primer = open("Alpha_primer.txt","a+")
NN = len(Alpha_num)
for e in range(NN - 1):
mgb = ''.join(Alpha0_kermer[Alpha_num[e]])
A = mgb.count("A")
G = mgb.count("G")
C = mgb.count("C")
T = mgb.count("T")
Tm1 = 4 * (G + C) + 2 * (A + T)
if (Tm1 < 70) & (Tm1 > 50) & ("AAA" not in mgb) & ("GGG" not in mgb) & ("CCC" not in mgb) & ("TTT" not in mgb) & ((mgb[0] == "A") or (mgb[0] == "T")) & ((mgb[19] == "C") or (mgb[19] == "G")):
number = 0
for r in range((e + 1),NN):
if ((Alpha_num[r] - Alpha_num[e]) > 55) & ((Alpha_num[r] - Alpha_num[e]) < 230):
mgb = ''.join(Alpha0_kermer[Alpha_num[r]])
A = mgb.count("A")
G = mgb.count("G")
C = mgb.count("C")
T = mgb.count("T")
Tm2 = 4 * (A + T) + 2 * (G + C)
if (Tm2 < 70) & (Tm2 > 50) & (abs(Tm2 - Tm1) < 10) & ("AAA" not in mgb) & ("GGG" not in mgb) & ("CCC" not in mgb) & ("TTT" not in mgb) & ((mgb[0] == "C") or (mgb[0] == "G")) & ((mgb[19] == "A") or (mgb[19] == "T")):
number += 1
f_primer.write(Alpha0_kermer[Alpha_num[r]] + "\n")
if number > 0:
f_primer.write("<" + Alpha0_kermer[Alpha_num[e]] + ">" + "\n" + str(number) + "\n")
f_primer.close()
如果候选引物序列少或者不存在,可适当对上述部分要求放宽松。
下面展示其中一个候选特异性检测Alpha毒株的引物:
Forward_Primer: 5'-ACAGTCATGTACTTAACATC-3'
Reverse_Primer: 5'-TTACCGATATCGATGCACTG-3'
扩增序列(168bp)为:
ACAGTCATGTACTTAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCACTTCTATTCTAAATGGTATATTAGAGTAGGAGCTATAAAATCAGCACCTTTAATTGAATTGTGCGTGGATGAGGCTGGTTCTTAATCACCCATTCAGTGCATCGATATCGGTAA
五、引物验证——特异性检验
-
物种特异性检验——NCBI blast
national center for biotechnology information,简称NCBI,是生命科学最重要的数据库之一,包含发表的文章、各种序列、结构等,还有其附加功能如:比对、注释信息、分类信息、SNP分析、引物在线设计等功能。我们利用NCBI的blast功能进行引物的比对,看其是否会mapping到其他物种。
比对结果全部为新冠基因组,特异性很好!一个检测新冠变异株试剂盒的研发项目,其最基础也是最重要的部分就完成了。