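I am a biologist trying to validate an experiment. In my experiment, I found 71 mutations after a particular treatment. To determine whether these mutations are really due to my treatment, I want to compare them with a set of randomly generated mutations. It was suggested to me that I might try generating one million sets of 71 random mutations for a statistical comparison.
To begin with, I have a data frame containing the 7000 genes in the genome of interest. I know their start and end positions. The first five rows of the data frame look like this: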
       transcript_id  protein_id  start   end  kogClass
    0          g2.t1      695054      1  1999  Replication, recombination and repair
    1          g3.t1      630170   2000  3056  General function prediction only
    2          g5.t1      695056   3057  4087  Signal transduction mechanisms
    3          g6.t1      671982   4088  5183  N/A
    4          g7.t1      671985   5184  8001  Chromatin structure and dynamics
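Now for the one million sets of 71 random mutations: I wrote a function and call it a million times, but this does not seem very efficient, because after 4 hours it was only 1/10 of the way through. Here is my code. If anyone can suggest a way to speed things up, I owe you a beer! And my appreciation.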
    from random import randint

    def get_71_random_genes(df, outfile):
        # how many nucleotides are there in all transcripts?
        end_pos_last_gene = df.iloc[-1,3]
        # this loop will go 71 times
        for i in range(71):
            # generate a number from 1 to the end of all transcripts
            random_number = randint(1, end_pos_last_gene)
            # this is the boolean condition - checks which gene a random number falls within
            mask = (df['start'] <= random_number) & (df['end'] >= random_number)
            # collect the rows that match
            data = df.loc[mask]
            # write data to file.
            data.to_csv(outfile, sep='\t', index=False, header=False)
Solution:
I'm pretty sure that all the following does:
    for i in range(71):
        # generate a number from 1 to the end of all transcripts
        random_number = randint(1, end_pos_last_gene)
        # this is the boolean condition - checks which gene a random number falls within
        mask = (df['start'] <= random_number) & (df['end'] >= random_number)
        # collect the rows that match
        data = df.loc[mask]
        # write data to file.
        data.to_csv(outfile, sep='\t', index=False, header=False)
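is select 71 random rows from the data frame with replacement. Note that this takes forever, because every time you do
    (df['start'] <= random_number) & (df['end'] >= random_number)
you iterate over the entire data frame three times, and then once more when you do:
    data = df.loc[mask]
This is an extremely inefficient way of sampling rows. You could do this much more efficiently by randomly sampling 71 indices and then using those indices directly on the data frame (which would not require even a single full pass over the data frame). But you don't need to do that: pd.DataFrame objects already implement an efficient sample method, so observe: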
    In [12]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)), columns=["c%d"%d for d in range(10)])

    In [13]: df
    Out[13]:
       c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
    0  13   0  19   5   6  17   5  14   5  15
    1   2   4   0  16  19  11  16   3  11   1
    2  18   3   1  18  12   9  13   2  18  12
    3   2   6  14  12   1   2  19  16   0  14
    4  17   5   6  13   7  15  10  18  13   8
    5   7  19  18   3   1  11  14   6  13  16
    6  13   5  11   0   2  15   7  11   0   2
    7   0  19  11   3  19   3   3   9   8  10
    8   6   8   9   3  12  18  19   8  11   2
    9   8  17  16   0   8   7  17  11  11   0

    In [14]: df.sample(3, replace=True)
    Out[14]:
       c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
    0  13   0  19   5   6  17   5  14   5  15
    3   2   6  14  12   1   2  19  16   0  14
    3   2   6  14  12   1   2  19  16   0  14

    In [15]: df.sample(3, replace=True)
    Out[15]:
       c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
    9   8  17  16   0   8   7  17  11  11   0
    4  17   5   6  13   7  15  10  18  13   8
    2  18   3   1  18  12   9  13   2  18  12

    In [16]: df.sample(3, replace=True)
    Out[16]:
       c0  c1  c2  c3  c4  c5  c6  c7  c8  c9
    3   2   6  14  12   1   2  19  16   0  14
    8   6   8   9   3  12  18  19   8  11   2
    4  17   5   6  13   7  15  10  18  13   8
So just replace that loop with:
    df.sample(71, replace=True).to_csv(outfile, sep='\t', index=False, header=False)
Note that this also cuts down on I/O overhead!
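One caveat, offered tentatively: the loop in the question picks a nucleotide position first and then looks up its gene, so longer genes are hit more often, whereas `df.sample` with default arguments gives every row the same probability. If length-proportional sampling is what the experiment needs, the `weights` argument of `DataFrame.sample` can provide it. Below is a minimal sketch, assuming inclusive start/end coordinates; the toy table `genes` and the name `gene_length` are illustrative only and not part of the original code.

```python
import pandas as pd

# Toy stand-in for the gene table (coordinates copied from the rows shown in the question).
genes = pd.DataFrame({
    "transcript_id": ["g2.t1", "g3.t1", "g5.t1", "g6.t1", "g7.t1"],
    "start": [1, 2000, 3057, 4088, 5184],
    "end": [1999, 3056, 4087, 5183, 8001],
})

# Gene length in nucleotides, assuming both start and end are inclusive.
gene_length = genes["end"] - genes["start"] + 1

# Weights do not have to sum to 1; pandas normalizes them internally.
# random_state is fixed here only to make the sketch reproducible.
draw = genes.sample(71, replace=True, weights=gene_length, random_state=0)
```

With this, a gene that is twice as long is drawn about twice as often, which mirrors what the position-based loop does.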
So, just to do a quick test:
    In [4]: import time
       ...: start = time.time()
       ...: with open('test.csv', 'w') as f:
       ...:     for _ in range(1000):
       ...:         df.sample(71, replace=True).to_csv(f, header=None, index=False)
       ...: stop = time.time()
       ...:

    In [5]: stop - start
    Out[5]: 0.789172887802124
So, extrapolating linearly, I would guesstimate that 1,000,000 iterations will take about:
    In [8]: (stop - start) * 1000
    Out[8]: 789.172887802124
seconds, or about 13 minutes.
    In [10]: !wc -l test.csv
    71000 test.csv
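For completeness, the index-based alternative mentioned earlier (sample 71 row positions, then index the data frame once) might look roughly like the sketch below. It is only an illustration: the names `rng`, `positions` and `picked` are made up here, and the random test frame simply mimics the one used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.integers(0, 20, (10, 10)),
                  columns=["c%d" % d for d in range(10)])

# Draw 71 row positions with replacement (upper bound is exclusive),
# then pull all rows out in a single positional lookup.
positions = rng.integers(0, len(df), size=71)
picked = df.iloc[positions]
```

This touches only the sampled rows rather than scanning every row per draw, which is the point made about avoiding a full pass over the data frame.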
Edited to add a more efficient approach
So, create an array that maps each nucleotide position to an index in the data frame:
    size = df.end.max()
    nucleotide_array = np.zeros(size, dtype=np.int64)  # this could get out of hand without being careful of our size
    for row in df.itertuples():  # might be a little slow, but it's a one-time upfront cost
        i = row.start - 1
        j = row.end
        nucleotide_array[i:j] = row.Index

    # sampling scheme:
    with open('test.csv', 'w') as f:
        for _ in range(1000):  # however many experiments
            snps = np.random.choice(nucleotide_array, 71, replace=True)
            df.loc[snps].to_csv(f, header=None, index=False)
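Note that the above is just a quick sketch and has not really been tested. It makes assumptions (for example, that the data frame has an integer index and that every position from 1 to the last end falls inside exactly one gene), but I think they hold, and in any case you could easily adjust your df to make it work.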
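If the per-nucleotide array becomes too large for a real genome, a different technique, binary search with `np.searchsorted` over the sorted gene starts, can map positions to genes without it. This is a swapped-in alternative, not the method above, and only a sketch: it assumes the genes are sorted by `start` and do not overlap, and the names `genes`, `n_experiments`, `gene_idx` and `in_gene` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the gene table, sorted by start and non-overlapping (assumption).
genes = pd.DataFrame({
    "transcript_id": ["g2.t1", "g3.t1", "g5.t1", "g6.t1", "g7.t1"],
    "start": [1, 2000, 3057, 4088, 5184],
    "end": [1999, 3056, 4087, 5183, 8001],
})
starts = genes["start"].to_numpy()
ends = genes["end"].to_numpy()

rng = np.random.default_rng(0)  # seeded only for reproducibility
n_experiments = 1000            # however many experiments

# One row per experiment, 71 random positions each, drawn from 1..last end inclusive.
positions = rng.integers(1, ends[-1] + 1, size=(n_experiments, 71))

# For each position, the row number of the last gene whose start is <= position.
gene_idx = np.searchsorted(starts, positions, side="right") - 1

# False wherever a position lands in a gap between genes (none in this toy table).
in_gene = positions <= ends[gene_idx]
```

The rows hit in experiment `k` would then be `genes.iloc[gene_idx[k]]`. All draws are generated in one vectorized step, so the Python-level loop is only needed for writing results out.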