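I am using itertools.groupby to parse a short tab-delimited text file. The file has several columns, and all I want to do is group together all entries that have a particular value x in a particular column. The code below does this for the column named name2, looking for the value held in the variable x. I tried to do this with csv.DictReader and itertools.groupby. In the table, 8 rows match this criterion, so 8 entries should be returned. Instead, groupby returns two sets of entries, one with a single entry and another with 7, which seems like wrong behavior. Below I also do the matching manually on the same data and get the correct result: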
import itertools, operator, csv

# f (the input file path) and fieldnames are assumed to be defined earlier.
col_name = "name2"
x = "ENSMUSG00000002459"
print("looking for entries with value %s in column %s" % (x, col_name))

print("groupby gets it wrong: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == x:
        wrong_result = [e for e in entries]
        print("wrong result has %d entries" % (len(wrong_result)))

print("manually grouping entries is correct: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == x:
        correct_result.append(row)
print("correct result has %d entries" % (len(correct_result)))
The output I get is:
looking for entries with value ENSMUSG00000002459 in column name2
groupby gets it wrong:
wrong result has 7 entries
wrong result has 1 entries
manually grouping entries is correct:
correct result has 8 entries
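What is going on here? If groupby is really grouping, it seems I should get only one set of entries per x, but it returns two. I can't figure this out. Edit: ah, the input needs to be sorted first.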
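Solution:
groupby only groups consecutive items that share a key, so you will want to change your code to force the data into key order: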
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
sorted_data = sorted(data, key=operator.itemgetter(col_name))
# Group the sorted rows, not the already-consumed reader.
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass # whatever
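To see why the sort matters, here is a minimal sketch using made-up rows in place of the real file (the `rows` list, the keys "A" and "B", and the `group_sizes` helper are all hypothetical, and the code is written for Python 3). On unsorted input the same key comes back as more than one group; after sorting it comes back once.

```python
import itertools
import operator

# Hypothetical rows standing in for the parsed file; "A" is not contiguous.
rows = [
    {"name2": "A", "value": 1},
    {"name2": "B", "value": 2},
    {"name2": "A", "value": 3},
    {"name2": "A", "value": 4},
]
key = operator.itemgetter("name2")

def group_sizes(items):
    """Return (key, group size) pairs in the order groupby yields them."""
    return [(name, len(list(entries)))
            for name, entries in itertools.groupby(items, key=key)]

unsorted_groups = group_sizes(rows)                   # "A" appears twice
sorted_groups = group_sizes(sorted(rows, key=key))    # "A" appears once

print(unsorted_groups)
print(sorted_groups)
```

This matches the 7 + 1 split in the question: the matching rows were presumably not all adjacent in the file.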
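However, groupby is mainly useful when the dataset is large and already in key order. If you have to sort the data anyway, using a defaultdict is more efficient, because it groups the rows in a single pass with no sort: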
from collections import defaultdict
name_entries = defaultdict(list)
for row in data:
    name_entries[row[col_name]].append(row)
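For a version that can be run without the original file, the sketch below feeds csv.DictReader from an in-memory string. The sample text, the two column names in `fieldnames`, and the names `g1` to `g3` are assumptions made up for illustration, not the real data, and the code targets Python 3.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated content standing in for the real file.
text = "g1\tENSMUSG00000002459\ng2\tOTHER\ng3\tENSMUSG00000002459\n"
fieldnames = ["name", "name2"]  # assumed column names
col_name = "name2"

reader = csv.DictReader(io.StringIO(text), delimiter="\t", fieldnames=fieldnames)

# One pass over the rows; row order does not matter.
name_entries = defaultdict(list)
for row in reader:
    name_entries[row[col_name]].append(row)

# .get avoids creating an empty list for a key that was never seen.
matches = name_entries.get("ENSMUSG00000002459", [])
print(len(matches))
```

Each key maps to the full list of its rows, in file order, regardless of whether matching rows were adjacent.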