Python:将数据结构索引集以查找超集?

在我的Python项目中,我有一个对象池,每个对象都用一组单词标记.我想生成所有集,包括映射到链接对象的标签子集.这些子集不应小于任何项目的完整标签集.例如,假设这些对象带有其标签:

apple: fruit, green, nature
sometree: tree, green, wood, nature
banana: fruit, nature
someplant: green, wood, nature
otherplant: green, wood, nature

结果应为:

(fruit, nature): banana, apple
(fruit, green, nature): apple
(green, wood, nature): sometree, someplant, otherplant
(green, wood, nature, tree): sometree

我不希望结果包含不存在的标记集,而标记集是至少一个对象的完整标记集.

为了达到这个目的,我想出了O(n²)算法,但想知道是否还有其他更聪明的方法,例如一些基于树的索引或前缀三叉戟?我不会有太多对象,也许是2000.但是,它应该很快.

我当前的解决方案迭代所有对象以创建将所有标签集映射到对象的字典.在第二次迭代中,我测试每个标签集是否是其他任何标签集的子集,如果是,请记住超集的对象.

更新:在上面的O(n²)中,n个引用了对象的数量.我将有许多对象,每个对象都有大约五个标签.

解决方案:感谢所有答复.我最终使用了azorius的方法,因为它既快速又健壮.这是列出所有组的完整示例:

tagged = { 
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}   

tag_sets = set()
tags_to_objects = {}

for obj, tags in tagged.items():
    for tag in tags:
        try:
            tags_to_objects[tag].add(obj)
        except KeyError:
            tags_to_objects[tag] = set([obj])
    # Record all tag sets
    tag_set = frozenset(tags)
    tag_sets.add(tag_set)

groups = {}
for tag_set in tag_sets:
    objects = None
    for tag in tag_set:
        if objects:
            objects = objects.intersection(tags_to_objects[tag])
        else:
            objects = tags_to_objects[tag]
    groups[tag_set] = objects

for k,v in groups.items():
    print '(%s): %s' % (', '.join(k), ', '.join(v))

由于那里有几种解决方案,我对1035个对象进行了一些时序测试,每个对象带有大约五个标签:

> azorius代码:127ms
>布鲁斯代码:115ms
>药丸密码:137ms

在我看来,pillmuncher的代码是最优雅的.

解决方法:

为什么不使用集合?根据http://wiki.python.org/moin/TimeComplexity,集合具有以下平均时间复杂度O(min(len(s),len(t)),aka O(n)!

fruit = set(("apple", "banana"))
green = set(("apple", "sometree", "someplant", "otherplant"))
nature = set(("apple", "banana", "sometree", "someplant", "otherplant"))
tree = set (("sometree",))
wood = set (("sometree", "someplant", "otherplant"))

In [90]: fruit.intersection(nature)
Out[90]: set(['apple', 'banana'])

In [91]: fruit.intersection(nature).intersection(green)
Out[91]: set(['apple'])

In [92]: green.intersection(wood).intersection(nature)
Out[92]: set(['sometree', 'someplant', 'otherplant'])

In [93]: green.intersection(wood).intersection(nature).intersection(tree)
Out[93]: set(['sometree'])

回顾这段代码,一个更优雅的答案是使用reduce:

In [90]: reduce(set.intersection, [fruit, nature])
Out[90]: set(['apple', 'banana'])

In [91]: reduce(set.intersection, [fruit, nature, green])
Out[91]: set(['apple'])

In [92]: reduce(set.intersection, [green, wood, nature])
Out[92]: set(['sometree', 'someplant', 'otherplant'])

In [93]: reduce(set.intersection, [green, wood, nature, tree])
Out[93]: set(['sometree'])
上一篇:在mysql表上创建索引


下一篇:mysql-添加FULLTEXT索引:没有太大区别,它将对较旧的数据进行索引吗?