python - Search a column of strings for multiple substrings and return the substring's category

I have two DataFrames as follows:

import pandas as pd

df1 = pd.DataFrame({"id":["01", "02", "03", "04", "05", "06"],
                    "string":["This is a cat",
                              "That is a dog",
                              "Those are birds",
                              "These are bats",
                              "I drink coffee",
                              "I bought tea"]})

df2 = pd.DataFrame({"category":[1, 1, 2, 2, 3, 3],
                    "keywords":["cat", "dog", "birds", "bats", "coffee", "tea"]})

My DataFrames look like this:

DF1:

id   string
01   This is a cat
02   That is a dog
03   Those are birds
04   These are bats
05   I drink coffee
06   I bought tea

DF2:

category   keywords
1          cat
1          dog
2          birds
2          bats
3          coffee
3          tea

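I want an output column on df1 that holds the category when at least one keyword from df2 is detected in that row's string, and None otherwise. The expected output is shown below.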

id   string             category
01   This is a cat         1
02   That is a dog         1
03   Those are birds       2
04   These are bats        2
05   I drink coffee        3
06   I bought tea          3

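I could loop over the keywords one at a time and scan the strings for each one, but that becomes inefficient as the data grows. Could you suggest an improvement? Thanks.

Answer: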

# Modified your data a bit.
df1 = pd.DataFrame({"id":["01", "02", "03", "04", "05", "06", "07"],
                    "string":["This is a cat",
                              "That is a dog",
                              "Those are birds",
                              "These are bats",
                              "I drink coffee",
                              "I bought tea", 
                              "This won't match squat"]})

You can use a list comprehension that calls next with a default argument.

df1['category'] = [
    next((c for c, k in df2.values if k in s), None) for s in df1['string']] 

df1
   id                  string  category
0  01           This is a cat       1.0
1  02           That is a dog       1.0
2  03         Those are birds       2.0
3  04          These are bats       2.0
4  05          I drink coffee       3.0
5  06            I bought tea       3.0
6  07  This won't match squat       NaN
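
For readers unfamiliar with the idiom, here is a minimal sketch of how next behaves with a default on a generator expression, without pandas. The names pairs and first_category are hypothetical and introduced only for illustration; pairs stands in for df2.values.

```python
# Hypothetical stand-in for df2.values: (category, keyword) pairs.
pairs = [(1, "cat"), (1, "dog"), (2, "birds"),
         (2, "bats"), (3, "coffee"), (3, "tea")]

def first_category(s, pairs):
    """Return the category of the first keyword found in s, or None."""
    # The generator is lazy: next() pulls items only until the first match
    # and returns the default (None) if the generator is exhausted.
    return next((c for c, k in pairs if k in s), None)

print(first_category("That is a dog", pairs))           # 1
print(first_category("This won't match squat", pairs))  # None
```

Without the second argument, next would raise StopIteration for a string that matches no keyword, so the default is what produces the None (shown as NaN by pandas) in the last row.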

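You cannot avoid the O(N²) worst-case complexity (every string checked against every keyword), but this should be quite performant because next stops at the first matching keyword, so the inner loop does not always run over every keyword (except in the worst case, when nothing matches).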
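
If the Python-level inner loop turns out to be the bottleneck, one alternative (a different technique from the comprehension above, sketched here under assumptions) is to compile all keywords into a single regular expression and look up the matched keyword in a dict. The names keyword_to_category, pattern and categorize are hypothetical. Note that this returns the category of the leftmost keyword in the string rather than the first keyword in df2 order; the two agree for the sample data, where each string contains at most one keyword. Whether it is actually faster depends on the data, so it is worth timing both.

```python
import re

# Hypothetical stand-in for df2.values: (category, keyword) pairs.
pairs = [(1, "cat"), (1, "dog"), (2, "birds"),
         (2, "bats"), (3, "coffee"), (3, "tea")]

# Map each keyword to its category for O(1) lookup after a match.
keyword_to_category = {k: c for c, k in pairs}

# One alternation pattern. re.escape guards against regex metacharacters;
# longer keywords go first so overlapping keywords prefer the longest.
pattern = re.compile("|".join(
    re.escape(k) for k in sorted(keyword_to_category, key=len, reverse=True)))

def categorize(strings):
    """Return one category (or None) per input string."""
    result = []
    for s in strings:
        m = pattern.search(s)
        result.append(keyword_to_category[m.group(0)] if m else None)
    return result

strings = ["This is a cat", "That is a dog", "Those are birds",
           "These are bats", "I drink coffee", "I bought tea",
           "This won't match squat"]
print(categorize(strings))  # [1, 1, 2, 2, 3, 3, None]
```

With the DataFrames from the question, this could be applied as df1['category'] = categorize(df1['string']) after building pairs from df2.values.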

Note that this currently supports only plain substring matching (not regex-based matching, although that is possible with some modification).
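
As one possible form of that modification, the sketch below swaps the `k in s` test for a precompiled regular expression per keyword. It assumes whole-word matching is what you want, which the original answer does not specify; compiled and first_category_regex are hypothetical names.

```python
import re

# Hypothetical stand-in for df2.values: (category, keyword) pairs.
pairs = [(1, "cat"), (1, "dog"), (2, "birds"),
         (2, "bats"), (3, "coffee"), (3, "tea")]

# Precompile one whole-word pattern per keyword (assumption: \b word
# boundaries are the desired matching rule).
compiled = [(c, re.compile(r"\b" + re.escape(k) + r"\b")) for c, k in pairs]

def first_category_regex(s):
    """Return the category of the first keyword matching s as a whole word."""
    return next((c for c, p in compiled if p.search(s)), None)

print(first_category_regex("This is a cat"))       # 1
# Plain substring matching would return 1 here, because "cat" occurs
# inside "concatenate"; the word-boundary pattern does not match.
print(first_category_regex("Please concatenate"))  # None
```

The structure is otherwise the same as the comprehension in the answer, so it keeps the early exit on the first match.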
