I have two DataFrames as follows:
import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea"]})
df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})
My DataFrames look like this:
DF1:
id string
01 This is a cat
02 That is a dog
03 Those are birds
04 These are bats
05 I drink coffee
06 I bought tea
DF2:
category keywords
1 cat
1 dog
2 birds
2 bats
3 coffee
3 tea
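I want an output column on df1 that holds the category if at least one keyword from df2 is detected in that row's string, and None otherwise. The expected output is shown below.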
id string category
01 This is a cat 1
02 That is a dog 1
03 Those are birds 2
04 These are bats 2
05 I drink coffee 3
06 I bought tea 3
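I could loop over the keywords one by one and scan each string in turn, but that becomes inefficient as the data grows. Could you suggest an improvement? Thanks.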
Solution:
# Modified your data a bit.
df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06", "07"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea",
                               "This won't match squat"]})
You can use a list comprehension that calls next with a default argument.
df1['category'] = [
    next((c for c, k in df2.values if k in s), None) for s in df1['string']
]
df1
id string category
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought tea 3.0
6 07 This won't match squat NaN
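You cannot avoid the quadratic complexity, O(N·M) for N strings and M keywords, but this should perform well because next stops at the first keyword that matches, so the inner loop does not always have to iterate over every keyword (except in the worst case).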
Note that this currently supports only substring matching (not regex-based matching, although that is possible with some modification).
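One possible way to make that regex-based modification is sketched below. It is not part of the answer above, only an illustration under a few assumptions: the keywords are escaped and joined into a single alternation pattern, the word boundaries (\b) are my own addition so that "cat" would no longer match inside a longer word, and the names kw_to_cat, pattern and matched are hypothetical. It also behaves slightly differently from the list comprehension: when a string contains several keywords, the one that appears first in the string wins, not the one listed first in df2.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06", "07"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea",
                               "This won't match squat"]})
df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})

# Hypothetical helper: look up the category of a keyword.
kw_to_cat = dict(zip(df2["keywords"], df2["category"]))

# A single pattern with one capture group: \b(cat|dog|...)\b
# re.escape guards against keywords that contain regex metacharacters.
pattern = r"\b(" + "|".join(re.escape(k) for k in kw_to_cat) + r")\b"

# str.extract returns the first captured keyword per row, or NaN if none matches.
matched = df1["string"].str.extract(pattern, expand=False)

# Translate each matched keyword into its category; unmatched rows stay NaN.
df1["category"] = matched.map(kw_to_cat)
print(df1)
```

As with the list comprehension, rows without a match end up as NaN, so the column is stored as floats. Dropping the two \b markers should give plain substring behaviour again.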