pandas的数据库操作-筛选数据

我们都知道SQL数据库的语句很简洁,python中的pandas库也很好用,但是如何将两者给关联起来???

本文将通过pandas来实现类似于SQL中的【过滤】、【排序】、【关联】、【合并】、【更新】、【删除】等操作。

用到的数据集

from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
df = pd.DataFrame(boston.data,columns = boston.feature_names)
df['target'] =pd.Series(boston.target) 

df
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

506 rows × 14 columns

简单的字段查询

‘’‘SQL语句’’’

  • SELECT CRIM,ZN,CHAS,NOX,RM,RAD FROM boston LIMIT 2;

实现返回每行记录的CRIM,ZN,CHAS,NOX,RM,RAD字段,返回2行。

search = df[["CRIM","ZN","CHAS","NOX","RM","RAD"]].head(2)
search
CRIM ZN CHAS NOX RM RAD
0 0.00632 18.0 0.0 0.538 6.575 1.0
1 0.02731 0.0 0.0 0.469 6.421 2.0

简单的条件过滤 WHERE

‘’‘SQL语句’’’

  • SELECT * FROM boston WHERE CHAS=1 LIMIT 2;
set(df['CHAS']) # 我们将CHAS特征作为分类标签
{0.0, 1.0}
search = df[df['CHAS']==1].head(2)
search
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
142 3.32105 0.0 19.58 1.0 0.871 5.403 100.0 1.3216 5.0 403.0 14.7 396.90 26.82 13.4
152 1.12658 0.0 19.58 1.0 0.871 5.012 88.0 1.6102 5.0 403.0 14.7 343.28 12.12 15.3

多条件与或查询WHERE AND|OR

如上满足SQL实现查询同时满足CHAS=1和CRIM>1.0两个条件的记录,返回两行

与关系&

‘’‘SQL语句’’’

  • SELECT * FROM boston WHERE CHAS=1 AND CRIM>=1.0 LIMIT 2;
search = df[(df['CHAS']==1) & (df['CRIM'] >= 1.0)] .head(2) # 注意这里条件是用的圆括号!!!不是方括号
search
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
142 3.32105 0.0 19.58 1.0 0.871 5.403 100.0 1.3216 5.0 403.0 14.7 396.90 26.82 13.4
152 1.12658 0.0 19.58 1.0 0.871 5.012 88.0 1.6102 5.0 403.0 14.7 343.28 12.12 15.3

或关系|

‘’‘SQL语句’’’

  • SELECT * FROM boston WHERE CHAS=1 OR CRIM>=1.0 LIMIT 2;
search = df[(df['CHAS']==1) | (df['CRIM'] >= 1.0)] .head(2) # 注意这里条件是用的圆括号!!!不是方括号
search
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
16 1.05393 0.0 8.14 0.0 0.538 5.935 29.3 4.4986 4.0 307.0 21.0 386.85 6.58 23.1
20 1.25179 0.0 8.14 0.0 0.538 5.570 98.1 3.7979 4.0 307.0 21.0 376.57 21.02 13.6

条件过滤 控制判断

空判断 查询null的记录

‘’‘SQL语句’’’

  • SELECT * FROM boston WHERE CHAS IS NULL;
search = df[df['CHAS'].isna()]# 注意这里条件是用的圆括号!!!不是方括号
search
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target

非空判断is not null

查询不为null的记录

‘’‘SQL语句’’’

  • SELECT * FROM boston WHERE CHAS IS NOT NULL;
search = df[df['CHAS'].notna()]# 注意这里条件是用的圆括号!!!不是方括号
search
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

506 rows × 14 columns

排序 ORDER BY ASC|DESC

‘’‘SQL语句’’’

  • SELECT * FROM boston WHERE INDUS>10 ORDER BY DESC CHAS;

满足INDUS>10的值按照CHAS进行降序排列

search = df[(df['INDUS']>10)].sort_values(by='CHAS',ascending=False)
search
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
162 1.83377 0.0 19.58 1.0 0.605 7.802 98.2 2.0407 5.0 403.0 14.7 389.61 1.92 50.0
212 0.21719 0.0 10.59 1.0 0.489 5.807 53.8 3.6526 4.0 277.0 18.6 390.94 16.03 22.4
364 3.47428 0.0 18.10 1.0 0.718 8.780 82.9 1.9047 24.0 666.0 20.2 354.55 5.29 21.9
363 4.22239 0.0 18.10 1.0 0.770 5.803 89.0 1.9047 24.0 666.0 20.2 353.04 14.64 16.8
357 3.84970 0.0 18.10 1.0 0.770 6.395 91.0 2.5052 24.0 666.0 20.2 391.34 13.27 21.7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
293 0.08265 0.0 13.92 0.0 0.437 6.127 18.4 5.5027 4.0 289.0 16.0 396.90 8.58 23.9
294 0.08199 0.0 13.92 0.0 0.437 6.009 42.3 5.5027 4.0 289.0 16.0 396.90 10.40 21.7
295 0.12932 0.0 13.92 0.0 0.437 6.678 31.1 5.9604 4.0 289.0 16.0 396.90 6.27 28.6
296 0.05372 0.0 13.92 0.0 0.437 6.549 51.0 5.9604 4.0 289.0 16.0 392.85 7.39 27.1
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

236 rows × 14 columns

更新UPDATE

‘’‘SQL语句’’’

  • UPDATE boston SET CHAS = 1 WHERE INDUS >10 AND ZN = 0

将满足INDUS >10 和ZN > 0的CHAS字段更新为1


df.loc[(df['INDUS']>10) & (df['ZN']==0),'CHAS']=1
df[df['INDUS']>=10]
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
70 0.08826 0.0 10.81 1.0 0.413 6.417 6.6 5.2873 4.0 305.0 19.2 383.73 6.72 24.2
71 0.15876 0.0 10.81 1.0 0.413 5.961 17.5 5.2873 4.0 305.0 19.2 376.94 9.88 21.7
72 0.09164 0.0 10.81 1.0 0.413 6.065 7.8 5.2873 4.0 305.0 19.2 390.91 5.52 22.8
73 0.19539 0.0 10.81 1.0 0.413 6.245 6.2 5.2873 4.0 305.0 19.2 377.17 7.54 23.4
74 0.07896 0.0 12.83 1.0 0.437 6.273 6.0 4.2515 5.0 398.0 18.7 394.92 6.78 24.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 1.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 1.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 1.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 1.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 1.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

236 rows × 14 columns

分组统计

根据CHAS进行分组,返回CHAS和每组的数量

‘’‘SQL语句’’’

  • SELECT CHAS,COUNT(*) FROM boston GROUP BY CHAS;
search = df.groupby('CHAS').size()
search
CHAS
0.0    261
1.0    245
dtype: int64

分组统计 聚合输出

根据CHAS进行分组,返回CHAS,每个组的INDUS和NOX的平均值、最大值

‘’‘SQL语句’’’

  • SELECT CHAS,avg(INDUS),max(INDUS),avg(NOX),max(NOX) FROM boston GROUP BY CHAS
import numpy as np
search = df.groupby('CHAS').agg({'INDUS':[np.mean,np.max],'NOX':[np.mean,np.max]})
search
INDUS NOX
mean amax mean amax
CHAS
0.0 5.485479 15.04 0.481091 0.647
1.0 17.157143 27.74 0.633106 0.871

删除

‘’‘SQL语句’’’

  • DELETE FROM boston WHERE WHERE TAX=305 AND RAD = 4;
drop = df.drop(df[(df['TAX']=305) & (df['RAD']=4)].index)
  File "<ipython-input-90-fff36bdbf4c7>", line 1
    drop = df.drop(df[(df['TAX']=305) & (df['RAD']=4)].index)
                                ^
SyntaxError: invalid syntax
上一篇:Java8-lambda表达式


下一篇:收集Java8 Lambda map-reduce代码