我们都知道SQL数据库的语句很简洁,python中的pandas库也很好用,但是如何将两者给关联起来???
本文将通过pandas来实现类似于SQL中的【过滤】、【排序】、【关联】、【合并】、【更新】、【删除】等操作。
用到的数据集
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
df = pd.DataFrame(boston.data,columns = boston.feature_names)
df['target'] =pd.Series(boston.target)
df
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
简单的字段查询
‘’‘SQL语句’’’
- SELECT CRIM,ZN,CHAS,NOX,RM,RAD FROM boston LIMIT 2;
实现返回每行记录的CRIM,ZN,CHAS,NOX,RM,RAD字段,返回2行。
search = df[["CRIM","ZN","CHAS","NOX","RM","RAD"]].head(2)
search
CRIM | ZN | CHAS | NOX | RM | RAD | |
---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 0.0 | 0.538 | 6.575 | 1.0 |
1 | 0.02731 | 0.0 | 0.0 | 0.469 | 6.421 | 2.0 |
简单的条件过滤 WHERE
‘’‘SQL语句’’’
- SELECT * FROM boston WHERE CHAS=1 LIMIT 2;
set(df['CHAS']) # 我们将CHAS特征作为分类标签
{0.0, 1.0}
search = df[df['CHAS']==1].head(2)
search
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
142 | 3.32105 | 0.0 | 19.58 | 1.0 | 0.871 | 5.403 | 100.0 | 1.3216 | 5.0 | 403.0 | 14.7 | 396.90 | 26.82 | 13.4 |
152 | 1.12658 | 0.0 | 19.58 | 1.0 | 0.871 | 5.012 | 88.0 | 1.6102 | 5.0 | 403.0 | 14.7 | 343.28 | 12.12 | 15.3 |
多条件与或查询WHERE AND|OR
如上满足SQL实现查询同时满足CHAS=1和CRIM>1.0两个条件的记录,返回两行
与关系&
‘’‘SQL语句’’’
- SELECT * FROM boston WHERE CHAS=1 AND CRIM>=1.0 LIMIT 2;
search = df[(df['CHAS']==1) & (df['CRIM'] >= 1.0)] .head(2) # 注意这里条件是用的圆括号!!!不是方括号
search
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
142 | 3.32105 | 0.0 | 19.58 | 1.0 | 0.871 | 5.403 | 100.0 | 1.3216 | 5.0 | 403.0 | 14.7 | 396.90 | 26.82 | 13.4 |
152 | 1.12658 | 0.0 | 19.58 | 1.0 | 0.871 | 5.012 | 88.0 | 1.6102 | 5.0 | 403.0 | 14.7 | 343.28 | 12.12 | 15.3 |
或关系|
‘’‘SQL语句’’’
- SELECT * FROM boston WHERE CHAS=1 OR CRIM>=1.0 LIMIT 2;
search = df[(df['CHAS']==1) | (df['CRIM'] >= 1.0)] .head(2) # 注意这里条件是用的圆括号!!!不是方括号
search
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | 1.05393 | 0.0 | 8.14 | 0.0 | 0.538 | 5.935 | 29.3 | 4.4986 | 4.0 | 307.0 | 21.0 | 386.85 | 6.58 | 23.1 |
20 | 1.25179 | 0.0 | 8.14 | 0.0 | 0.538 | 5.570 | 98.1 | 3.7979 | 4.0 | 307.0 | 21.0 | 376.57 | 21.02 | 13.6 |
条件过滤 控制判断
空判断 查询null的记录
‘’‘SQL语句’’’
- SELECT * FROM boston WHERE CHAS IS NULL;
search = df[df['CHAS'].isna()]# 注意这里条件是用的圆括号!!!不是方括号
search
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
---|
非空判断is not null
查询不为null的记录
‘’‘SQL语句’’’
- SELECT * FROM boston WHERE CHAS IS NOT NULL;
search = df[df['CHAS'].notna()]# 注意这里条件是用的圆括号!!!不是方括号
search
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
排序 ORDER BY ASC|DESC
‘’‘SQL语句’’’
- SELECT * FROM boston WHERE INDUS>10 ORDER BY DESC CHAS;
满足INDUS>10的值按照CHAS进行降序排列
search = df[(df['INDUS']>10)].sort_values(by='CHAS',ascending=False)
search
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
162 | 1.83377 | 0.0 | 19.58 | 1.0 | 0.605 | 7.802 | 98.2 | 2.0407 | 5.0 | 403.0 | 14.7 | 389.61 | 1.92 | 50.0 |
212 | 0.21719 | 0.0 | 10.59 | 1.0 | 0.489 | 5.807 | 53.8 | 3.6526 | 4.0 | 277.0 | 18.6 | 390.94 | 16.03 | 22.4 |
364 | 3.47428 | 0.0 | 18.10 | 1.0 | 0.718 | 8.780 | 82.9 | 1.9047 | 24.0 | 666.0 | 20.2 | 354.55 | 5.29 | 21.9 |
363 | 4.22239 | 0.0 | 18.10 | 1.0 | 0.770 | 5.803 | 89.0 | 1.9047 | 24.0 | 666.0 | 20.2 | 353.04 | 14.64 | 16.8 |
357 | 3.84970 | 0.0 | 18.10 | 1.0 | 0.770 | 6.395 | 91.0 | 2.5052 | 24.0 | 666.0 | 20.2 | 391.34 | 13.27 | 21.7 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
293 | 0.08265 | 0.0 | 13.92 | 0.0 | 0.437 | 6.127 | 18.4 | 5.5027 | 4.0 | 289.0 | 16.0 | 396.90 | 8.58 | 23.9 |
294 | 0.08199 | 0.0 | 13.92 | 0.0 | 0.437 | 6.009 | 42.3 | 5.5027 | 4.0 | 289.0 | 16.0 | 396.90 | 10.40 | 21.7 |
295 | 0.12932 | 0.0 | 13.92 | 0.0 | 0.437 | 6.678 | 31.1 | 5.9604 | 4.0 | 289.0 | 16.0 | 396.90 | 6.27 | 28.6 |
296 | 0.05372 | 0.0 | 13.92 | 0.0 | 0.437 | 6.549 | 51.0 | 5.9604 | 4.0 | 289.0 | 16.0 | 392.85 | 7.39 | 27.1 |
505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
236 rows × 14 columns
更新UPDATE
‘’‘SQL语句’’’
- UPDATE boston SET CHAS = 1 WHERE INDUS >10 AND ZN = 0
将满足INDUS >10 和ZN > 0的CHAS字段更新为1
df.loc[(df['INDUS']>10) & (df['ZN']==0),'CHAS']=1
df[df['INDUS']>=10]
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
70 | 0.08826 | 0.0 | 10.81 | 1.0 | 0.413 | 6.417 | 6.6 | 5.2873 | 4.0 | 305.0 | 19.2 | 383.73 | 6.72 | 24.2 |
71 | 0.15876 | 0.0 | 10.81 | 1.0 | 0.413 | 5.961 | 17.5 | 5.2873 | 4.0 | 305.0 | 19.2 | 376.94 | 9.88 | 21.7 |
72 | 0.09164 | 0.0 | 10.81 | 1.0 | 0.413 | 6.065 | 7.8 | 5.2873 | 4.0 | 305.0 | 19.2 | 390.91 | 5.52 | 22.8 |
73 | 0.19539 | 0.0 | 10.81 | 1.0 | 0.413 | 6.245 | 6.2 | 5.2873 | 4.0 | 305.0 | 19.2 | 377.17 | 7.54 | 23.4 |
74 | 0.07896 | 0.0 | 12.83 | 1.0 | 0.437 | 6.273 | 6.0 | 4.2515 | 5.0 | 398.0 | 18.7 | 394.92 | 6.78 | 24.1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 1.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 1.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 1.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 1.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 1.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
236 rows × 14 columns
分组统计
根据CHAS进行分组,返回CHAS和每组的数量
‘’‘SQL语句’’’
- SELECT CHAS,COUNT(*) FROM boston GROUP BY CHAS;
search = df.groupby('CHAS').size()
search
CHAS
0.0 261
1.0 245
dtype: int64
分组统计 聚合输出
根据CHAS进行分组,返回CHAS,每个组的INDUS和NOX的平均值、最大值
‘’‘SQL语句’’’
- SELECT CHAS,avg(INDUS),max(INDUS),avg(NOX),max(NOX) FROM boston GROUP BY CHAS
import numpy as np
search = df.groupby('CHAS').agg({'INDUS':[np.mean,np.max],'NOX':[np.mean,np.max]})
search
INDUS | NOX | |||
---|---|---|---|---|
mean | amax | mean | amax | |
CHAS | ||||
0.0 | 5.485479 | 15.04 | 0.481091 | 0.647 |
1.0 | 17.157143 | 27.74 | 0.633106 | 0.871 |
删除
‘’‘SQL语句’’’
- DELETE FROM boston WHERE WHERE TAX=305 AND RAD = 4;
drop = df.drop(df[(df['TAX']=305) & (df['RAD']=4)].index)
File "<ipython-input-90-fff36bdbf4c7>", line 1
drop = df.drop(df[(df['TAX']=305) & (df['RAD']=4)].index)
^
SyntaxError: invalid syntax