5、运算、统计分析

5、运算、统计分析

实例:运算
import pandas as pd
a=pd.DataFrame({'name':['cindy','alice','ared'],'grade':[80,90,75],'age':[18,19,20]})
b=pd.DataFrame({'name':['cindy','alice','bob','ared'],'grade':[95,70,60,50]})
#表a的grade列计算极差
f=lambda x:x.max()-x.min()
print('a的grade列的极差:')
print(pd.DataFrame(a['grade']).apply(f,axis=0))

#默认按照索引相加,不匹配的索计算引结果为NaN
print('a+b:\n',a+b)
print('a与b的grade列相加:\n',pd.DataFrame(a['grade']+b['grade']))
print(a.loc[:,['grade']]+b.loc[:,['grade']])
#不同索引的相同字段值对应的指标相加:筛选并重置索引,然后计算
print('a的ared+b的ared:\n',pd.DataFrame(a.loc[a['name']=='ared',['grade']]).reset_index(drop=True)+pd.DataFrame(b.loc[b['name']=='ared',['grade']]).reset_index(drop=True))
#将计算结果的空值填充为0
print('将计算结果的空值填充为0:\n',(a+b).fillna(0))
#将不存在的索引对应的值填充为0,然后计算
print('将不存在的索引对应的值填充为0,然后计算:\n',pd.DataFrame(a['grade'].add(b['grade'],fill_value=0)))

runfile('C:/Users/xyy/.spyder-py3/temp.py', wdir='C:/Users/xyy/.spyder-py3')
a的grade列的极差:
grade    15
dtype: int64
a+b:
    age  grade        name
0  NaN  175.0  cindycindy
1  NaN  160.0  alicealice
2  NaN  135.0     aredbob
3  NaN    NaN         NaN
a与b的grade列相加:
    grade
0  175.0
1  160.0
2  135.0
3    NaN
   grade
0  175.0
1  160.0
2  135.0
3    NaN
a的ared+b的ared:
    grade
0    125
将计算结果的空值填充为0:
    age  grade        name
0  0.0  175.0  cindycindy
1  0.0  160.0  alicealice
2  0.0  135.0     aredbob
3  0.0    0.0           0
将不存在的索引对应的值填充为0,然后计算:
    grade
0  175.0
1  160.0
2  135.0
3   50.0
实例:计算两表中相同名称对应的指标和
import pandas as pd
a=pd.DataFrame({'name':['cindy','alice','ared'],'grade':[80,90,75],'age':[18,19,20]})
b=pd.DataFrame({'name':['cindy','alice','bob','ared'],'grade':[95,70,60,50]})
print('法1:按索引计算两表的指标列的和,然后合并两表,将合并结果与指标列的和再通过索引合并')
a_b=pd.merge(a,b,how='right',left_index=True,right_index=True,suffixes=('_a','_b')).fillna(0)
print('a右连接b:\n',a_b)
print('将不存在的索引对应的值填充为0,然后计算:\n',pd.merge(a_b,pd.DataFrame(a['grade'].add(b['grade'],fill_value=0)),left_index=True,right_index=True))
print('\n法2:直接合并表,并新增求和字段')
a_b['grade_sum']=a_b['grade_a']+a_b['grade_b']
print(a_b)

runfile('C:/Users/xyy/.spyder-py3/temp.py', wdir='C:/Users/xyy/.spyder-py3')
法1:按索引计算两表的指标列的和,然后合并两表,将合并结果与指标列的和再通过索引合并
a右连接b:
   name_a  grade_a   age name_b  grade_b
0  cindy     80.0  18.0  cindy       95
1  alice     90.0  19.0  alice       70
2   ared     75.0  20.0    bob       60
3      0      0.0   0.0   ared       50
将不存在的索引对应的值填充为0,然后计算:
   name_a  grade_a   age name_b  grade_b  grade
0  cindy     80.0  18.0  cindy       95  175.0
1  alice     90.0  19.0  alice       70  160.0
2   ared     75.0  20.0    bob       60  135.0
3      0      0.0   0.0   ared       50   50.0

法2:直接合并表,并新增求和字段
  name_a  grade_a   age name_b  grade_b  grade_sum
0  cindy     80.0  18.0  cindy       95      175.0
1  alice     90.0  19.0  alice       70      160.0
2   ared     75.0  20.0    bob       60      135.0
3      0      0.0   0.0   ared       50       50.0
实例:分类统计
import pandas as pd
import numpy as np
a=pd.DataFrame({'sex':['female','female','male','female'],'name':['cindy','alice','ared','cindy'],'grade':[80,90,75,70],'sale':[18,19,20,23]})
print('各人最大值:\n',a.groupby('name').max())
print('各性别各人最小值:\n',a.groupby(['sex','name']).min())
print('各性别各人销售总额:\n',a.groupby(['sex','name'])['sale'].sum())
print('各性别各人的指标个数、指标和、指标均值:\n',a.groupby(['sex','name'])['grade','sale'].agg([len,np.sum,np.mean]))
print('透视:\n',pd.pivot_table(a,index=['name'],columns=['sex'],values=['grade','sale'],aggfunc=[np.sum,np.mean],fill_value=0,margins=True))
print('透视总分、平均销售:\n',pd.pivot_table(a,index=['name'],columns=['sex'],values=['grade','sale'],aggfunc={'grade':np.sum,'sale':np.mean},fill_value=0,margins=True))

runfile('C:/Users/xyy/.spyder-py3/temp.py', wdir='C:/Users/xyy/.spyder-py3')
各人最大值:
           sex  grade  sale
name                      
alice  female     90    19
ared     male     75    20
cindy  female     80    23
各性别各人最小值:
               grade  sale
sex    name              
female alice     90    19
       cindy     70    18
male   ared      75    20
各性别各人销售总额:
 sex     name 
female  alice    19
        cindy    41
male    ared     20
Name: sale, dtype: int64
各性别各人的指标个数、指标和、指标均值:
              grade           sale          
               len  sum mean  len sum  mean
sex    name                                
female alice     1   90   90    1  19  19.0
       cindy     2  150   75    2  41  20.5
male   ared      1   75   75    1  20  20.0
透视:
          sum                             mean                             
       grade             sale           grade              sale           
sex   female male  All female male All female male   All female male   All
name                                                                      
alice     90    0   90     19    0  19     90    0  90.0   19.0    0  19.0
ared       0   75   75      0   20  20      0   75  75.0    0.0   20  20.0
cindy    150    0  150     41    0  41     75    0  75.0   20.5    0  20.5
All      240   75  315     60   20  80     80   75  78.0   20.0   20  20.0
透视总分、平均分、平均销售:
        grade             sale           
sex   female male  All female male   All
name                                    
alice     90    0   90   19.0    0  19.0
ared       0   75   75    0.0   20  20.0
cindy    150    0  150   20.5    0  20.5
All      240   75  315   20.0   20  20.0
上一篇:08 学生课程分数的Spark SQL分析


下一篇:SQL命令:查找表中的数据,删除表中的数据