Pandas数据分析----缺失值统计与分析

工具包导入

(9月29号(组内)–数据分析)

import pandas as pd
print(pd.__version__)
1.2.4

数据载入

data1 = pd.read_csv('./datasets1/location_object.csv')
print(data1.head(5))
             TIME  K1-1  K1-2  K1-3  K1-4  K1-5  K1-6  K2-1  K2-2  K2-3  ...  \
0  2020/7/19 0:00   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
1  2020/7/19 1:00   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
2  2020/7/19 2:00   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
3  2020/7/19 3:00   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
4  2020/7/19 4:00   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   

   K4-5  K4-6  K4-7  K4-8  K5-1  K5-2  K5-3  K5-4  K5-5  K5-6  
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
4   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  

[5 rows x 39 columns]

数据概览

data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   TIME    624 non-null    object 
 1   K1-1    81 non-null     float64
 2   K1-2    43 non-null     float64
 3   K1-3    74 non-null     float64
 4   K1-4    43 non-null     float64
 5   K1-5    66 non-null     float64
 6   K1-6    19 non-null     float64
 7   K2-1    31 non-null     float64
 8   K2-2    34 non-null     float64
 9   K2-3    130 non-null    float64
 10  K2-4    119 non-null    float64
 11  K2-5    77 non-null     float64
 12  K2-6    105 non-null    float64
 13  K2-7    59 non-null     float64
 14  K2-8    24 non-null     float64
 15  K3-1-1  49 non-null     float64
 16  K3-1-2  22 non-null     float64
 17  K3-2    91 non-null     float64
 18  K3-3    78 non-null     float64
 19  K3-4    85 non-null     float64
 20  K3-5    102 non-null    float64
 21  K3-6    79 non-null     float64
 22  K3-7    90 non-null     float64
 23  K3-8-1  15 non-null     float64
 24  K3-8-2  13 non-null     float64
 25  K4-1    47 non-null     float64
 26  K4-2    63 non-null     float64
 27  K4-3    128 non-null    float64
 28  K4-4    61 non-null     float64
 29  K4-5    58 non-null     float64
 30  K4-6    96 non-null     float64
 31  K4-7    28 non-null     float64
 32  K4-8    41 non-null     float64
 33  K5-1    54 non-null     float64
 34  K5-2    72 non-null     float64
 35  K5-3    65 non-null     float64
 36  K5-4    51 non-null     float64
 37  K5-5    68 non-null     float64
 38  K5-6    30 non-null     float64
dtypes: float64(38), object(1)
memory usage: 190.2+ KB
# 数据统计描述
data1.describe()
K1-1 K1-2 K1-3 K1-4 K1-5 K1-6 K2-1 K2-2 K2-3 K2-4 ... K4-5 K4-6 K4-7 K4-8 K5-1 K5-2 K5-3 K5-4 K5-5 K5-6
count 81.0 43.0 74.0 43.0 66.0 19.0 31.0 34.0 130.0 119.0 ... 58.0 96.0 28.0 41.0 54.0 72.0 65.0 51.0 68.0 30.0
mean 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
min 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
25% 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
50% 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
75% 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
max 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

8 rows × 38 columns

缺失值统计

# 缺失值可视化
import missingno as msn
msn.matrix(data1)
<matplotlib.axes._subplots.AxesSubplot at 0x268bd314a08>

Pandas数据分析----缺失值统计与分析

统计每一列的缺失值个数

# 非空值统计,求取每个列的非空值个数
print(data1.count())
TIME      624
K1-1       81
K1-2       43
K1-3       74
K1-4       43
K1-5       66
K1-6       19
K2-1       31
K2-2       34
K2-3      130
K2-4      119
K2-5       77
K2-6      105
K2-7       59
K2-8       24
K3-1-1     49
K3-1-2     22
K3-2       91
K3-3       78
K3-4       85
K3-5      102
K3-6       79
K3-7       90
K3-8-1     15
K3-8-2     13
K4-1       47
K4-2       63
K4-3      128
K4-4       61
K4-5       58
K4-6       96
K4-7       28
K4-8       41
K5-1       54
K5-2       72
K5-3       65
K5-4       51
K5-5       68
K5-6       30
dtype: int64
# 上述统计形式等价于
print(data1.count(axis=0))
TIME      624
K1-1       81
K1-2       43
K1-3       74
K1-4       43
K1-5       66
K1-6       19
K2-1       31
K2-2       34
K2-3      130
K2-4      119
K2-5       77
K2-6      105
K2-7       59
K2-8       24
K3-1-1     49
K3-1-2     22
K3-2       91
K3-3       78
K3-4       85
K3-5      102
K3-6       79
K3-7       90
K3-8-1     15
K3-8-2     13
K4-1       47
K4-2       63
K4-3      128
K4-4       61
K4-5       58
K4-6       96
K4-7       28
K4-8       41
K5-1       54
K5-2       72
K5-3       65
K5-4       51
K5-5       68
K5-6       30
dtype: int64

统计某一列的缺失值个数

print(data1[['K5-6']].count())
K5-6    30
dtype: int64

统计多个列的缺失值个数

print(data1[['K5-6', 'K5-5']].count())
K5-6    30
K5-5    68
dtype: int64

统计每一行的缺失值个数

# 求取每一行的缺失值个数
print(data1.count(axis=1))
0      1
1      1
2      2
3      2
4      1
      ..
619    1
620    1
621    1
622    1
623    1
Length: 624, dtype: int64

统计某一行的缺失值个数

print(data1.iloc[[0]].count())
1

统计多行的缺失值个数

print(data1.iloc[[0, 1]].count())
TIME      2
K1-1      0
K1-2      0
K1-3      0
K1-4      0
K1-5      0
K1-6      0
K2-1      0
K2-2      0
K2-3      0
K2-4      0
K2-5      0
K2-6      0
K2-7      0
K2-8      0
K3-1-1    0
K3-1-2    0
K3-2      0
K3-3      0
K3-4      0
K3-5      0
K3-6      0
K3-7      0
K3-8-1    0
K3-8-2    0
K4-1      0
K4-2      0
K4-3      0
K4-4      0
K4-5      0
K4-6      0
K4-7      0
K4-8      0
K5-1      0
K5-2      0
K5-3      0
K5-4      0
K5-5      0
K5-6      0
dtype: int64

对每一列进行求和

print(data1.sum())
TIME      2020/7/19 0:002020/7/19 1:002020/7/19 2:002020...
K1-1                                                   81.0
K1-2                                                   43.0
K1-3                                                   74.0
K1-4                                                   43.0
K1-5                                                   66.0
K1-6                                                   19.0
K2-1                                                   31.0
K2-2                                                   34.0
K2-3                                                  130.0
K2-4                                                  119.0
K2-5                                                   77.0
K2-6                                                  105.0
K2-7                                                   59.0
K2-8                                                   24.0
K3-1-1                                                 49.0
K3-1-2                                                 22.0
K3-2                                                   91.0
K3-3                                                   78.0
K3-4                                                   85.0
K3-5                                                  102.0
K3-6                                                   79.0
K3-7                                                   90.0
K3-8-1                                                 15.0
K3-8-2                                                 13.0
K4-1                                                   47.0
K4-2                                                   63.0
K4-3                                                  128.0
K4-4                                                   61.0
K4-5                                                   58.0
K4-6                                                   96.0
K4-7                                                   28.0
K4-8                                                   41.0
K5-1                                                   54.0
K5-2                                                   72.0
K5-3                                                   65.0
K5-4                                                   51.0
K5-5                                                   68.0
K5-6                                                   30.0
dtype: object

对每一行进行求和

print(data1.sum(axis=1))
0      0.0
1      0.0
2      1.0
3      1.0
4      0.0
      ... 
619    0.0
620    0.0
621    0.0
622    0.0
623    0.0
Length: 624, dtype: float64

对单独的一行或一列进行操作

# 对某一列进行求和
print(data1['K5-6'].sum())
30.0
# 对某一行进行求和
print(data1.iloc[[0]].sum())
TIME      2020/7/19 0:00
K1-1                 0.0
K1-2                 0.0
K1-3                 0.0
K1-4                 0.0
K1-5                 0.0
K1-6                 0.0
K2-1                 0.0
K2-2                 0.0
K2-3                 0.0
K2-4                 0.0
K2-5                 0.0
K2-6                 0.0
K2-7                 0.0
K2-8                 0.0
K3-1-1               0.0
K3-1-2               0.0
K3-2                 0.0
K3-3                 0.0
K3-4                 0.0
K3-5                 0.0
K3-6                 0.0
K3-7                 0.0
K3-8-1               0.0
K3-8-2               0.0
K4-1                 0.0
K4-2                 0.0
K4-3                 0.0
K4-4                 0.0
K4-5                 0.0
K4-6                 0.0
K4-7                 0.0
K4-8                 0.0
K5-1                 0.0
K5-2                 0.0
K5-3                 0.0
K5-4                 0.0
K5-5                 0.0
K5-6                 0.0
dtype: object

对多个行或多个列进行操作

# 对多个列求和
print(data1[['K5-6', 'K5-5']].sum())
K5-6    30.0
K5-5    68.0
dtype: float64
# 对多行进行求和
print(data1.iloc[[0, 1]].sum())
TIME      2020/7/19 0:002020/7/19 1:00
K1-1                               0.0
K1-2                               0.0
K1-3                               0.0
K1-4                               0.0
K1-5                               0.0
K1-6                               0.0
K2-1                               0.0
K2-2                               0.0
K2-3                               0.0
K2-4                               0.0
K2-5                               0.0
K2-6                               0.0
K2-7                               0.0
K2-8                               0.0
K3-1-1                             0.0
K3-1-2                             0.0
K3-2                               0.0
K3-3                               0.0
K3-4                               0.0
K3-5                               0.0
K3-6                               0.0
K3-7                               0.0
K3-8-1                             0.0
K3-8-2                             0.0
K4-1                               0.0
K4-2                               0.0
K4-3                               0.0
K4-4                               0.0
K4-5                               0.0
K4-6                               0.0
K4-7                               0.0
K4-8                               0.0
K5-1                               0.0
K5-2                               0.0
K5-3                               0.0
K5-4                               0.0
K5-5                               0.0
K5-6                               0.0
dtype: object

可视化分析

import matplotlib.pyplot as plt

plt.figure(figsize=(24, 7))
plt.rcParams['font.family'] = 'SimHei'

plt.bar(data1.columns, list(data1.count(axis=0)), width=1.5)
plt.title('非空值个数统计')

plt.show()

Pandas数据分析----缺失值统计与分析

import plotly as py
import plotly.graph_objs as go
pyplt = py.offline.plot
# Trace
trace_basic = [go.Bar(
            x = data1.columns,
            y = list(data1.count(axis=0)),
    )]
# Layout
layout_basic = go.Layout(
            title = '非空值个数统计')
# Figure
figure_basic = go.Figure(data = trace_basic, layout = layout_basic)
# Plot
pyplt(figure_basic, filename='./1.html')
'./1.html'

Pandas数据分析----缺失值统计与分析

上一篇:1.3 IDAE 中使用GO开发项目


下一篇:使用接口来统一控件的取值、赋值和初始化