机器学习sklearn（五）：数据处理（二）缺失值处理

2024-03-07 11:01:20

来源 https://www.cnblogs.com/B-Hanan/articles/12774433.html

1 单变量缺失

import numpy as np
from sklearn.impute import SimpleImputer

help(SimpleImputer):

class SimpleImputer(_BaseImputer):Imputation transformer for completing missing values.

Parameters(参数设置)

missing_values(缺失值类型) : number, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategy : string, default='mean'

The imputation strategy.

If "mean", then replace missing values using the mean along each column. Can only be used with numeric data.
If "median", then replace missing values using the median along each column. Can only be used with numeric data.
If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
If "constant", then replace missing values with fill_value. Can be used with strings or numeric data.strategy="constant" for fixed value imputation.

fill_value : string or numerical value, default=None

When strategy == "constant", fill_value is used to replace all occurrences of missing_values.If left to the default, fill_value will be 0 when imputing numericaldata and "missing_value" for strings or object data types.

imp=SimpleImputer(missing_values=np.nan,strategy='mean')
imp.fit([[1,2],[np.nan,3],[7,6]])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

##SimpleImputer类支持稀疏矩阵
import scipy.sparse as sp
X=sp.csc_matrix([[1,2],[0,-1],[8,4]])
imp=SimpleImputer(missing_values=-1,strategy='mean')
imp.fit(X)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=-1, strategy='mean', verbose=0)

X_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
print(imp.transform(X_test))

  (0, 0)    3.0
  (1, 0)    6.0
  (2, 0)    7.0
  (0, 1)    2.0
  (1, 1)    3.0
  (2, 1)    6.0

print(imp.transform(X_test).toarray())
[[3. 2.]
 [6. 3.]
 [7. 6.]]

import pandas as pd
df=pd.DataFrame([['a','x'],
                [np.nan,'y'],
                ['a',np.nan],
                ['b','y']],dtype='category')

df

	0	1
0	a	x
1	NaN	y
2	a	NaN
3	b	y

imp=SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

2 多元特征估计

使用IterativeImputer类，它将每一个特征的缺失值建模为其它特性的函数，并使用该估计值进行估计。
工作模式：迭代循环
在每一步，都指定一个功能列出作为输出

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp=IterativeImputer(max_iter=10,random_state=0)
imp.fit([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)

imp.transform([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])

array([[ 1.        ,  2.        ],
       [ 3.        ,  6.        ],
       [ 4.        ,  8.        ],
       [ 1.50004509,  3.        ],
       [ 7.        , 14.00004135]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(imp.transform(X_test))

[[ 1.00007297  2.        ]
 [ 6.         12.00002754]
 [ 2.99996145  6.        ]]

3 K-近邻法

这个KNNImputer类提供了使用k-最近邻方法填充缺失值的估算。默认情况下，支持缺失值的欧氏距离度量，nan_euclidean_distances，用于查找最近的邻居。每个缺失的特性都使用n_neighbors具有该功能值的最近邻居。

from sklearn.impute import KNNImputer

help(KNNImputer)：

Imputation for completing missing values using k-Nearest Neighbors.

(使用k近邻方法补全缺失值。)

Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

(每个样本的缺失值是使用在训练集中找到的最近邻居的‘n_neighbors’的平均值来推算的.如果两个都不缺少的要素都不接近，则两个样本是接近的。)

Parameters：

missing_values : number, string, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed.

n_neighbors : int, default=5 Number of neighboring samples to use for imputation.

weights : {'uniform', 'distance'} or callable, default='uniform' Weight function used in prediction.

import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
print(X)
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit_transform(X)

[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]





array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

4 标记推算值

这个MissingIndicator转换器用于将数据集转换为相应的二进制矩阵，以指示数据集中是否存在缺失值。这种转换与计算相结合是很有用的。在使用估算时，保存有关哪些值丢失的信息可以提供信息。

from sklearn.impute import MissingIndicator

help(MissingIndicator):

class MissingIndicator(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)

Binary indicators for missing values(缺失值的二进制指示符).

MissingIndicator(missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)

X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)
mask_missing_values_only

array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])

#只返回存在缺失值的列的索引
indicator.features_

array([0, 1, 2, 3])

#这个features参数可以设置为'all'若要返回所有特征，无论它们是否包含缺失的值
indicator = MissingIndicator(missing_values=-1, features="all")
mask_all = indicator.fit_transform(X)
mask_all

array([[ True,  True, False, False],
       [False,  True, False,  True],
       [False,  True, False, False]])

indicator.features_
#特征所在的列索引

array([0, 1, 2, 3])

码农公寓

1 单变量缺失

2 多元特征估计

3 K-近邻法

4 标记推算值

相关文章