1. Purpose and requirements
Understand how the k-nearest neighbors (k-NN) algorithm works and learn to apply it in practice.
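To make the principle concrete before turning to scikit-learn, here is a minimal from-scratch sketch of k-NN classification: measure the distance from the query to every training point, keep the k closest, and take a majority vote. The function name `knn_predict` and the toy data are illustrative assumptions, not part of the original experiment, and this is not how `KNeighborsClassifier` is implemented internally.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours (sorted by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among them is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # query near the first cluster
print(knn_predict(points, labels, (2.9, 3.1), k=3))  # query near the second cluster
```

There is no training step beyond storing the data; all of the work happens at prediction time, which is why k-NN is often called a lazy learner.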
2. Main content
Example: diabetes prediction
Task: predict diabetes among the Pima Indians
Data sources:
- https://www.kaggle.com/uciml/pima-indians-diabetes-database
- The pima-indians-diabetes folder inside the Experiment 1 folder
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score,precision_score, \
recall_score,f1_score,cohen_kappa_score
from collections import Counter
from sklearn.metrics import roc_curve,auc
data = pd.read_csv('./diabetes.csv')
Data exploration
data.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
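Column descriptions
- Pregnancies: number of pregnancies
- Glucose: plasma glucose concentration 2 hours into an oral glucose tolerance test
- BloodPressure: diastolic blood pressure (mm Hg)
- SkinThickness: triceps skinfold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: diabetes pedigree function (a score summarizing family history of diabetes)
- Age: age (years)
- Outcome: class label (1 = diabetic, 0 = not diabetic)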
data.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
data.shape
(768, 9)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data['Outcome'].value_counts()
0 500
1 268
Name: Outcome, dtype: int64
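Key observations
- 1. No column contains null values (768 non-null entries each). However, Glucose, BloodPressure, SkinThickness, Insulin, and BMI all have a minimum of 0, which is physiologically implausible, so those zeros most likely stand for missing measurements.
- 2. The classes are moderately imbalanced (500 negative vs. 268 positive, roughly 65% / 35%), but not severely. The data is used as-is here because the focus is the k-nearest neighbors algorithm.
- 3. This is a binary classification problem.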
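The outputs already printed above allow a quick sanity check of these points. The sketch below uses only numbers copied from `value_counts()` and `data.head()`; the variable names are illustrative, and the five head rows are of course only a small sample of the 768 rows.

```python
# Class counts copied from data['Outcome'].value_counts().
class_counts = {0: 500, 1: 268}
total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())
print({label: round(share, 3) for label, share in shares.items()})
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")

# The five rows shown by data.head(), restricted to the columns
# where a value of 0 is not physiologically plausible.
head_rows = [
    {"Glucose": 148, "BloodPressure": 72, "SkinThickness": 35, "Insulin": 0, "BMI": 33.6},
    {"Glucose": 85, "BloodPressure": 66, "SkinThickness": 29, "Insulin": 0, "BMI": 26.6},
    {"Glucose": 183, "BloodPressure": 64, "SkinThickness": 0, "Insulin": 0, "BMI": 23.3},
    {"Glucose": 89, "BloodPressure": 66, "SkinThickness": 23, "Insulin": 94, "BMI": 28.1},
    {"Glucose": 137, "BloodPressure": 40, "SkinThickness": 35, "Insulin": 168, "BMI": 43.1},
]

# Count zeros per column in this small sample.
zero_counts = {
    column: sum(1 for row in head_rows if row[column] == 0)
    for column in head_rows[0]
}
print(zero_counts)
```

On the full DataFrame, the same zero count could be obtained with an expression such as `(data[columns] == 0).sum()`, where `columns` is the list of the five column names above. Any model should beat the majority-class baseline to be considered useful.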
Model building
Feature scaling (min-max normalization)
new_data = data.drop(['Outcome'], axis=1)
scale = MinMaxScaler().fit(new_data)  # learn each feature's min and max
biao_data = scale.transform(new_data)  # rescale every feature to [0, 1]
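Scaling matters for k-NN because the prediction depends on distances: without it, a wide-range feature such as Insulin (0 to 846) would dominate a narrow one such as DiabetesPedigreeFunction (0.078 to 2.42). Min-max scaling maps each value to (x - min) / (max - min). The sketch below applies that formula by hand, using column minima and maxima copied from the `data.describe()` output; the helper name `min_max_scale` is an illustrative assumption.

```python
def min_max_scale(value, col_min, col_max):
    """Map `value` from the range [col_min, col_max] onto [0, 1]."""
    return (value - col_min) / (col_max - col_min)


# Column minima and maxima copied from data.describe().
glucose_min, glucose_max = 0.0, 199.0
age_min, age_max = 21.0, 81.0

# First row of data.head(): Glucose = 148, Age = 50.
glucose_scaled = min_max_scale(148, glucose_min, glucose_max)
age_scaled = min_max_scale(50, age_min, age_max)

print(round(glucose_scaled, 4))
print(round(age_scaled, 4))
```

After scaling, both features lie in [0, 1], so each contributes on a comparable scale to the Euclidean distance.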
Splitting into training and test sets
X_train,X_test,y_train,y_test = train_test_split(biao_data,data['Outcome'],test_size=0.2,random_state=123)
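One variation worth knowing: the code above fits the scaler on all 768 rows before splitting, so the test rows influence the minima and maxima. A common alternative is to split first and learn the scaling statistics from the training rows only. The sketch below shows that order in plain Python on made-up rows; the helpers `fit_min_max` and `apply_min_max` are hypothetical names, and this is not the procedure the original experiment used.

```python
import random


def fit_min_max(rows):
    """Learn per-column (min, max) from the given rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs."""
    return [
        [(value - lo) / (hi - lo) if hi > lo else 0.0
         for value, (lo, hi) in zip(row, params)]
        for row in rows
    ]


# Made-up toy rows with two features, e.g. (Glucose, Age).
rows = [[148, 50], [85, 31], [183, 32], [89, 21], [137, 33],
        [110, 45], [95, 27], [160, 60], [120, 38], [70, 24]]

# Shuffle the indices reproducibly, then hold out 20% as the test set.
indices = list(range(len(rows)))
random.Random(123).shuffle(indices)
n_test = round(0.2 * len(rows))
test_idx, train_idx = indices[:n_test], indices[n_test:]
train = [rows[i] for i in train_idx]
test = [rows[i] for i in test_idx]

params = fit_min_max(train)                 # statistics from training rows only
train_scaled = apply_min_max(train, params)
test_scaled = apply_min_max(test, params)   # test values may fall outside [0, 1]
```

With scikit-learn, the equivalent order would be to call `train_test_split` on the raw features, then `MinMaxScaler().fit(X_train)`, then `transform` on both `X_train` and `X_test`.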
# Model training
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_train, y_train)
KNeighborsClassifier()
y_pred = clf.predict(X_test)
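The value k = 5 above is a fixed choice. One way to see how k affects the result is to compare several values with leave-one-out evaluation: predict each point from all the others and measure the hit rate. The sketch below does this in plain Python on made-up, already-scaled toy data that contains one deliberately noisy point; the function names and data are illustrative assumptions, not part of the original experiment.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Predict each point from all the others and report the hit rate."""
    hits = 0
    for i, (query, truth) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, query, k) == truth
    return hits / len(points)


# Made-up toy data: two clusters plus one class-1 point inside the class-0 cluster.
points = [(0.10, 0.10), (0.20, 0.15), (0.15, 0.25), (0.25, 0.20), (0.30, 0.30),
          (0.80, 0.80), (0.70, 0.85), (0.85, 0.70), (0.75, 0.75), (0.90, 0.90),
          (0.22, 0.18)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

for k in (1, 3, 5, 7):
    print(k, round(leave_one_out_accuracy(points, labels, k), 3))
```

On this toy set, k = 1 is thrown off by the single noisy point while k = 3 outvotes it, which illustrates why very small k tends to be sensitive to noise. With scikit-learn, a similar comparison could be run on the real training data using `cross_val_score` from `sklearn.model_selection`.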
Model evaluation
# roc_curve receives hard 0/1 predictions here, so the AUC below equals the mean of
# recall and specificity; pass clf.predict_proba(X_test)[:, 1] to score ranked probabilities instead
fpr, tpr, threshold = roc_curve(y_test, y_pred)
print('AUC:', auc(fpr, tpr))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
print('Counter:', Counter(y_pred))
AUC: 0.7634698275862069
Accuracy: 0.7987012987012987
Precision: 0.8
Recall: 0.6206896551724138
F1 score: 0.6990291262135923
Cohen's kappa: 0.5514001127607593
Counter: Counter({0: 109, 1: 45})
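All of the scores above follow from a single confusion matrix. The original does not print one, so the counts below are inferred from the reported numbers (45 predicted positives at precision 0.8 gives 36 true positives, and so on) and should be read as a reconstruction. The sketch recomputes each metric from its definition.

```python
# Confusion-matrix counts inferred from the reported scores (not printed in the original).
tp, fp, fn, tn = 36, 9, 22, 87
n = tp + fp + fn + tn  # 154 test samples

accuracy = (tp + tn) / n
precision = tp / (tp + fp)      # of the predicted positives, how many are correct
recall = tp / (tp + fn)         # of the actual positives, how many were found
specificity = tn / (tn + fp)    # of the actual negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area reduces to the mean of recall and specificity.
auc_hard = (recall + specificity) / 2

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
p_observed = accuracy
p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)

for name, value in [("AUC", auc_hard), ("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1), ("Kappa", kappa)]:
    print(f"{name}: {value:.4f}")
```

Read this way, the model is fairly reliable when it predicts diabetes (precision 0.8) but misses about 38% of the actual positive cases (recall about 0.62), which is usually the more costly error in a screening setting.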
Summary
Honestly, I still have a long way to go with this.