【集成学习(下)】Task14 幸福感预测

文章目录

前言

一、目的和要求
理解k-近邻算法的原理,掌握k-近邻算法的应用开发。

二、主要内容
实例:糖尿病预测
任务:预测Pima 印度安人的糖尿病
数据来源:

  1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
  2. 在实验1文件夹里,pima-indians-diabetes
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score,precision_score, \
recall_score,f1_score,cohen_kappa_score
from collections import Counter
from sklearn.metrics import roc_curve,auc
data = pd.read_csv('./diabetes.csv')

数据探索

data.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

数据说明

  • Pregnancies : 怀孕次数
  • Glucose : 口服葡萄糖耐量测试中2小时的血浆葡萄糖浓度
  • BloodPressure : 舒张压(毫米汞柱)
  • SkinThickness : 三头肌皮肤褶皱厚度(毫米)
  • Insulin : 2小时血清胰岛素(mu U / ml)
  • BMI : 体重指数(体重(kg)/(身高(m))^ 2)
  • DiabetesPedigreeFunction : 糖尿病谱系功能
  • Age : 年龄
  • Outcome : 结果
data.describe()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
data.shape
(768, 9)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data['Outcome'].value_counts()
0    500
1    268
Name: Outcome, dtype: int64

小结

  • 1.数据没有缺失值
  • 2.数据类别没有不平衡的问题,数据非常好!!,一点不脏,不过这里做的是k-近邻算法
  • 3.是二分类问题

模型构建

数据标准化

new_data = data.drop([ 'Outcome'], axis=1)
scale = MinMaxScaler().fit(new_data)## 训练规则
biao_data = scale.transform(new_data) ## 应用规则

划分训练集和测试集

X_train,X_test,y_train,y_test = train_test_split(biao_data,data['Outcome'],test_size=0.2,random_state=123)
# 模型训练
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_train, y_train)
KNeighborsClassifier()
y_pred = clf.predict(X_test)

模型评估

fpr,tpr,threshold = roc_curve(y_test, y_pred)
print('数据的AUC为:',auc(fpr,tpr))
print('数据的准确率为:',accuracy_score(y_test,y_pred))
print('数据的精确率为:',precision_score(y_test,y_pred))
print('数据的召回率为:',recall_score(y_test,y_pred))
print('数据的F1值为:',f1_score(y_test,y_pred))
print('数据的Cohen’s Kappa系数为:',cohen_kappa_score(y_test,y_pred))
print('Counter:',Counter(y_pred))
数据的AUC为: 0.7634698275862069
数据的准确率为: 0.7987012987012987
数据的精确率为: 0.8
数据的召回率为: 0.6206896551724138
数据的F1值为: 0.6990291262135923
数据的Cohen’s Kappa系数为: 0.5514001127607593
Counter: Counter({0: 109, 1: 45})

总结

自己太废了
【集成学习(下)】Task14 幸福感预测

参考

  • https://zhuanlan.zhihu.com/p/25994179
  • https://www.cnblogs.com/ahu-lichang/p/7151007.html
  • https://zhuanlan.zhihu.com/p/122195108
  • Datawhale GitHub开源
上一篇:jmeter Response message: Non HTTP response message: 无法指定被请求的地址 (connect failed)


下一篇:Oracle11g中数据库表、索引、视图、同义词的管理(使用OEM工具)(一)