machine_learning

I. Artificial intelligence, machine learning, and deep learning

1. Machine learning

1.1 Classical machine learning

Small amounts of data, complex algorithms.

1.2 Neural-network-based machine learning

Massive amounts of data, simple algorithms.

  • Shallow learning
  • Deep learning

1.3 Reinforcement learning

1.4 Transfer learning

II. Basic types of machine learning

1. Supervised learning

Given known inputs and outputs, build a model that relates them, then use that model to predict the output for inputs whose output is unknown.

1.1 Regression

The output is expressed over an infinite, continuous domain.

1.2 Classification

The output is expressed over a finite, discrete domain.

2. Unsupervised learning

Given a set of inputs with no known outputs (labels), find some rule from the internal features and relationships of the data and partition the samples into groups, i.e. clustering.

3. Semi-supervised learning

Starting from a relatively limited set of known data, build a basic model with supervised learning; then, by comparing unknown inputs with known ones, infer their outputs and thereby extend the known domain.

4. The basic machine learning workflow

  • Data collection -> data cleaning -> data preprocessing -> model selection -> model training -> model testing -> model deployment
  • Raw material -> remove impurities -> preparation -> algorithm -> rules -> validation -> production use

5. Data preprocessing

The inputs form the sample matrix (one row per sample, one column per feature), and the output labels y line up with the rows:

x x x x x        y
x x x x x   ->   y
x x x x x        y

Name       Age  Height  Weight  ...
Zhang Fei  22   1.75    60
Zhao Yun   20   1.80    70

1. Mean removal (standardization)

To unify the baseline value and the spread of the different features in a sample matrix, shift each feature's mean to 0 and scale its standard deviation to 1; this process is called mean removal (standardization).

For one feature column with values a, b, c, the mean is

$m = \frac{a+b+c}{3}$

Subtracting the mean gives a-m, b-m, c-m, whose mean is

$m' = \frac{(a-m)+(b-m)+(c-m)}{3} = \frac{a+b+c}{3} - \frac{3m}{3} = 0$

Let the centered values be A, B, C. Their standard deviation is

$s = \sqrt{\frac{A^2+B^2+C^2}{3}}$

Dividing each by s gives A/s, B/s, C/s, whose standard deviation is

$s' = \sqrt{\frac{A^2/s^2 + B^2/s^2 + C^2/s^2}{3}} = \sqrt{\frac{(A^2+B^2+C^2)/3}{s^2}} = \sqrt{\frac{s^2}{s^2}} = 1$

sklearn.preprocessing.scale(raw sample matrix) -> mean-removed sample matrix

Code: std.py

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])
print(raw_samples)
print(raw_samples.mean(axis=0))
print(raw_samples.std(axis=0))
std_samples = raw_samples.copy()
for col in std_samples.T:
    col_mean = col.mean()
    col_std = col.std()
    col -= col_mean
    col /= col_std

print(std_samples)
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))

std_samples = sp.scale(raw_samples)
print(std_samples)
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))

2. Range scaling (min-max scaling)

Unify the minimum and maximum range of the different features in the sample matrix. Each feature is mapped by a linear function y = kx + b, where k and b are solved from:
k·min + b = min'
k·max + b = max'

sklearn.preprocessing.MinMaxScaler(feature_range=desired (min, max)) -> min-max scaler
min-max scaler.fit_transform(raw sample matrix) -> range-scaled sample matrix
Code: mms.py

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])
print(raw_samples)
mms_samples = raw_samples.copy()
for col in mms_samples.T:
    col_min = col.min()
    col_max = col.max()
    a = np.array([
        [col_min, 1],
        [col_max, 1],
    ])
    b = np.array([0, 1])
    x = np.linalg.lstsq(a, b, rcond=None)[0]
    col *= x[0]
    col += x[1]
print(mms_samples)

mms = sp.MinMaxScaler(feature_range=(0, 1))
mms_samples = mms.fit_transform(raw_samples)
print(mms_samples)

3. Normalization

To express features as proportions, divide each sample's feature values by the sum of the absolute values of that sample's features, so that each sample's absolute feature values sum to 1.
Note: this is applied to each row (sample).

Date Python Java C/C++ PHP Rate
2016 30 50 40 20 30/140
2017 20 30 20 10 20/80

sklearn.preprocessing.normalize(raw sample matrix, norm="l1") -> normalized sample matrix
l1 means the L1 norm: the sum of the absolute values of a vector's elements.
Code: nor.py

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])
print(raw_samples)
nor_samples = raw_samples.copy()
for row in nor_samples:
    row_absum = abs(row).sum()
    row /= row_absum
print(nor_samples)

nor_samples = sp.normalize(raw_samples, norm="l1")
print(nor_samples)

4. Binarization

Use 0 and 1 to mark the elements of the sample matrix that fall below or above a given threshold.
sklearn.preprocessing.Binarizer(threshold=threshold) -> binarizer
binarizer.transform(raw sample matrix) -> binarized sample matrix

Code: bin.py

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]
])
print(raw_samples)
bin_samples = raw_samples.copy()
bin_samples[bin_samples <= 1.4] = 0
bin_samples[bin_samples > 1.4] = 1
print(bin_samples)

bin = sp.Binarizer(threshold=1.4)
bin_samples = bin.transform(raw_samples)
print(bin_samples)

5. One-hot encoding

Each column is encoded independently: a column with n distinct values maps each value to an n-bit code containing a single 1. For the matrix

1 3 2
7 5 4
1 8 6
7 3 9

the per-column code tables are

1:10  3:100  2:1000
7:01  5:010  4:0100
      8:001  6:0010
             9:0001

and the encoded sample matrix is

1 0 1 0 0 1 0 0 0
0 1 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1 0
0 1 1 0 0 0 0 0 1

sklearn.preprocessing.OneHotEncoder(sparse=whether to use sparse format, dtype=element type) -> one-hot encoder
one-hot encoder.fit_transform(raw sample matrix) -> one-hot encoded sample matrix, building the code-table dictionary at the same time
one-hot encoder.transform(raw sample matrix) -> one-hot encoded sample matrix, using the existing code-table dictionary

Code: ohe.py

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [1, 3, 2],
    [7, 5, 4],
    [1, 8, 6],
    [7, 3, 9]
])
print(raw_samples)
code_tables = []
for col in raw_samples.T:
    code_table = {}
    for val in col:
        code_table[val] = None
    code_tables.append(code_table)
for code_table in code_tables:
    size = len(code_table)
    for one, key in enumerate(sorted(code_table.keys())):
        code_table[key] = np.zeros(shape=size, dtype=int)
        code_table[key][one] = 1
ohe_samples = []
for raw_sample in raw_samples:
    ohe_sample = np.array([], dtype=int)
    for i, key in enumerate(raw_sample):
        ohe_sample = np.hstack((ohe_sample, code_tables[i][key]))
    ohe_samples.append(ohe_sample)
ohe_samples = np.array(ohe_samples)
print(ohe_samples)

ohe = sp.OneHotEncoder(sparse=False, dtype=int)
ohe_samples = ohe.fit_transform(raw_samples) # build the code tables
print(ohe_samples)

new_sample = np.array([1, 5, 6])
ohe_sample = ohe.transform([new_sample]) # reuse the existing code tables
print(ohe_sample)

6. Label encoding

Map string-valued features to integers.
sklearn.preprocessing.LabelEncoder() -> label encoder
label encoder.fit_transform(raw samples) -> encoded samples, building the code dictionary
label encoder.transform(raw samples) -> encoded samples, using the code dictionary
label encoder.inverse_transform(encoded samples) -> raw samples, using the code dictionary

Code: lab.py

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array(["audi", "ford", "audi", "toyota", "ford", "bmw", "toyota", "ford", "audi"])
print(raw_samples)
lbe = sp.LabelEncoder()
lbe_samples = lbe.fit_transform(raw_samples)
print(lbe_samples)
raw_samples = lbe.inverse_transform(lbe_samples)
print(raw_samples)

6. Linear regression

m input samples -> m output labels
x1 -> y1
x2 -> y2
x3 -> y3

xm -> ym

xk + b -> y

1. Prediction function

The mathematical function that relates the output to the input:
y = kx + b
Here k and b are the model parameters, learned from the known input samples and their corresponding output labels.

2. Mean squared error

The average of the squared differences between each known sample's actual output label and the label predicted by the model.
kx1 + b = y1’
kx2 + b = y2’
kx3 + b = y3’

kxm + b = ym’
$\frac{(y_1-y_1')^2 + (y_2-y_2')^2 + (y_3-y_3')^2 + \cdots + (y_m-y_m')^2}{m}$
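As a quick sanity check (not part of the original notes, and using made-up numbers), the formula above is exactly what sklearn.metrics.mean_squared_error computes:

import numpy as np
import sklearn.metrics as sm

# made-up actual labels and model predictions, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
pred_y = np.array([2.8, 5.3, 6.5, 9.4])

# mean squared error computed directly from the formula
mse_manual = ((y - pred_y) ** 2).sum() / len(y)
# the same quantity via sklearn
mse_sklearn = sm.mean_squared_error(y, pred_y)
print(mse_manual, mse_sklearn)  # the two values agree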

3. Cost function

Viewing the mean squared error as a function of the model parameters gives the cost function, written J(k, b).
The essence of linear regression is to find the model parameters that minimize the cost function J(k, b).
$J(k, b) = \frac{(y_1-y_1')^2 + (y_2-y_2')^2 + (y_3-y_3')^2 + \cdots + (y_m-y_m')^2}{m}$

4. Gradient descent

loss = J(k, b)
Starting from initial values of k and b, repeatedly adjust both parameters in the direction of the negative gradient of J(k, b) until the loss stops decreasing; the resulting k and b (approximately) minimize the cost. (A minimal sketch follows.)
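The notes do not include code for this step; the sketch below is an assumption (not the original author's implementation) of plain batch gradient descent for y = kx + b on a few made-up points:

import numpy as np

# made-up training data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

k, b = 0.0, 0.0          # initial model parameters
lr = 0.01                # learning rate
for _ in range(1000):
    pred = k * x + b
    # partial derivatives of J(k, b) = mean((y - pred)^2)
    grad_k = -2 * ((y - pred) * x).mean()
    grad_b = -2 * (y - pred).mean()
    k -= lr * grad_k     # step against the gradient
    b -= lr * grad_b
print(k, b)              # approaches the underlying slope and intercept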

5. API

sklearn.linear_model.LinearRegression() -> linear regressor
linear regressor.fit(input samples, output labels)
linear regressor.predict(input samples) -> predicted output labels

Code: line.py

import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
with open("D:/pythonStudy/AI/notes/single.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
model = lm.LinearRegression()
model.fit(x, y)
pred_y = model.predict(x)
print(sm.mean_absolute_error(y, pred_y)) # mean absolute error
print(sm.mean_squared_error(y, pred_y)) # mean squared error
print(sm.median_absolute_error(y, pred_y)) # median absolute error
print(sm.r2_score(y, pred_y)) # R2 score: a normalized overall metric, the closer to 1 the better

mp.figure("Linear Regression", facecolor="lightgray")
mp.title("Linear Regression", fontsize=20)
mp.xlabel("X", fontsize=14)
mp.ylabel("Y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(x, y, c="dodgerblue", alpha=0.75, s=60, label="Sample")
sorted_indices = x.T[0].argsort()
mp.plot(x[sorted_indices], pred_y[sorted_indices], "o-", c="orangered", label="Regression")

mp.legend()

mp.show()

6. Model reuse

Use pickle to write the in-memory model object to a disk file, or load it back from disk into memory, so a trained model can be saved and reused.

Code: save.py

import pickle
import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm

x, y = [], []
with open("D:/pythonStudy/AI/notes/single.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
model = lm.LinearRegression()
model.fit(x, y)
pred_y = model.predict(x)
print(sm.mean_absolute_error(y, pred_y)) # mean absolute error
print(sm.mean_squared_error(y, pred_y)) # mean squared error
print(sm.median_absolute_error(y, pred_y)) # median absolute error
print(sm.r2_score(y, pred_y)) # R2 score: a normalized overall metric, the closer to 1 the better

with open("./linear.pkl", "wb") as f:
    pickle.dump(model, f)

Code: load.py

import pickle
import numpy as np
import sklearn.metrics as sm

x, y = [], []
with open("D:/pythonStudy/AI/notes/single.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
with open("./linear.pkl", "rb") as f:
    model = pickle.load(f)
pred_y = model.predict(x)
print(sm.mean_absolute_error(y, pred_y)) # mean absolute error
print(sm.mean_squared_error(y, pred_y)) # mean squared error
print(sm.median_absolute_error(y, pred_y)) # median absolute error
print(sm.r2_score(y, pred_y)) # R2 score: a normalized overall metric, the closer to 1 the better

7. Ridge regression

loss = J(k, b) + regularization term(model weights) × regularization strength (penalty coefficient)
sklearn.linear_model.Ridge(regularization strength, fit_intercept=whether to fit the intercept, max_iter=maximum number of iterations) -> ridge regressor

Code: rdg.py

import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
with open("D:/pythonStudy/AI/notes/abnormal.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
model_ln = lm.LinearRegression()
model_ln.fit(x, y)
pred_y_ln = model_ln.predict(x)
model_rd = lm.Ridge(300, fit_intercept=True, max_iter=10000)
model_rd.fit(x, y)
pred_y_rd = model_rd.predict(x)

mp.figure("Ridge Regression", facecolor="lightgray")
mp.title("Ridge Regression", fontsize=20)
mp.xlabel("X", fontsize=14)
mp.ylabel("Y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(x, y, c="dodgerblue", alpha=0.75, s=60, label="Sample")
sorted_indices = x.T[0].argsort()
mp.plot(x[sorted_indices], pred_y_ln[sorted_indices], "o-", c="orangered", label="Linear")
mp.plot(x[sorted_indices], pred_y_rd[sorted_indices], "o-", c="limegreen", label="Ridge")

mp.legend()

mp.show()

8. Underfitting and overfitting

Underfitting: the model's predictions have large errors on both the training data and the test data.
Overfitting: the model is very accurate on the training data but performs poorly on the test data; it is too specialized and does not generalize.
Underfitting <-- model complexity --> overfitting (see the sketch below)
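A small synthetic illustration, not from the original notes: fit polynomials of degree 1 and degree 12 to noisy quadratic data and compare the train and test r2 scores. The low degree typically scores poorly on both sets (underfitting), while the high degree typically scores well on the training data but noticeably worse on the held-out data (overfitting).

import numpy as np
import sklearn.pipeline as pl
import sklearn.preprocessing as sp
import sklearn.linear_model as lm
import sklearn.model_selection as ms
import sklearn.metrics as sm

# noisy quadratic data, for illustration only
np.random.seed(7)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + np.random.normal(0, 0.4, 60)
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.3, random_state=7)

for degree in (1, 12):
    model = pl.make_pipeline(sp.PolynomialFeatures(degree), lm.LinearRegression())
    model.fit(train_x, train_y)
    print(degree,
          sm.r2_score(train_y, model.predict(train_x)),   # training score
          sm.r2_score(test_y, model.predict(test_x)))     # test score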

9. Polynomial regression

x -> y, y = kx + b
$x,\ x^2 \rightarrow y$, $y = k_1x^2 + k_2x + b$
$x,\ x^2,\ x^3 \rightarrow y$, $y = k_1x^3 + k_2x^2 + k_3x + b$

sklearn.preprocessing.PolynomialFeatures(highest degree) -> polynomial feature expander
sklearn.pipeline.make_pipeline(polynomial feature expander, linear regressor) -> polynomial regressor
x --> polynomial feature expander --> x, x^2, x^3 ... --> linear regressor --> k1, k2, k3, ..., b

Code: poly.py

import numpy as np
import sklearn.pipeline as pl
import sklearn.preprocessing as sp
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

N = 10
train_x, train_y = [], []
with open("D:/pythonStudy/AI/notes/single.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        train_x.append(data[:-1])
        train_y.append(data[-1])
train_x = np.array(train_x)
train_y = np.array(train_y)
model = pl.make_pipeline(sp.PolynomialFeatures(N), lm.LinearRegression())
model.fit(train_x, train_y)

pred_train_y = model.predict(train_x)
print(sm.r2_score(train_y, pred_train_y))
test_x = np.linspace(train_x.min(), train_x.max(), 1000)[:, np.newaxis]
pred_test_y = model.predict(test_x)

mp.figure("Polynomial Regression", facecolor="lightgray")
mp.title("Polynomial Regression", fontsize=20)
mp.xlabel("X", fontsize=14)
mp.ylabel("Y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(train_x, train_y, c="dodgerblue", alpha=0.75, s=60, label="Sample")
mp.plot(test_x, pred_test_y, c="orangered", label="Regression")

mp.legend()

mp.show()

10. Decision trees

Similar inputs tend to have similar outputs.

Feature encodings:

0-associate degree   0-ordinary school   0-female   0-poor       0-low
1-bachelor           1-985 school        1-male     1-pass       1-medium
2-master             2-211 school                   2-good       2-high
3-doctorate                                          3-excellent

Education  School  Gender  Grade  Salary
1          0       1       2      8000
0          0       0       2      7000
3          1       1       3      20000
1          1       0       1      ?

Regression: take the average
Classification: take a vote
Optimizations:

  1. Based on domain knowledge, prefer a limited set of main features when splitting sub-tables, which lowers the height of the decision tree.
  2. Using Shannon entropy, compute the information-entropy reduction obtained by splitting the sub-table on each feature, and split first on the feature with the largest entropy reduction.
  3. Ensemble methods: build multiple decision trees in different ways and combine their predictions, by averaging or voting, to produce the final prediction.
    A. Bootstrap aggregating (bagging): with replacement, randomly draw n samples from the m available and build one decision tree; repeat the process b times to obtain b trees. The final prediction is the average or the vote of the trees' predictions.
    B. Random forest: goes one step beyond bagging by also bootstrapping the features, i.e. each tree is built from a random subset of the features rather than all of them, which prevents a few dominant features from skewing the prediction.
    C. Boosting: initialize weights for the m samples, build a decision tree from the weighted samples, predict on the training set, increase the weights of the mis-predicted samples, and build the next tree; repeat to obtain b trees.

sklearn.tree.DecisionTreeRegressor() -> decision tree regressor
sklearn.ensemble.AdaBoostRegressor(base regressor, n_estimators=number of estimators, random_state=random seed) -> boosted (AdaBoost) regressor
sklearn.ensemble.RandomForestRegressor(max_depth=maximum tree depth, n_estimators=number of estimators, min_samples_split=minimum samples required to split) -> random forest regressor

Code: house.py

import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import sklearn.metrics as sm

housing = sd.load_boston()
# print(housing.feature_names)
# print(housing.data.shape) # samples
# print(housing.target.shape) # outputs
x, y = su.shuffle(housing.data, housing.target, random_state=7) # shuffle inputs and outputs together
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = x[:train_size], x[train_size:], y[:train_size], y[train_size:]

model = st.DecisionTreeRegressor(max_depth=4) # max_depth: maximum tree depth
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400,
    random_state=7
)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
for test, pred_test in zip(test_y, pred_test_y):
    print(test, "-->", pred_test)

decision tree model.feature_importances_ : feature importances
Code: fi.py

import numpy as np
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import matplotlib.pyplot as mp

housing = sd.load_boston()
feature_names = housing.feature_names
x, y = su.shuffle(housing.data, housing.target, random_state=7) # shuffle inputs and outputs together
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = x[:train_size], x[train_size:], y[:train_size], y[train_size:]

model = st.DecisionTreeRegressor(max_depth=4) # max_depth: maximum tree depth
model.fit(train_x, train_y)
fi_dt = model.feature_importances_


model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400,
    random_state=7
)
model.fit(train_x, train_y)
fi_ab = model.feature_importances_

mp.figure("Feature Importance", facecolor="lightgray")
mp.subplot(211)
mp.title("Decision Tree", fontsize=16)
mp.ylabel("Importance", fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis="y", linestyle=":")
sorted_indices = fi_dt.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(
    pos, fi_dt[sorted_indices],
    facecolor="deepskyblue",
    edgecolor="steelblue",
)
mp.xticks(pos, feature_names[sorted_indices], rotation=30)

mp.subplot(212)
mp.title("AdaBoost Decision Tree", fontsize=16)
mp.xlabel("Feature", fontsize=12)
mp.ylabel("Importance", fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis="y", linestyle=":")
sorted_indices = fi_ab.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(
    pos, fi_ab[sorted_indices],
    facecolor="lightcoral",
    edgecolor="indianred",
)
mp.xticks(pos, feature_names[sorted_indices], rotation=30)

mp.tight_layout()
mp.get_current_fig_manager().window.state("zoomed") # maximize the window
mp.show()

Code: bike.py

import csv
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp

def read_data(filename):
    with open(filename, "r") as f:
        reader = csv.reader(f)
        x, y = [], []
        for row in reader:
            x.append(row[2:13])
            y.append(row[-1])
    feature_name = np.array(x[0])
    x = np.array(x[1:], dtype=float)
    y = np.array(y[1:], dtype=float)
    x, y = su.shuffle(x, y, random_state=7)
    return x, y, feature_name

def fit_pred_score(x, y):
    train_size = int(len(x) * 0.9)
    train_x, test_x, train_y, test_y = x[:train_size], x[train_size:], y[:train_size], y[train_size:]

    model = se.RandomForestRegressor(max_depth=10, n_estimators=1000, min_samples_split=2)
    model.fit(train_x, train_y)
    feature_name = model.feature_importances_
    pred_test_y = model.predict(test_x)
    r2_score = sm.r2_score(test_y, pred_test_y)
    return r2_score, feature_name

def plot_pic(subNum, title, y, fc, ec, feature):
    mp.subplot(subNum)
    mp.title(title, fontsize=16)
    mp.ylabel("Importance", fontsize=12)
    mp.tick_params(labelsize=10)
    mp.grid(axis="y", linestyle=":")
    sorted_indices = y.argsort()[::-1]
    x = np.arange(sorted_indices.size)
    mp.bar(
        x, y[sorted_indices],
        facecolor=fc,
        edgecolor=ec,
    )
    mp.xticks(x, feature[sorted_indices], rotation=30)
    mp.legend()

if __name__ == "__main__":
    xd, yd, fn_dyd = read_data("D:/pythonStudy/AI/notes/bike_day.csv")
    dr2_score, fi_dyd = fit_pred_score(xd, yd)
    xh, yh, fn_dyh = read_data("D:/pythonStudy/AI/notes/bike_hours.csv")
    hr2_score, fi_dyh = fit_pred_score(xh, yh)
    mp.figure("Feature Importance", facecolor="lightgray")
    plot_pic(
        y=fi_dyd,
        subNum=211,
        title="Bike Of Day",
        fc="deepskyblue",
        ec="steelblue",
        feature=fn_dyd
    )
    plot_pic(
        y=fi_dyh,
        subNum=212,
        title="Bike Of Hours",
        fc="lightcoral",
        ec="indianred",
        feature=fn_dyh
    )
    mp.tight_layout()
    mp.show()

III. A simple classifier

Input  Output
3 1 -> 0
2 5 -> 1
1 8 -> 1
6 4 -> 0
5 2 -> 0
3 5 -> 1
4 7 -> 1
4 -1 -> 0
7 5 -> ?

Code: simple.py

import numpy as np
import matplotlib.pyplot as mp

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]
])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.05
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.05
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = np.zeros(len(flat_x), dtype=int)
flat_y[flat_x[:, 0] < flat_x[:, 1]] = 1
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure("Simple Classfication", facecolor="lightgray")
mp.title("Simple Classfication", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"

)
mp.scatter(
    x[:, 0], x[:, 1],
    c=y,
    cmap="brg",
    s=60
)
mp.show()

IV. Logistic classification

1. Prediction function

$g(z) = \frac{1}{1 + e^{-z}}$
$z = k_1x_1 + k_2x_2 + b$

2. Cost function

Cross-entropy error:
$J(k_1, k_2, b) = \frac{\sum\left(-y\log_2 y' - (1-y)\log_2(1-y')\right)}{m} +$ regularization term($\|k_1, k_2, b\|$) × regularization strength
sklearn.linear_model.LogisticRegression(solver="liblinear", C=regularization strength)

Code: log.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.linear_model as lm

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]
])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
model = lm.LogisticRegression(solver="liblinear", C=1)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.05
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.05
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure("Logistic Classfication", facecolor="lightgray")
mp.title("Logistic Classfication", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"

)
mp.scatter(
    x[:, 0], x[:, 1],
    c=y,
    cmap="brg",
    s=60
)
mp.show()

Multi-class classification

Train one binary classifier per class (one vs. the rest). For a new sample, each classifier outputs a probability, and the class whose classifier gives the highest probability wins:

        A    B    C
-> A 1  0.9  0.1  0.3  => A
-> B 0  0.3  0.6  0.4  => B
-> C 0  0.1  0.2  0.6  => C

Code: mlog.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.linear_model as lm

x = np.array([
    [4, 7],
    [3.5, 8],
    [3.1, 6.2],
    [0.5, 1],
    [1, 2],
    [1.2, 1.9],
    [6, 2],
    [5.7, 1.5],
    [5.4, 2.2]
])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
model = lm.LogisticRegression(solver="liblinear", C=100)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure("Logistic Classfication", facecolor="lightgray")
mp.title("Logistic Classfication", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"

)
mp.scatter(
    x[:, 0], x[:, 1],
    c=y,
    cmap="brg",
    s=60
)
mp.show()

V. Naive Bayes classification

x x … x -> 0
x x … x -> 1
x x … x -> 0
x x … x -> 0
x x … x -> 1
x x … x -> 2
x x … x -> 1
x x … x -> 0
x x … x -> 2

x x … x -> 0 0.8
x x … x -> 0 0.9 *
x x … x -> 0 0.7

1. Bayes' theorem

$P(A|B) = \frac{P(A)\,P(B|A)}{P(B)}$
the probability that event A occurs given that event B has occurred.
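A tiny worked example with made-up numbers: if $P(A) = 0.3$, $P(B|A) = 0.6$ and $P(B) = 0.4$, then $P(A|B) = \frac{0.3 \times 0.6}{0.4} = 0.45$.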

2. Naive Bayes classification

Find the probability that sample X belongs to class C, i.e. the probability that the class is C given that sample X is observed.
$P(C|X) = \frac{P(C) \cdot P(X|C)}{P(X)}$
$P(C) \cdot P(X|C) = P(C, X) = P(C, x_1, x_2, \dots, x_n) = P(x_1, x_2, \dots, x_n, C)$
$= P(x_1|x_2, \dots, x_n, C) \cdot P(x_2, \dots, x_n, C)$
$= P(x_1|x_2, \dots, x_n, C) \cdot P(x_2|x_3, \dots, x_n, C) \cdot P(x_3, \dots, x_n, C)$
$= P(x_1|x_2, \dots, x_n, C) \cdot P(x_2|x_3, \dots, x_n, C) \cdot P(x_3|x_4, \dots, x_n, C) \cdots P(C)$
$\because$ "naive" means assuming conditional independence: the sample's features are unrelated and place no conditional constraints on each other.
$\therefore = P(x_1|C) \cdot P(x_2|C) \cdot P(x_3|C) \cdots P(C)$
$\therefore$ The probability that sample X belongs to class C is proportional to the probability of class C multiplied by the product of the probabilities of each of X's feature values given class C.

Code: nb.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.naive_bayes as nb

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple1.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)

model = nb.GaussianNB()
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure("Naive Bayes Classfication", facecolor="lightgray")
mp.title("Naive Bayes Classfication", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"
)
mp.scatter(
    x[:, 0], x[:, 1],
    c=y,
    cmap="brg",
    s=60
)
mp.show()

3. Splitting the training and test sets

sklearn.model_selection.train_test_split(inputs, outputs, test_size=test set fraction, random_state=random seed) -> training inputs, test inputs, training outputs, test outputs

Code: split.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple1.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)

train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y,
    test_size=0.25,
    random_state=7
)
model = nb.GaussianNB()
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
true_rate = (pred_test_y == test_y).sum() / pred_test_y.size # accuracy
print(true_rate)

4. Cross-validation

4.1 Precision and recall


Precision: $\frac{\text{number of samples correctly identified as a class}}{\text{number of samples identified as that class}}$
Correctness: of the samples labeled as this class, how many are right.
Recall: $\frac{\text{number of samples correctly identified as a class}}{\text{actual number of samples of that class}}$
Completeness: of the samples that truly belong to this class, how many were found.
$f1\_score = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
f1_score ranges from 0 (poor) to 1 (good). (A small sketch follows.)
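A small sketch with made-up labels (not from the original notes) showing that the three definitions above match the corresponding sklearn.metrics functions:

import numpy as np
import sklearn.metrics as sm

# made-up actual and predicted class labels for a binary problem
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

p = sm.precision_score(y_true, y_pred)   # correctly identified as 1 / identified as 1
r = sm.recall_score(y_true, y_pred)      # correctly identified as 1 / actually 1
f1 = sm.f1_score(y_true, y_pred)
print(p, r, f1, 2 * p * r / (p + r))     # the last two values agree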

sklearn.model_selection.cross_val_score(classifier, inputs, outputs, cv=number of folds, scoring=metric name) -> array of metric values
ms.cross_val_score(model, x, y, cv=5, scoring="f1_weighted") -> [0.6 0.8 0.4 0.7 0.6]

Code: cv.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple1.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)

train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y,
    test_size=0.25,
    random_state=7
)
model = nb.GaussianNB()
# decide on the algorithm (evaluate it before training)
prec_cross_score = ms.cross_val_score(model, x, y, cv=5, scoring="precision_weighted")
recall_cross_score = ms.cross_val_score(model, x, y, cv=5, scoring="recall_weighted")
f1_cross_score = ms.cross_val_score(model, x, y, cv=5, scoring="f1_weighted")
print(prec_cross_score.mean())
print(recall_cross_score.mean())
print(f1_cross_score.mean())

5. Confusion matrix

Rows are the actual classes, columns the predicted classes.
$\begin{matrix} & 0 & 1 & 2 \\ 0 & 45 & 4 & 3 \\ 1 & 11 & 56 & 2 \\ 2 & 5 & 6 & 49 \end{matrix}$
sklearn.metrics.confusion_matrix(actual outputs, predicted outputs) -> confusion matrix

Code: cm.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple1.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)

train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y,
    test_size=0.25,
    random_state=7
)
model = nb.GaussianNB()
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
cm = sm.confusion_matrix(test_y, pred_test_y)
print(cm)

l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure("Naive Bayes Classfication", facecolor="lightgray")
mp.title("Naive Bayes Classfication", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"
)
mp.scatter(
    test_x[:, 0], test_x[:, 1],
    c=test_y,
    cmap="brg",
    s=60
)

mp.figure("Confusion Matrix", facecolor="lightgray")
mp.title("Confusion Matrix", fontsize=20)
mp.xlabel("Predicted Class", fontsize=14)
mp.ylabel("True Class", fontsize=14)
mp.tick_params(labelsize=10)
mp.imshow(cm, interpolation="nearest", cmap="jet")
mp.show()

6. Classification report

sklearn.metrics.classification_report(actual outputs, predicted outputs) -> classification report

Code: cr.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple1.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)

train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y,
    test_size=0.25,
    random_state=7
)
model = nb.GaussianNB()
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
cm = sm.confusion_matrix(test_y, pred_test_y)
print(cm)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)

VI. Random forest classification

1. Evaluating car quality

Code: car.py

import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms

data = []
with open("D:/pythonStudy/AI/notes/car.txt", "r") as f:
    for line in f.readlines():
        data.append(line[:-1].split(","))
data = np.array(data).T
encoders, train_x = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        train_x.append(encoder.fit_transform(data[row]))
    else:
        train_y = encoder.fit_transform(data[row])
    encoders.append(encoder)
train_x = np.array(train_x).T
model = se.RandomForestClassifier(max_depth=8, n_estimators=200, random_state=7)
f1_score = ms.cross_val_score(model, train_x, train_y, cv=5, scoring="f1_weighted").mean()
print(f1_score)

model.fit(train_x, train_y)
data = [
    ["high", "med", "5more", "4", "big", "low", "unacc"],
    ["high", "high", "4", "4", "med", "med", "acc"],
    ["low", "low", "2", "4", "small", "high", "good"],
    ["low", "med", "3", "4", "med", "high", "vgood"]
]
data = np.array(data).T
test_x = []
for row in range(len(data)):
    encoder = encoders[row]
    if row < len(data) - 1:
        test_x.append(encoder.transform(data[row]))
    else:
        test_y = encoder.transform(data[row])
test_x = np.array(test_x).T
pred_test_y = model.predict(test_x)
print(encoders[-1].inverse_transform(pred_test_y))

2. Validation curves

f1_score = f(model hyperparameters)
The peak of the validation curve points to relatively good hyperparameter values.
model = se.RandomForestClassifier(max_depth=8, n_estimators=200, random_state=7)
model = se.RandomForestClassifier(max_depth=8, random_state=7)
sklearn.model_selection.validation_curve(model, x, y, "n_estimators", [100, 200, 300, ...], cv=5) -> training score matrix, test score matrix
(one row per hyperparameter value, one column per cross-validation fold)
$\begin{matrix} & 1 & 2 & 3 & 4 & 5 \\ 100 & 0.7 & 0.9 & 0.6 & 0.8 & 0.7 \\ 200 & & & & & \\ 300 & & & & & \\ \dots \end{matrix}$

Code: vc.py

import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as mp

data = []
with open("D:/pythonStudy/AI/notes/car.txt", "r") as f:
    for line in f.readlines():
        data.append(line[:-1].split(","))
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
model = se.RandomForestClassifier(max_depth=8, random_state=7)
n_estimators = np.linspace(20, 200, 10).astype(int)
train_scores1, test_scores1 = ms.validation_curve(
    estimator=model,
    X=x,
    y=y,
    param_name="n_estimators",
    param_range=n_estimators,
    cv=5
)
train_means1 = train_scores1.mean(axis=1) # axis=1: average across CV folds (per row)
train_std1 = train_scores1.std(axis=1)
test_means1 = test_scores1.mean(axis=1) # axis=1: average across CV folds (per row)
test_std1 = test_scores1.std(axis=1)

model = se.RandomForestClassifier(n_estimators=140, random_state=7)
max_depth = np.linspace(1, 10, 11).astype(int)
train_scores2, test_scores2 = ms.validation_curve(
    estimator=model,
    X=x,
    y=y,
    param_name="max_depth",
    param_range=max_depth,
    cv=5
)
train_means2 = train_scores2.mean(axis=1) # axis=1: average across CV folds (per row)
train_std2 = train_scores2.std(axis=1)
test_means2 = test_scores2.mean(axis=1) # axis=1: average across CV folds (per row)
test_std2 = test_scores2.std(axis=1)

mp.figure("Validation Curve", facecolor="lightgray")
mp.subplot(121)
mp.title("Validation Curve", fontsize=16)
mp.xlabel("n_estimators", fontsize=12)
mp.ylabel("f1_score", fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.fill_between(
    n_estimators,
    train_means1 - train_std1,
    train_means1 + train_std1,
    color="dodgerblue",
    alpha=0.25
)
mp.fill_between(
    n_estimators,
    test_means1 - test_std1,
    test_means1 + test_std1,
    color="orangered",
    alpha=0.25
)
mp.plot(
    n_estimators,
    train_means1,
    "o-",
    c="dodgerblue",
    label="Training",
)
mp.plot(
    n_estimators,
    test_means1,
    "o-",
    c="orangered",
    label="Testing"
)
mp.legend()

mp.subplot(122)
mp.title("Validation Curve", fontsize=16)
mp.xlabel("max_depth", fontsize=12)
mp.ylabel("f1_score", fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.fill_between(
    max_depth,
    train_means2 - train_std2,
    train_means2 + train_std2,
    color="dodgerblue",
    alpha=0.25
)
mp.fill_between(
    max_depth,
    test_means2 - test_std2,
    test_means2 + test_std2,
    color="orangered",
    alpha=0.25
)
mp.plot(
    max_depth,
    train_means2,
    "o-",
    c="dodgerblue",
    label="Training"
)
mp.plot(
    max_depth,
    test_means2,
    "o-",
    c="orangered",
    label="Testing"
)
mp.legend()
mp.tight_layout()

mp.show()

3. Learning curves

f1_score = f(training set size)
sklearn.model_selection.learning_curve(model, x, y, training size array, cv=5) -> training size array, training score matrix, test score matrix
Used to choose a suitable training set size.

Code: lc.py

import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as mp

data = []
with open("D:/pythonStudy/AI/notes/car.txt", "r") as f:
    for line in f.readlines():
        data.append(line[:-1].split(","))
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
model = se.RandomForestClassifier(max_depth=9, n_estimators=140, random_state=7)
train_sizes = np.linspace(100, 1000, 10).astype(int)
# train_sizes: the training set sizes to evaluate
train_sizes, train_scores, test_scores = ms.learning_curve(
    estimator=model,
    X=x,
    y=y,
    train_sizes=train_sizes,
    cv=5
)

train_means = train_scores.mean(axis=1) # axis=1: average across CV folds (per row)
train_std = train_scores.std(axis=1)
test_means = test_scores.mean(axis=1) # axis=1: average across CV folds (per row)
test_std = test_scores.std(axis=1)

mp.figure("Learning Curve", facecolor="lightgray")
mp.title("Learning Curve", fontsize=16)
mp.xlabel("Train size", fontsize=12)
mp.ylabel("f1_score", fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.fill_between(
    train_sizes,
    train_means - train_std,
    train_means + train_std,
    color="dodgerblue",
    alpha=0.25
)
mp.fill_between(
    train_sizes,
    test_means - test_std,
    test_means + test_std,
    color="orangered",
    alpha=0.25
)
mp.plot(
    train_sizes,
    train_means,
    "o-",
    c="dodgerblue",
    label="Training",
)
mp.plot(
    train_sizes,
    test_means,
    "o-",
    c="orangered",
    label="Testing"
)
mp.legend()
mp.tight_layout()

mp.show()

VII. Support vector machines (SVM)

1. The classification boundary


The boundary must satisfy four conditions simultaneously (a minimal sketch follows the list):

  1. It classifies the samples correctly
  2. The support vectors are equidistant from the boundary
  3. The margin is maximized
  4. The boundary is linear (a line or a plane)
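A minimal sketch on synthetic data (not the original notes' example), showing that a linear SVM exposes the support vectors that determine its boundary via support_vectors_:

import numpy as np
import sklearn.svm as svm

# two linearly separable point clouds, for illustration only
x = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
    [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]
])
y = np.array([0, 0, 0, 1, 1, 1])

model = svm.SVC(kernel="linear")
model.fit(x, y)
print(model.support_vectors_)           # the samples closest to the boundary
print(model.coef_, model.intercept_)    # coefficients of the linear boundary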

2. Mapping to a higher dimension


For samples that cannot be separated linearly in a low-dimensional space, map them to a higher-dimensional space and look for the best linear classification boundary there.
Kernel function: the function used to lift the feature values into the higher-dimensional space.
Polynomial kernel: kernel="poly"
Radial basis function kernel: kernel="rbf", e.g. C=600, gamma=0.01

Code: svm_line.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple2.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=7)
model = svm.SVC(kernel="linear")
model.fit(train_x, train_y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))

mp.figure("SVM Linear Classification", facecolor="lightgray")
mp.title("SVM Linear Classification", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray"
)
C0, C1 = y == 0, y == 1
mp.scatter(
    x[C0][:, 0],
    x[C0][:, 1],
    c="orangered",
    s=60
)
mp.scatter(
    x[C1][:, 0],
    x[C1][:, 1],
    c="limegreen",
    s=60
)
mp.show()
  • When the class sizes are highly imbalanced, the minority class may be ignored by the SVM classifier; to counter this, set the class_weight parameter to "balanced", which equalizes the classes by adjusting the sample weights.

3. Confidence probabilities

svm.SVC(..., probability=True)
SVM classifier.predict_proba(input samples) -> confidence probability matrix

          Class 1  Class 2
Sample 1  0.99     0.01
Sample 2  0.02     0.98

Code: svm_prob.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple2.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=7)
model = svm.SVC(kernel="rbf", C=600, gamma=0.01, probability=True)
model.fit(train_x, train_y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))
prob_x = np.array([
    [2, 1.5],
    [8, 9],
    [4.8, 5.2],
    [4, 4],
    [2.5, 7],
    [7.6, 2],
    [5.4, 5.9]
])
print(prob_x)
pred_prob_y = model.predict(prob_x)
print(pred_prob_y)
probs = model.predict_proba(prob_x)
print(probs)

mp.figure("SVM RBF Classification", facecolor="lightgray")
mp.title("SVM RBF Classification", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray"
)
C0, C1 = y == 0, y == 1
mp.scatter(
    x[C0][:, 0],
    x[C0][:, 1],
    c="orangered",
    s=60
)
mp.scatter(
    x[C1][:, 0],
    x[C1][:, 1],
    c="limegreen",
    s=60
)

C0, C1 = pred_prob_y == 0, pred_prob_y == 1
mp.scatter(
    prob_x[C0][:, 0],
    prob_x[C0][:, 1],
    marker="D",
    c="dodgerblue",
    s=60
)
mp.scatter(
    prob_x[C1][:, 0],
    prob_x[C1][:, 1],
    marker="D",
    c="deeppink",
    s=60
)

for i in range(len(probs[C0])):
    mp.annotate(
        "{}% {}%".format(round(probs[C0][:, 0][i] * 100, 2), round(probs[C0][:, 1][i] * 100, 2)),
        xy=(prob_x[C0][:, 0][i], prob_x[C0][:, 1][i]),
        xytext=(12, -12),
        textcoords="offset points",
        horizontalalignment="left",
        verticalalignment="top",
        fontsize=9,
        bbox={
            "boxstyle": "round, pad=0.6",
            "fc": "deepskyblue",
            "alpha": 0.8
        }
    )
for i in range(len(probs[C1])):
    mp.annotate(
        "{}% {}%".format(round(probs[C1][:, 0][i] * 100, 2), round(probs[C1][:, 1][i] * 100, 2)),
        xy=(prob_x[C1][:, 0][i], prob_x[C1][:, 1][i]),
        xytext=(12, -12),
        textcoords="offset points",
        horizontalalignment="left",
        verticalalignment="top",
        fontsize=9,
        bbox={
            "boxstyle": "round, pad=0.6",
            "fc": "violet",
            "alpha": 0.8
        }
    )

mp.show()

4. Optimal hyperparameters

sklearn.model_selection.GridSearchCV(model, parameter grid, cv=number of CV folds) -> best model object
Parameter grid: [{parameter name: [list of values]}, {}, {}, ...]

Code: bhp.py

import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm

x, y = [], []
with open("D:/pythonStudy/AI/notes/multiple2.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=7)
params = [
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
    {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000], "gamma": [1, 0.1, 0.01, 0.001]}
]
model = ms.GridSearchCV(svm.SVC(probability=True), params, cv=5)
model.fit(train_x, train_y)
for param, score in zip(model.cv_results_["params"], model.cv_results_["mean_test_score"]):
    print(param, score)
print(model.best_params_)

Event prediction
Code: evt.py

import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm

class DigitEncoder:
    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)

data = []
# with open("D:/pythonStudy/AI/notes/event.txt", "r") as f:
with open("D:/pythonStudy/AI/notes/events.txt", "r") as f:
    for line in f.readlines():
        data.append(line[:-1].split(","))
data = np.delete(np.array(data).T, 1, 0)
encoders, x = [], []
for row in range(len(data)):
    if data[row, 0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=5)
model = svm.SVC(kernel="rbf", class_weight="balanced")
print(ms.cross_val_score(model, x, y, cv=3, scoring="f1_weighted").mean())
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / pred_test_y.size)
print(sm.confusion_matrix(test_y, pred_test_y))
print(sm.classification_report(test_y, pred_test_y))
data = [["Tuesday", "12:30:00", "21", "23"]]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))

5. Traffic flow prediction (support vector regression)

2 -> 4
3 -> 6
4 -> 8

y = kx + b
Use a support vector machine regression model (SVR) to predict traffic flow.

Code: trf.py

import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm

class DigitEncoder:
    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)

data = []
with open("D:/pythonStudy/AI/notes/traffic.txt", "r") as f:
    for line in f.readlines():
        data.append(line[:-1].split(","))
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    if data[row, 0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.25, random_state=5)
model = svm.SVR(kernel="rbf", C=10)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
data = [["Tuesday", "13:35", "San Francisco", "yes"]]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))

VIII. Clustering

1. K-means


Given the number of clusters in advance, randomly assign a center to each cluster, compute every sample's distance to each center, and assign each sample to the cluster of its nearest center. Then compute each cluster's geometric center, use it as the new cluster center, and re-partition the samples. Repeat until the computed geometric centers coincide with (or are sufficiently close to) the centers used in the previous iteration.

  • The number of clusters must be known in advance: take it from the business problem, or pick the value that optimizes some metric.
  • The clustering result is affected by the proportions of the samples.
  • The initial positions of the cluster centers affect the clustering result.

Code: km.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.cluster as sc

x = []
with open("D:/pythonStudy/AI/notes/multiple3.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data)
x = np.array(x)
model = sc.KMeans(init="k-means++", n_clusters=4)
model.fit(x)
centers = model.cluster_centers_

l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.01
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.01
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_y = model.predict(x)

mp.figure("K-Means Cluster", facecolor="lightgray")
mp.title("K-Means Cluster", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"

)
mp.scatter(
    x[:, 0], x[:, 1],
    c=pred_y,
    cmap="brg",
    s=60
)
mp.scatter(
    centers[:, 0],
    centers[:, 1],
    marker="+",
    c="gold",
    s=1000,
    linewidths=1
)

mp.show()

2. Image quantization

Code: quant.py

import numpy as np
import imageio as im
import sklearn.cluster as sc
import matplotlib.pyplot as mp

# as_gray=True loads the image as grayscale (a 2-D array)
image = im.imread("D:/pythonStudy/AI/notes/lily.jpg", as_gray=True).astype(np.uint8)
x = image.reshape(-1, 1)
model = sc.KMeans(n_clusters=4)
model.fit(x)
y = model.labels_
centers = model.cluster_centers_.squeeze()
z = centers[y]
quant = z.reshape(image.shape)

mp.figure("Original Image", facecolor="lightgray")
mp.title("Original Image", fontsize=20)
mp.axis("off")
mp.imshow(image, cmap="gray")
mp.figure("Quant Image", facecolor="lightgray")
mp.title("Quant Image", fontsize=20)
mp.axis("off")
mp.imshow(quant, cmap="gray")
mp.show()

3. Mean shift


Treat the training samples as a random sample drawn from some probability density function; through repeated iterations, look for the best pattern match. The peaks of the density function are the cluster centers, and the samples covered by a peak belong to that cluster.
The number of clusters does not need to be given in advance; the algorithm can discover it on its own.

Code: shift.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.cluster as sc

x = []
with open("D:/pythonStudy/AI/notes/multiple3.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data)
x = np.array(x)
bw = sc.estimate_bandwidth(x, n_samples=len(x), quantile=0.1) # quantile=0.1 controls the estimated bandwidth
model = sc.MeanShift(bandwidth=bw, bin_seeding=True) # bin_seeding=True ignores minor centers
model.fit(x)
centers = model.cluster_centers_

l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.01
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.01
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_y = model.predict(x)

mp.figure("MeanShift Cluster", facecolor="lightgray")
mp.title("MeanShift Cluster", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
# draw the grid as a colored mesh
mp.pcolormesh(
    grid_x[0],
    grid_x[1],
    grid_y,
    cmap="gray",
    shading="auto"

)
mp.scatter(
    x[:, 0], x[:, 1],
    c=pred_y,
    cmap="brg",
    s=60
)
mp.scatter(
    centers[:, 0],
    centers[:, 1],
    marker="+",
    c="gold",
    s=1000,
    linewidths=1
)

mp.show()

4. Agglomerative (hierarchical) clustering


Hierarchical clustering can be bottom-up (agglomerative) or top-down (divisive). In the bottom-up version, every training sample starts as its own cluster, and clusters are repeatedly merged according to the similarity between samples until the number of clusters reaches the specified value. In the top-down version, all training samples start as one large cluster, which is repeatedly split according to the dissimilarity between samples until the number of clusters reaches the specified value.

Code: agglo.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.cluster as sc

x = []
with open("D:/pythonStudy/AI/notes/multiple3.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data)
x = np.array(x)
model = sc.AgglomerativeClustering(n_clusters=4)
pred_y = model.fit_predict(x)
# no cluster centers; fitting and prediction happen in one step

mp.figure("Agglomerative Cluster", facecolor="lightgray")
mp.title("Agglomerative Cluster", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")

mp.scatter(
    x[:, 0], x[:, 1],
    c=pred_y,
    cmap="brg",
    s=60
)

mp.show()


Unlike other, center-based clustering algorithms, agglomerative clustering can preferentially merge samples that are spatially contiguous even when they are not mutually closest, so the resulting clusters show strong spatial continuity.

Code: spiral.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.cluster as sc
import sklearn.neighbors as nb

n_samples = 500
t = 2.5 * np.pi * (1 + 2 * np.random.rand(n_samples, 1))
x = 0.05 * t * np.cos(t)
y = 0.05 * t * np.sin(t)
n = 0.05 * np.random.rand(n_samples, 2)
x = np.hstack((x, y)) + n

model_nonc = sc.AgglomerativeClustering(linkage="average", n_clusters=3)
pred_y_nonc = model_nonc.fit_predict(x)
conn = nb.kneighbors_graph(x, 10, include_self=False)
model_conn = sc.AgglomerativeClustering(linkage="average", n_clusters=3, connectivity=conn)
pred_y_conn = model_conn.fit_predict(x)

mp.figure("NoneConnectivity", facecolor="lightgray")
mp.title("NoneConnectivity", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")

mp.scatter(
    x[:, 0], x[:, 1],
    c=pred_y_nonc,
    cmap="brg",
    s=60
)
mp.axis("equal")
mp.figure("Connectivity", facecolor="lightgray")
mp.title("Connectivity", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")

mp.scatter(
    x[:, 0], x[:, 1],
    c=pred_y_conn,
    cmap="brg",
    s=60
)
mp.axis("equal")
mp.show()

5. DBSCAN


"A friend of a friend is also a friend."
Starting from any training sample, draw a circle of a given radius; every sample inside the circle belongs to the same cluster as the center sample. Repeat the process with these same-cluster samples as new centers until no new samples join the cluster, then continue with the remaining samples until all clusters in the sample space are found. Samples that belong to no cluster are called outlier (offset) samples, samples on the edge of a cluster are called periphery samples, and the rest are called core samples.

Code: dbscan.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.cluster as sc

x = []
with open("D:/pythonStudy/AI/notes/perf.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data)
x = np.array(x)
# min_samples: the minimum number of samples in a cluster
model = sc.DBSCAN(eps=0.8, min_samples=5)
pred_y = model.fit_predict(x)
core_mask = np.zeros(len(x), dtype=bool)
core_mask[model.core_sample_indices_] = True
offset_mask = model.labels_ == -1
periphery_mask = ~(core_mask | offset_mask)

mp.figure("DBSCAN Cluster", facecolor="lightgray")
mp.title("DBSCAN Cluster", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
labels = set(pred_y)
cs = mp.get_cmap("brg", len(labels))(range(len(labels))) # cs is the array of colors
mp.scatter(
    x[core_mask][:, 0],
    x[core_mask][:, 1],
    c=cs[pred_y[core_mask]],
    s=60,
    label="Core"
)
mp.scatter(
    x[periphery_mask][:, 0],
    x[periphery_mask][:, 1],
    edgecolor=cs[pred_y[periphery_mask]],
    facecolor="none",
    s=60,
    label="Periphery"
)
mp.scatter(
    x[offset_mask][:, 0],
    x[offset_mask][:, 1],
    marker="x",
    c=cs[pred_y[offset_mask]],
    s=60,
    label="Offset"
)
mp.legend()
mp.show()

6. Silhouette coefficient


Measures how compact clusters are internally and how well separated they are from each other.
The silhouette coefficient is built from two quantities:
a: the average distance from a sample to the other samples in its own cluster.
b: the average distance from a sample to the samples of the nearest other cluster.
The silhouette coefficient of this one sample is
$s = \frac{b-a}{\max(a, b)}$
For a data set, the silhouette coefficient is the average of the silhouette coefficients of all its samples. It lies in [-1, 1]: 1 indicates a perfect clustering, -1 a wrong clustering, and 0 overlapping clusters.

Code: score.py

import numpy as np
import matplotlib.pyplot as mp
import sklearn.cluster as sc
import sklearn.metrics as sm

x = []
with open("D:/pythonStudy/AI/notes/perf.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        x.append(data)
x = np.array(x)
clstrs, scores, models = np.arange(2, 11), [], []
for n_clusters in clstrs:
    model = sc.KMeans(init="k-means++", n_clusters=n_clusters)
    model.fit(x)
    score = sm.silhouette_score(x, model.labels_, sample_size=len(x), metric="euclidean")
    scores.append(score)
    models.append(model)
scores = np.array(scores)
best_index = scores.argmax()
best_clstr = clstrs[best_index]
best_score = scores[best_index]
print("best_class:", best_clstr, ", best_score:", best_score)
best_model = models[best_index]
centers = best_model.cluster_centers_

IX. Recommendation engines

1. Pipelines

input -> learning model -> output -> learning model -> output -> ...
A pipeline is essentially a cascade of function calls: the return value of one function is passed as the argument of the next.

Code: map.py, reduce.py

def f1(x):
    return x + 3

x = 1
y = f1(1)
print(y)

X = [1, 2, 3]
# Y = list(map(f1, X))
Y = list(map(lambda x: x+3, X))
print(Y)
import functools

def f1(x, y):
    print("f1:", x, y)
    return x + y

a = [1, 2, 3]
print(a)
b = sum(a)
print(b)
c = functools.reduce(f1, a)
print(c)

Code: cc1.py, cc2.py

import functools

def f1(x):
    return x + 3

def f2(x):
    return x * 6

def f3(x):
    return x - 9

def function_composer(*fs):
    return functools.reduce(lambda fa, fb: lambda x: fa(fb(x)), fs)

a = 1
b = f3(f2(f1(a)))
print(b)
c = functools.reduce(lambda fa, fb: lambda x: fa(fb(x)), [f3, f2, f1])
print(c(a))
d = function_composer(f3, f2, f1)(a)
print(d)
import functools

def f1(a):
    return map(lambda x: x+3, a)

def f2(a):
    return map(lambda x: x*6, a)

def f3(a):
    return map(lambda x: x-9, a)

def function_composer(*fs):
    return functools.reduce(lambda fa, fb: lambda x: fa(fb(x)), fs)

a = [1, 2, 3]
b = list(f3(f2(f1(a))))
print(b)
c = functools.reduce(lambda fa, fb: lambda x: fa(fb(x)), [f3, f2, f1])
print(list(c(a)))
d = list(function_composer(f3, f2, f1)(a))
print(d)

Code: pipe.py

import numpy as np
import sklearn.datasets as sd
import sklearn.feature_selection as fs
import sklearn.ensemble as se
import sklearn.pipeline as pl
import sklearn.model_selection as ms
import matplotlib.pyplot as mp

x, y = sd._samples_generator.make_classification(n_informative=4, n_features=20, n_redundant=0, random_state=5)
# n_informative: number of informative features; n_features: total number of features
skb = fs.SelectKBest(fs.f_regression, k=5)
rfc = se.RandomForestClassifier(n_estimators=25, max_depth=4)
model = pl.Pipeline([("selector", skb), ("classifier", rfc)])
print(ms.cross_val_score(model, x, y, cv=10, scoring="f1_weighted").mean())
model.set_params(selector__k=2, classifier__n_estimators=10)
print(ms.cross_val_score(model, x, y, cv=10, scoring="f1_weighted").mean())
model.fit(x, y)
selected_mask = model.named_steps["selector"].get_support() # returns the selected-feature mask
print(selected_mask)
selected_indices = np.where(selected_mask)[0]
print(selected_indices)
x = x[:, selected_indices]
model.fit(x, y)

l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure("Selector-Classifier Pipeline", facecolor="lightgray")
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap="Dark2")
mp.scatter(x[:, 0], x[:, 1], c=y, cmap="cool", s=60)

mp.show()

2. Finding nearest neighbors (FNN)

A lazy learning algorithm.

sklearn.neighbors.NearestNeighbors(n_neighbors=number of neighbors, algorithm=algorithm, e.g. "ball_tree") -> FNN model
FNN model.fit(known samples)
FNN model.kneighbors(query samples) -> distance matrix, nearest-neighbor index matrix

Code: fnn.py

import numpy as np
import sklearn.neighbors as sn
import matplotlib.pyplot as mp
import matplotlib.patches as mc

train_x = np.array([
    [6, 7],
    [4.7, 8.5],
    [3.4, 8.5],
    [2, 7],
    [2, 5],
    [3.4, 3],
    [6, 2],
    [8.6, 3],
    [10, 5],
    [10, 7],
    [8.6, 8.5],
    [7.3, 8.5],
])

model = sn.NearestNeighbors(n_neighbors=3, algorithm="ball_tree")
model.fit(train_x)
test_x = np.array([
    [4.7, 8],
    [4, 6.5],
    [4, 6],
    [4.7, 5],
    [5.7, 4.6],
    [6.3, 4.6],
    [7.3, 5],
    [8, 6],
    [8, 6.5],
    [7.3, 8],
])
nn_distance, nn_indices = model.kneighbors(test_x)
print(nn_distance, nn_indices, sep="\n")

mp.figure("Find Nearest Neighbors", facecolor="lightgray")
mp.title("Find Nearest Neighbors", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(train_x[:, 0], train_x[:, 1], c="k", zorder=2)
cs = mp.get_cmap("gist_rainbow", len(nn_indices))(range(len(nn_indices)))
for i, (x, nn_index) in enumerate(zip(test_x, nn_indices)):
    mp.gca().add_patch(mc.Polygon(train_x[nn_index], ec="none", fc=cs[i], alpha=0.25, zorder=0))
    mp.scatter(x[0], x[1], c=cs[i], s=80, zorder=1)
mp.axis("equal")

mp.show()

3. KNN分类和回归


遍历训练集中的所有样本,计算每个样本与待测样本的距离,并从中挑选出K个最近邻。根据与距离成反比的权重,做加权投票(分类)或加权平均(回归),得到待测样本的类别标签或预测值。
代码:knnc.py

import numpy as np
import sklearn.neighbors as sn
import matplotlib.pyplot as mp

train_x, train_y = [], []
with open("D:/pythonStudy/AI/notes/knn.txt", "r") as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(",")]
        train_x.append(data[:-1])
        train_y.append(data[-1])
train_x = np.array(train_x)
train_y = np.array(train_y, dtype=int)
model = sn.KNeighborsClassifier(n_neighbors=10, weights="distance")
model.fit(train_x, train_y)
l, r, h = train_x[:, 0].min() - 1, train_x[:, 0].max() + 1, 0.005
b, t, v = train_x[:, 1].min() - 1, train_x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
test_x = np.array([
    [2.2, 6.2],
    [3.6, 1.8],
    [4.5, 3.6],
])
pred_test_y = model.predict(test_x)
nn_distances, nn_indices = model.kneighbors(test_x)

mp.figure("KNN Classification", facecolor="lightgray")
mp.title("KNN Classification", fontsize=14)
mp.xlabel("x", fontsize=12)
mp.ylabel("y", fontsize=12)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap="gray")
classes = np.unique(train_y)
classes.sort()
cs = mp.get_cmap("brg", len(classes))(classes)
mp.scatter(train_x[:, 0], train_x[:, 1], c=cs[train_y], s=60)
mp.scatter(test_x[:, 0], test_x[:, 1], marker="D", c=cs[pred_test_y], s=60)
for nn_index, y in zip(nn_indices, pred_test_y):
    mp.scatter(train_x[nn_index, 0], train_x[nn_index, 1], marker="D", edgecolor=cs[np.ones_like(nn_index) * y], facecolor="none", s=180)
mp.show()

代码:knnr.py

import numpy as np
import sklearn.neighbors as sn
import matplotlib.pyplot as mp


train_x = np.random.rand(100, 1) * 10 - 5
train_y = np.sinc(train_x).ravel()
train_y += 0.2 * (0.5 - np.random.rand(train_y.size))
model = sn.KNeighborsRegressor(n_neighbors=10, weights="distance")
model.fit(train_x, train_y)
test_x = np.linspace(-5, 5, 10000).reshape(-1, 1)
test_y = np.sinc(test_x).ravel()
pred_test_y = model.predict(test_x)

mp.figure("KNN Regression", facecolor="lightgray")
mp.title("KNN Regression", fontsize=14)
mp.xlabel("x", fontsize=12)
mp.ylabel("y", fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(train_x, train_y, c="dodgerblue", s=60, label="Training")
mp.plot(
    test_x, test_y,
    "--",
    c="limegreen",
    linewidth=1,
    label="Testing"
)
mp.plot(
    test_x, pred_test_y,
    "--",
    c="orangered",
    linewidth=1,
    label="Predicted Testing"
)
mp.legend()
mp.show()

4. 欧氏(欧几里得)距离

(x1, y1) <-----> (x2, y2)
二维:$\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$
三维:$\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$
(a, b, c, ...) <-----> (A, B, C, ...)

欧氏距离得分 = $\frac{1}{1+欧氏距离}$
【0 <--不相似-----相似--> 1】
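
下面给出一个最小的演算示例(评分数据为随意假设,并非原笔记中的文件),直接按上面的公式计算两个评分向量之间的欧氏距离及其得分:

import numpy as np

x = np.array([4.0, 3.5, 5.0])  # 用户1对三部电影的评分(假设数据)
y = np.array([4.5, 3.0, 4.0])  # 用户2对同样三部电影的评分(假设数据)
distance = np.sqrt(((x - y) ** 2).sum())  # 欧氏距离
score = 1 / (1 + distance)  # 欧氏距离得分,越接近1越相似
print(distance, score)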

代码:es.py
$\begin{matrix} & 用户1 & 用户2 & 用户3 & ... \\ 用户1 & 1 & 0.8 & 0.9 & ... \\ 用户2 & 0.8 & 1 & 0.7 & ... \\ 用户3 & ... & & & \\ \end{matrix}$

import json
import numpy as np

with open("D:/pythonStudy/AI/notes/ratings.json", "r") as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = 1 / (1 + np.sqrt(((x-y)**2).sum()))
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
for scrow in scmat:
    print(" ".join("{:>5.2f}".format(score) for score in scrow))

5. 皮(尔逊)氏距离得分

用两个样本的皮尔逊相关系数(取值范围-1到1)表示相似度。
$\begin{matrix} & A & B & C \\ 1 & 5 & 1 & 3 \\ 2 & 10 & 0 & 5 \\ \end{matrix}$
代码:ps.py

import json
import numpy as np

with open("D:/pythonStudy/AI/notes/ratings.json", "r") as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = np.corrcoef(x, y)[0, 1]
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
for scrow in scmat:
    print(" ".join("{:>5.2f}".format(score) for score in scrow))

根据样本的相似程度排序
代码:sim.py

import json
import numpy as np

with open("D:/pythonStudy/AI/notes/ratings.json", "r") as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = np.corrcoef(x, y)[0, 1]
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
for scrow in scmat:
    print(" ".join("{:>5.2f}".format(score) for score in scrow))
for i, user in enumerate(users):
    sorted_indices = scmat[i].argsort()[::-1]
    sorted_indices = sorted_indices[sorted_indices != i]
    similar_users = users[sorted_indices]
    similar_scores = scmat[i, sorted_indices]
    print(user, "->", similar_users, similar_scores)

生成针对每个用户的推荐列表
代码:rcm.py

import json
import numpy as np

with open("D:/pythonStudy/AI/notes/ratings.json", "r") as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = np.corrcoef(x, y)[0, 1]
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
for scrow in scmat:
    print(" ".join("{:>5.2f}".format(score) for score in scrow))
for i, user in enumerate(users):
    sorted_indices = scmat[i].argsort()[::-1]
    sorted_indices = sorted_indices[sorted_indices != i]
    similar_users = users[sorted_indices]
    similar_scores = scmat[i, sorted_indices]
    positive_mask = similar_scores > 0
    similar_users = similar_users[positive_mask]
    similar_scores = similar_scores[positive_mask]
    score_sums, weight_sums = {}, {}
    for similar_user, similar_score in zip(similar_users, similar_scores):
        for movie, score in ratings[similar_user].items():
            if movie not in score_sums.keys():
                score_sums[movie] = 0
            score_sums[movie] += score * similar_score
            if movie not in weight_sums.keys():
                weight_sums[movie] = 0
            weight_sums[movie] += similar_score
    movie_ranks = {}
    for movie, score_sum in score_sums.items():
        movie_ranks[movie] = score_sum / weight_sums[movie]
    sorted_indice = np.array(list(movie_ranks.values())).argsort()[::-1]
    recomms = np.array(list(movie_ranks.keys()))[sorted_indice]
    print(user, "->", recomms)

十、文本分析

import nltk - 自然语言工具包
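
注意:本章后面的分词、词形还原、停用词、人名和影评语料都依赖 NLTK 的数据包,首次使用前需要先下载。下面是一个小的下载脚本示例(数据包清单按本章用到的功能推测,请按实际环境取舍):

import nltk

# 按需下载数据包,已下载的会自动跳过
for pkg in ["punkt",          # sent_tokenize / word_tokenize 所需的分词模型
            "wordnet",        # WordNetLemmatizer 词形还原所需
            "stopwords",      # 停用词表
            "names",          # 性别识别示例使用的人名语料
            "movie_reviews"]: # 情感分析示例使用的影评语料
    nltk.download(pkg)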

1. 分词

从完整的文章或段落中,划分出若干独立的语义单元,如句或者词。

代码:tkn.py

import nltk.tokenize as tk

doc = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print(i + 1, token)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print(i + 1, token)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print(i + 1, token)

2. 词干提取

从单词中抽取主要成分,未必是合法的词汇。

代码:stm.py

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ["table", "probably", "wolves", "playing", "is", "dog", "the", "beaches", "grounded", "dreamt", "envision"]
porter = pt.PorterStemmer() # 最宽松,保留得多
lancaster = lc.LancasterStemmer() # 最严格,保留得少
snowball = sb.SnowballStemmer("english") # 折中
for word in words:
    pstem = porter.stem(word)
    lstem = lancaster.stem(word)
    sstem = snowball.stem(word)
    print("{:10} {:10} {:10} {:10}".format(word, pstem, lstem, sstem))

3. 词形还原

从名词或动词中抽取原型成分,依然保证其合法性。

代码:lmm.py

import nltk.stem as ns

words = ["table", "probably", "wolves", "playing", "is", "dog", "the", "beaches", "grounded", "dreamt", "envision"]
lmm = ns.WordNetLemmatizer()
for word in words:
    n = lmm.lemmatize(word, pos="n")
    v = lmm.lemmatize(word, pos="v")
    print("{:10} {:10} {:10}".format(word, n, v))

4. 词袋模型

the brown dog is running
the black dog is in the black room
running in the room is forbidden
$\begin{matrix} & the & brown & dog & is & running & black & in & room & forbidden \\ 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 2 & 2 & 0 & 1 & 1 & 0 & 2 & 1 & 1 & 0 \\ 3 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\ \end{matrix}$

代码:bow.py

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = "The brown dog is running. The black dog is in the black room. Running in the room is forbidden."
print(doc)
sentences = tk.sent_tokenize(doc)
for i, sentence in enumerate(sentences):
    print(i+1, sentence)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names_out()
print(words)

5. 词频

词频 = $\frac{单词在句子中出现的次数}{句子的总单词数}$

代码:tf.py

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp

doc = "The brown dog is running. The black dog is in the black room. Running in the room is forbidden."
print(doc)
sentences = tk.sent_tokenize(doc)
for i, sentence in enumerate(sentences):
    print(i+1, sentence)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names_out()
print(words)
# l1范数:绝对值之和;l2范数:平方和的平方根
tf = sp.normalize(bow, norm="l1")
print(tf)

6. 词频逆文档频率(TF-IDF)

词频 $\times$ 逆文档频率 = 词频 $\times \frac{总样本数}{包含该单词的样本数}$

代码:tfidf.py

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = "The brown dog is running. The black dog is in the black room. Running in the room is forbidden."
print(doc)
sentences = tk.sent_tokenize(doc)
for i, sentence in enumerate(sentences):
    print(i+1, sentence)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names_out()
print(words)
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)

文本分类,核心问题是预测文本所属的类别
xxxxxxxx -> 加解密

xxxxxxxx -> 摩托车

xxxxxxxx -> 棒球

xxxxxxxx -> ?

代码:doc.py

import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

train = sd.load_files("D:/pythonStudy/AI/notes/20news", encoding="latin1", shuffle=True, random_state=7)
train_data = train.data
train_y = train.target
categories = train.target_names
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
tt = ft.TfidfTransformer()
# 不调用toarray(),保持稀疏矩阵,节省内存
train_x = tt.fit_transform(train_bow)
# 多项分布的朴素贝叶斯模型
model = nb.MultinomialNB()
model.fit(train_x, train_y)
test_data = [
    "The curveballs of right handed pitchers tend to curve to the left.",
    "Caesar cipher is an ancient form of encryption",
    "This two-wheeler is really good on slippery roads"
]
test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pred_test_y = model.predict(test_x)
for sentence, index in zip(test_data, pred_test_y):
    print(sentence, "->", categories[index])

性别识别
代码:gndr.py

import random
import numpy as np
import nltk.corpus as nc
import nltk.classify as cf

male_names = nc.names.words("male.txt")
female_names = nc.names.words("female.txt")
models, acs = [], []
for n_letters in range(1, 6):
    data = []
    for male_name in male_names:
        feature = {"feature": male_name[-n_letters:].lower()}
        data.append((feature, "male"))
    for female_name in female_names:
        feature = {"feature": female_name[-n_letters:].lower()}
        data.append((feature, "female"))
    random.seed(7)
    random.shuffle(data)
    train_data = data[:int(len(data) / 2)]
    test_data = data[int(len(data) / 2):]
    model = cf.NaiveBayesClassifier.train(train_data)
    ac = cf.accuracy(model, test_data)
    models.append(model)
    acs.append(ac)
best_index = np.array(acs).argmax()
best_letters = best_index + 1
best_model = models[best_index]
best_ac = acs[best_index]
print(best_letters, best_ac)
names = ["Leonardo", "Amy", "Sam", "Tom", "Katherine", "Taylor", "Susanne"]
genders = []
for name in names:
    feature = {"feature": name[-best_letters:].lower()}
    gender = best_model.classify(feature)
    genders.append(gender)
for name, gender in zip(names, genders):
    print(name, "->", gender)

情感分析
xxx xxx xxx … xxx
True False False … True -> POSITIVE
代码:sent.py

import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

pdata = []
fileids = nc.movie_reviews.fileids("pos")
for fileid in fileids:
    feature = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        feature[word] = True
    pdata.append((feature, "POSITIVE"))
ndata = []
fileids = nc.movie_reviews.fileids("neg")
for fileid in fileids:
    feature = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        feature[word] = True
    ndata.append((feature, "NEGATIVE"))

pnum, nnum = int(len(pdata) * 0.8), int(len(ndata) * 0.8)
train_data = pdata[:pnum] + ndata[:nnum]
test_data = pdata[pnum:] + ndata[nnum:]
model = cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)
print(ac)
tops = model.most_informative_features()
print(tops[:10])
reviews = [
    "It is an amazing movie.",
    "This is a dull movie. I would never recommend it to anyone.",
    "The cinematography is pretty great in this movie.",
    "The direction was terrible and the story was all over the place."
]
sents, probs = [], []
for review in reviews:
    feature = {}
    words = review.split()
    for word in words:
        feature[word] = True
    pcls = model.prob_classify(feature)
    sent = pcls.max()
    prob = pcls.prob(sent)
    sents.append(sent)
    probs.append(prob)
for review, sent, prob in zip(reviews, sents, probs):
    print(review, "->", sent, round(prob, 2))

主题词抽取
import gensim.models.ldamodel
LDA, Latent Dirichlet Allocation
隐含狄利克雷分布
代码:topic.py

import warnings
import nltk.tokenize as tk
import nltk.corpus as sc
import nltk.stem.snowball as sb
import gensim.models.ldamodel as gm
import gensim.corpora as gc
warnings.filterwarnings("ignore", category=UserWarning)

doc = []
with open("D:/pythonStudy/AI/notes/topic.txt", "r") as f:
    for line in f.readlines():
        doc.append(line[:-1])
tokenizer = tk.RegexpTokenizer(r"\w+")
stopwords = sc.stopwords.words("english")
stemmer = sb.SnowballStemmer("english")
lines_tokens = []
for line in doc:
    tokens = tokenizer.tokenize(line.lower())
    line_tokens = []
    for token in tokens:
        if token not in stopwords:
            token = stemmer.stem(token)
            line_tokens.append(token)
    lines_tokens.append(line_tokens)
dic = gc.Dictionary(lines_tokens)
bow = []
for line_tokens in lines_tokens:
    row = dic.doc2bow(line_tokens)
    bow.append(row)
n_topics = 2
model = gm.LdaModel(bow, num_topics=n_topics, id2word=dic)
topics = model.print_topics(num_topics=n_topics, num_words=4)
print(topics)

十一、音频识别

1. 模拟音频和数字音频

录音:声带 --> 机械振动 --> 频率+响度 --> 声场强度=f(时间) --> 麦克风 --> 电压/电流=f(时间)
      --A/D(采样、量化)--> 数字音频 --> 存储 --> .wav文件 / 传输
回放:.wav文件 --> 回放软件 --D/A--> 电压/电流=f(时间) --> 播放器机械振动 --> 耳朵
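
下面用一个小示例(内容为假设,不属于原笔记文件)演示采样、量化、存储这一步:用 numpy 生成 1 秒的 1kHz 正弦波,按 44100Hz 采样、16 位量化后写入 .wav 文件(输出路径为假设):

import numpy as np
import scipy.io.wavfile as wf

sample_rate = 44100  # 采样率:每秒采样点数
times = np.arange(0, 1, 1 / sample_rate)  # 1秒的时间轴
sigs = np.sin(2 * np.pi * 1000 * times)  # 1kHz正弦波,幅值在[-1, 1]
sigs = (sigs * (2 ** 15 - 1)).astype(np.int16)  # 16位量化
wf.write("C:/Users/Hasee/Desktop/tone_1khz.wav", sample_rate, sigs)  # 存储为.wav文件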

2. 借助傅里叶变换提取频率特征

代码:sig.py

import numpy as np
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp

sample_rate, sigs = wf.read("D:/pythonStudy/AI/notes/freq.wav")
sigs = sigs / 2 ** 15
times = np.arange(len(sigs)) / sample_rate
freqs = np.fft.fftfreq(len(sigs), d=1/sample_rate)
ffts = np.fft.fft(sigs)
pows = np.abs(ffts)

mp.figure("Audio Signal", facecolor="lightgray")
mp.title("Audio Signal", fontsize=20)
mp.xlabel("Time", fontsize=14)
mp.ylabel("Signal", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(
    times, sigs,
    c="dodgerblue",
    label="Signal"
)
mp.legend()

mp.figure("Audio Frequency", facecolor="lightgray")
mp.title("Audio Frequency", fontsize=20)
mp.xlabel("Frequency", fontsize=14)
mp.ylabel("Power", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(
    freqs[freqs >= 0], pows[freqs >= 0],
    c="orangered",
    label="Frequency"
)
mp.legend()
mp.show()

3. MFCC

在频率特征的基础上结合语音的特点选择主要成分——MFCC,梅尔频率倒谱系数
$\begin{matrix} & 关键频率1 & 关键频率2 & 关键频率3 & ... \\ 时域区间1 & 30 & 40 & 20 & ... \\ 时域区间2 & 10 & 20 & 50 & ... \\ 时域区间3 & 40 & 30 & 60 & ... \\ \end{matrix}$
代码:mfcc.py

import scipy.io.wavfile as wf
import matplotlib.pyplot as mp
import python_speech_features as sf

# sample_rate, sigs = wf.read("D:/pythonStudy/AI/notes/freq.wav")
sample_rate, sigs = wf.read("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/speeches/training/pineapple/pineapple01.wav")
mfcc = sf.mfcc(sigs, sample_rate)

mp.matshow(mfcc, cmap="gist_rainbow", fignum="MFCC")
mp.title("MFCC", fontsize=20)
mp.xlabel("Feature", fontsize=14)
mp.ylabel("Sample", fontsize=14)
mp.tick_params(which="both", top=False, labeltop=False, labelbottom=True, labelsize=10)
mp.show()

4. 语音识别

HMM:隐马尔科夫模型
音频样本 -> MFCC -> HMM -> 标签

代码:spch.py

import os
import warnings
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl
warnings.filterwarnings("ignore", category=DeprecationWarning)
np.seterr(all="ignore")

def search_speeches(directory, speeches):
    directory = os.path.normpath(directory)
    if not os.path.isdir(directory):
        raise(IOError("The directory" + directory + "' doesn't exist!"))
    for entry in os.listdir(directory):
        label = directory[directory.rfind(os.path.sep) + 1:]
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            search_speeches(path, speeches)
        elif os.path.isfile(path) and path.endswith(".wav"):
            if label not in speeches:
                speeches[label] = []
            speeches[label].append(path)
def get_data(path):
    train_speeches = {}
    search_speeches(path, train_speeches)
    train_x, train_y = [], []
    for label, filenames in train_speeches.items():
        mfccs = np.array([])
        for filename in filenames:
            sample_rate, sigs = wf.read(filename)
            mfcc = sf.mfcc(sigs, sample_rate)
            if len(mfccs) == 0:
                mfccs = mfcc
            else:
                mfccs = np.append(mfccs, mfcc, axis=0) # axis=0为纵向拼接
        train_x.append(mfccs)
        train_y.append(label)
    return train_x, train_y
train_x, train_y = get_data("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/speeches/training")
test_x, test_y = get_data("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/speeches/testing")
models = {}
for mfccs, label in zip(train_x, train_y):
    model = hl.GaussianHMM(n_components=4, covariance_type="diag", n_iter=1000)
    models[label] = model.fit(mfccs)

pred_test_y = []
for mfccs in test_x:
    best_score, best_label = None, None
    for label, model in models.items():
        score = model.score(mfccs)
        if (best_score is None) or (best_score < score):
            best_score, best_label = score, label
    pred_test_y.append(best_label)
for y, pred_y in zip(test_y, pred_test_y):
    print(y, "->", pred_y)

声音 -> 数字音频流 -> MFCC -> 学习模型 -> 文本 -> TFIDF -> 模型 -> 语义 -> 应答文本 -> 语音
|<---------------------语音识别------------------------->|<-------自然语言处理------->|<-----语音合成----->|

十二、图像识别

1. 机器视觉工具包

OpenCV-Python

代码:basic.py

import cv2 as cv
import numpy as np

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/forest.jpg")
# print(original.shape)
cv.imshow("Original", original)
blue = np.zeros_like(original)
blue[..., 0] = original[..., 0] # 0 - 蓝色通道
cv.imshow("Blue", blue)
green = np.zeros_like(original)
green[..., 1] = original[..., 1] # 1 - 绿色通道
cv.imshow("Green", green)
red = np.zeros_like(original)
red[..., 2] = original[..., 2] # 2 - 红色通道
cv.imshow("Red", red)
h, w = original.shape[:2]
l, t = int(w / 4), int(h / 4)
r, b = int(w * 3 / 4), int(h * 3 / 4)
cropped = original[t:b, l:r]
cv.imshow("Cropped", cropped)
'''
scaled = cv.resize(original, (w * 2, int(h / 2)), interpolation=cv.INTER_LINEAR)
'''
scaled = cv.resize(original, None, fx=2, fy=0.5, interpolation=cv.INTER_LINEAR)
cv.imshow("Scaled", scaled)
cv.waitKey()
cv.imwrite("C:/Users/Hasee/Desktop/green.jpg", green)

2. 边缘检测

代码:edge.py

import cv2 as cv
import numpy as np

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/chair.jpg")
cv.imshow("Original", original)
canny = cv.Canny(original, 50, 240)
cv.imshow("Canny", canny)
cv.waitKey()

3. 直方图均衡化提升亮度

代码:eq.py

import cv2 as cv
import numpy as np

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/sunrise.jpg")
cv.imshow("Original", original)
gray = cv.cvtColor(original, cv.COLOR_BGR2GRAY) # 彩色变灰度
cv.imshow("Gray", gray)
equalized_gray = cv.equalizeHist(gray) # 均衡直方
cv.imshow("Equalized", equalized_gray)
yuv = cv.cvtColor(original, cv.COLOR_BGR2YUV) # 彩色变yuv(y亮度,u色度,v饱和度)
cv.imshow("YUV", yuv)
yuv[..., 0] = cv.equalizeHist(yuv[..., 0])
equalized_color = cv.cvtColor(yuv, cv.COLOR_YUV2BGR)
cv.imshow("Equalized Color", equalized_color)
cv.waitKey()

4. 角点检测

代码:corner.py

import cv2 as cv
import numpy as np

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/box.png")
cv.imshow("Original", original)
gray = cv.cvtColor(original, cv.COLOR_BGR2GRAY)
cv.imshow("Gray", gray)
corners = cv.cornerHarris(gray, 7, 5, 0.04) # (灰度图, 邻域大小blockSize, Sobel孔径ksize, Harris响应系数k)
corners = cv.dilate(corners, None) # 膨胀,放大角点标记便于显示
mixture = original.copy()
mixture[corners > corners.max() * 0.01] = [0, 0, 255]
cv.imshow("Mixture", mixture)

cv.waitKey()

5. star特征检测

代码:star.py

import cv2 as cv
import numpy as np

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/table.jpg")
cv.imshow("Original", original)
gray = cv.cvtColor(original, cv.COLOR_BGR2GRAY)
cv.imshow("Gray", gray)
star = cv.xfeatures2d.StarDetector_create() # xfeatures2d 需要安装 opencv-contrib-python 扩展包
keypoints = star.detect(gray)
mixture = original.copy()
cv.drawKeypoints(original, keypoints, mixture, flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv.imshow("Mixture", mixture)

cv.waitKey()

6. sift特征检测

代码:sift.py

import cv2 as cv
import numpy as np

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/table.jpg")
cv.imshow("Original", original)
gray = cv.cvtColor(original, cv.COLOR_BGR2GRAY)
cv.imshow("Gray", gray)
sift = cv.SIFT_create()
keypoints = sift.detect(gray)
mixture = original.copy()
cv.drawKeypoints(original, keypoints, mixture, flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv.imshow("Mixture", mixture)

cv.waitKey()

7. 特征(描述)矩阵

代码:desc.py

import cv2 as cv
import numpy as np
import matplotlib.pyplot as mp

original = cv.imread("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/table.jpg")
cv.imshow("Original", original)
gray = cv.cvtColor(original, cv.COLOR_BGR2GRAY)
cv.imshow("Gray", gray)
sift = cv.SIFT_create()
keypoints = sift.detect(gray)
keypoints, desc = sift.compute(gray, keypoints)
mixture = original.copy()
cv.drawKeypoints(original, keypoints, mixture, flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv.imshow("Mixture", mixture)

mp.matshow(desc, cmap="jet", fignum="Description")
mp.title("Description", fontsize=20)
mp.xlabel("Feature", fontsize=14)
mp.ylabel("Sample", fontsize=14)
mp.tick_params(which="both", top=False, labeltop=False, labelbottom=True, labelsize=10)
mp.show()

8. 图像识别

代码:obj.py

import os
import warnings
import numpy as np
import cv2 as cv
import hmmlearn.hmm as hl
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.seterr(all="ignore")

def search_objects(directory):
    """os.walk(dir) 生成迭代器对象,会迭代dir文件夹内的所有文件和文件夹,"""
    directory = os.path.normpath(directory)
    if not os.path.isdir(directory):
        raise IOError("The directory '" + directory + "' doesn't exist!")
    objects = {}
    for curdir, subdir, files in os.walk(directory):
        for jpeg in (file for file in files if file.endswith(".jpg")):
            path = os.path.join(curdir, jpeg)
            label = path.split(os.path.sep)[-2]
            if label not in objects:
                objects[label] = []
            objects[label].append(path)
    return objects

train_objects = search_objects("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/objects/training")
train_x, train_y = [], []
for label, filenames in train_objects.items():
    descs = np.array([])
    for filename in filenames:
        image = cv.imread(filename)
        gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
        # 尺寸统一等比例缩放
        h, w = gray.shape[:2]
        f = 200 / min(h, w)
        gray = cv.resize(gray, None, fx=f, fy=f)
        # gftt = cv.GFTTDetector_create() # 训练速度慢
        fast = cv.FastFeatureDetector_create()
        sift = cv.SIFT_create()
        # keypoints = gftt.detect(gray)
        keypoints = fast.detect(gray)
        _, desc = sift.compute(gray, keypoints)
        if len(descs) == 0:
            descs = desc
        else:
            descs = np.append(descs, desc, axis=0)
    train_x.append(descs)
    train_y.append(label)

# 建立学习模型
models = {}
for descs, label in zip(train_x, train_y):
    # n_components:隐藏状态数,covariance_type:协方差矩阵的类型
    model = hl.GaussianHMM(n_components=4, covariance_type="diag", n_iter=1000)
    models[label] = model.fit(descs)

test_objects = search_objects("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/objects/testing")
test_x, test_y, test_z = [], [], []
for label, filenames in test_objects.items():
    test_z.append([])
    descs = np.array([])
    for filename in filenames:
        image = cv.imread(filename)
        test_z[-1].append(image)
        gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
        # 尺寸统一等比例缩放
        h, w = gray.shape[:2]
        f = 200 / min(h, w)
        gray = cv.resize(gray, None, fx=f, fy=f)
        # gftt = cv.GFTTDetector_create()
        fast = cv.FastFeatureDetector_create()
        sift = cv.SIFT_create()
        # keypoints = gftt.detect(gray)
        keypoints = fast.detect(gray)
        _, desc = sift.compute(gray, keypoints)
        if len(descs) == 0:
            descs = desc
        else:
            descs = np.append(descs, desc, axis=0)
    test_x.append(descs)
    test_y.append(label)

pred_test_y = []
for descs in test_x:
    best_score, best_label = None, None
    for label, model in models.items():
        score = model.score(descs)
        if (best_score is None) or (best_score < score):
            best_score, best_label = score, label
    pred_test_y.append(best_label)
print(pred_test_y)
i = 0
for label, pred_label, images in zip(test_y, pred_test_y, test_z):
    for image in images:
        i += 1
        cv.imshow("{} - {} {} {}".format(i, label, "==" if label == pred_label else "!=", pred_label), image)
cv.waitKey()

十三、人脸识别

1. 视频捕捉

代码:vidcap.py

import numpy as np
import cv2 as cv

vc = cv.VideoCapture(0) # 0 - 视频捕捉设备的编号
while True:
    frame = vc.read()[1]
    cv.imshow("VideoCapture", frame)
    if cv.waitKey(33) == 27: # 27是esc键,33毫秒,30fps
        break
vc.release()
cv.destroyAllWindows() # 销毁隐藏的窗口,因为视频帧会有缓存

2. 人脸定位

基于哈尔级联分类器的人脸定位
代码:haar.py

import numpy as np
import cv2 as cv

fd = cv.CascadeClassifier("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/haar/face.xml")
vc = cv.VideoCapture(0)
while True:
    frame = vc.read()[1]
    faces = fd.detectMultiScale(frame, 1.3, 5) # 1.3为scaleFactor缩放步长,5为minNeighbors最少邻居数
    for l, t, w, h in faces:
        a, b = int(w / 2), int(h / 2)
        cv.ellipse(frame, (l+a, t+b), (a, b), 0, 0, 360, (255, 0, 255), 2) # ellipse画椭圆:(图像, 圆心坐标, 两半轴长, 旋转角度, 圆弧起始角, 圆弧终止角, 颜色, 线宽)
    cv.imshow("VideoCapture", frame)
    if cv.waitKey(33) == 27:
        break
vc.release()
cv.destroyAllWindows()

3. 人脸识别

基于OpenCV的局部二值模式直方图模型(LBPH)
代码:face.py

import os
import numpy as np
import cv2 as cv
import sklearn.preprocessing as sp

fd = cv.CascadeClassifier("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/haar/face.xml")

def search_faces(directory):
    directory = os.path.normpath(directory)
    if not os.path.isdir(directory):
        raise IOError("The directory '" + directory + "' doesn't exist!")
    faces = {}
    for curdir, subdir, files in os.walk(directory):
        for jpeg in (file for file in files if file.endswith(".jpg")):
            path = os.path.join(curdir, jpeg)
            label = path.split(os.path.sep)[-2]
            if label not in faces:
                faces[label] = []
            faces[label].append(path)
    return faces

train_faces = search_faces("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/faces/training")
codec = sp.LabelEncoder()
codec.fit(list(train_faces.keys())) # 只建立表,不拿值
train_x, train_y = [], []
for label, filenames in train_faces.items():
    for filename in filenames:
        image = cv.imread(filename)
        gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
        faces = fd.detectMultiScale(gray, 1.1, 2, minSize=(100, 200)) # minSize:小于该尺寸的区域不检测
        for l, t, w, h in faces:
            train_x.append(gray[t:t+h, l:l+w])
            train_y.append(int(codec.transform([label])[0]))
train_y = np.array(train_y)

# 局部二值模式直方图模型人脸识别器
model = cv.face.LBPHFaceRecognizer_create()
model.train(train_x, train_y)

test_faces = search_faces("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/faces/testing")
test_x, test_y, test_z = [], [], []
for label, filenames in test_faces.items():
    for filename in filenames:
        image = cv.imread(filename)
        gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
        faces = fd.detectMultiScale(gray, 1.1, 2, minSize=(100, 200)) # minSize:小于该尺寸的区域不检测
        for l, t, w, h in faces:
            test_x.append(gray[t:t+h, l:l+w])
            test_y.append(int(codec.transform([label])[0]))
            a, b = int(w / 2), int(h / 2)
            cv.ellipse(image, (l+a, t+b), (a, b), 0, 0, 360, (255, 0, 255), 2)
            test_z.append(image)
test_y = np.array(test_y)

pred_test_y = []
for face in test_x:
    pred_code = model.predict(face)[0]
    pred_test_y.append(pred_code)
escape = False
while not escape:
    for code, pred_code, image in zip(test_y, pred_test_y, test_z):
        label = codec.inverse_transform([code])[0]
        pred_label = codec.inverse_transform([pred_code])[0]
        text = "{} {} {}".format(label, "==" if label == pred_label else "!=", pred_label)
        cv.putText(image, text, (10, 60), cv.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 255), 6)
        cv.imshow("Recognizing...", image)
        if cv.waitKey(2000) == 27:
            escape = True
            break

十四、成分分析(CA)

1. 主成分分析(PCA)


代码:np.py

import numpy as np

# 原始样本
A = np.mat("3 2000; 2 3000; 4 5000; 5 8000; 1 2000", dtype=float)
print("A", A, sep="\n")
# 均值为0,极差为1,归一化缩放
mu = A.mean(axis=0)
s = A.max(axis=0) - A.min(axis=0)
X = (A - mu) / s
print("X", X, sep="\n")
# 协方差矩阵
SIGMA = X.T * X
print("SIGMA", SIGMA, sep="\n")
# 奇异值分解
U, S, V = np.linalg.svd(SIGMA)
print("U", U, sep="\n")
# 主成分特征矩阵
U_reduce = U[:, 0]
print("U_reduce", U_reduce, sep="\n")
# 降维样本
Z = X * U_reduce
print("Z", Z, sep="\n")
# 恢复到均值极差变换转换之后
X_approx = Z * U_reduce.T
print("X_approx", X_approx, sep="\n")
# 恢复到原始样本
A_approx = np.multiply(X_approx, s) + mu
print("A_approx", A_approx, sep="\n")

2. sklearn的PCA接口

N -> K(K<N)
import sklearn.decomposition as dc
model = dc.PCA(K)
pca_x = model.fit_transform(x)

model.fit(x) # U_reduce的创建
pca_x = model.transform(x) # Z的返回
ipca_x = model.inverse_transform(pca_x) # X_approx
model.explained_variance_ratio_.sum() -> 还原率,取值范围[0, 1]
【0 <--误差大-----误差小--> 1】
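
下面用一个最小示例直接演示上述接口的调用方式(样本沿用前面 np.py 中的矩阵,未做缩放,仅作示意):

import numpy as np
import sklearn.decomposition as dc

x = np.array([[3, 2000], [2, 3000], [4, 5000], [5, 8000], [1, 2000]], dtype=float)
model = dc.PCA(n_components=1)  # N维 -> 1维
pca_x = model.fit_transform(x)  # 降维样本 Z
ipca_x = model.inverse_transform(pca_x)  # 恢复的近似样本 X_approx
print(pca_x)
print(ipca_x)
print(model.explained_variance_ratio_.sum())  # 还原率,越接近1误差越小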
代码:sk.py

import numpy as np
import sklearn.decomposition as dc
import sklearn.preprocessing as sp
import sklearn.pipeline as pl

# 原始样本
A = np.mat("3 2000; 2 3000; 4 5000; 5 8000; 1 2000", dtype=float)
print("A", A, sep="\n")
# PCA模型
model = pl.Pipeline([
    ("MinMaxScaler", sp.MinMaxScaler()),
    ("PCA", dc.PCA(n_components=1))
])
# 降维样本
Z = model.fit_transform(A)
print("Z", Z, sep="\n")
# 恢复原始样本
A_approx = model.inverse_transform(Z)
print("A_approx", A_approx, sep="\n")

3. 主成分分析在人脸识别中的应用

代码:

# face1.py
import sklearn.datasets as sd
import matplotlib.pyplot as mp

faces = sd.fetch_olivetti_faces()
x = faces.data
y = faces.target

mp.figure("Olivetti Faces", facecolor="black")
mp.subplots_adjust(left=0.04, bottom=0, right=0.98, top=0.96, wspace=0, hspace=0) # 设置子图布局
rows, cols = 10, 40
for row in range(rows):
    for col in range(cols):
        mp.subplot(rows, cols, row * cols + col + 1)
        mp.title(str(col), fontsize=8, color="limegreen")
        if col == 0:
            mp.ylabel(str(row), fontsize=8, color="limegreen")
        mp.xticks(())
        mp.yticks(())
        image = x[y == col][row].reshape(64, 64)
        mp.imshow(image, cmap="gray")
mp.show()
# face2.py
import sklearn.datasets as sd
import matplotlib.pyplot as mp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm

faces = sd.fetch_olivetti_faces()
x = faces.data
y = faces.target

train_x, test_x, train_y, test_y = ms.train_test_split(x, y, test_size=0.2, random_state=7)
model = svm.SVC(class_weight="balanced")
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / len(pred_test_y))
print(sm.classification_report(test_y, pred_test_y))
cm = sm.confusion_matrix(test_y, pred_test_y)

mp.figure("Confusion Matrix", facecolor="lightgray")
mp.title("Confusion Matrix", fontsize=20)
mp.xlabel("Predicted Class", fontsize=14)
mp.ylabel("True Class", fontsize=14)
mp.tick_params(labelsize=10)
mp.imshow(cm, interpolation="nearest", cmap="gray")

mp.show()
# face3.py
import sklearn.datasets as sd
import matplotlib.pyplot as mp
import sklearn.decomposition as dc

faces = sd.fetch_olivetti_faces()
x = faces.data
y = faces.target
ncps = range(10, 410, 10)
evrs = []
for ncp in ncps:
    model = dc.PCA(n_components=ncp)
    model.fit_transform(x)
    # 还原率,越接近1误差越低
    evr = model.explained_variance_ratio_.sum()
    evrs.append(evr)

# 画还原率曲线
mp.figure("Explained Variance Ratio", facecolor="lightgray")
mp.title("Explained Variance Ratio", fontsize=20)
mp.xlabel("n_components", fontsize=14)
mp.ylabel("Explained Variance Ratio", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(ncps, evrs, c="dodgerblue", label="Explained Variance Ratio")
mp.legend()

mp.show()
# face4.py
import sklearn.datasets as sd
import matplotlib.pyplot as mp
import sklearn.decomposition as dc

faces = sd.fetch_olivetti_faces()
x = faces.data
y = faces.target

# 画还原后的图片
mp.figure("Explained Variance Ratio", facecolor="black")
mp.subplots_adjust(left=0.04, bottom=0, right=0.98, top=0.96, wspace=0, hspace=0)
rows, cols = 11, 40
for row in range(rows):
    if row > 0:
        ncp = 140 - (row - 1) * 15
        model = dc.PCA(n_components=ncp)
        model.fit(x)
    for col in range(cols):
        mp.subplot(rows, cols, row * cols + col + 1)
        mp.title(str(col), fontsize=8, color="limegreen")
        if col == 0:
            mp.ylabel(str(ncp) if row > 0 else "orig", fontsize=8, color="limegreen")
        mp.xticks(())
        mp.yticks(())
        if row > 0:
            pca_x = model.transform([x[y == col][0]]) # 参数必须二维
            ipca_x = model.inverse_transform(pca_x)
            image = ipca_x.reshape(64, 64)
        else:
            image = x[y == col][0].reshape(64, 64)
        mp.imshow(image, cmap="gray")

mp.show()
# face5.py
import sklearn.datasets as sd
import matplotlib.pyplot as mp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import sklearn.decomposition as dc

faces = sd.fetch_olivetti_faces()
x = faces.data
y = faces.target

model = dc.PCA(n_components=140)
pca_x = model.fit_transform(x)

train_x, test_x, train_y, test_y = ms.train_test_split(pca_x, y, test_size=0.2, random_state=7)
model = svm.SVC(class_weight="balanced")
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / len(pred_test_y))
print(sm.classification_report(test_y, pred_test_y))
cm = sm.confusion_matrix(test_y, pred_test_y)

mp.figure("Confusion Matrix", facecolor="lightgray")
mp.title("Confusion Matrix", fontsize=20)
mp.xlabel("Predicted Class", fontsize=14)
mp.ylabel("True Class", fontsize=14)
mp.tick_params(labelsize=10)
mp.imshow(cm, interpolation="nearest", cmap="gray")

mp.show()

4. 核主成分分析(KPCA)


对于在N维空间不可线性分割的样本,通过核函数升维到更高的维度空间,再通过主成分分析,在投射误差最小的前提下,降到n维空间,即寻找可线性分割的投影面,达到简化分类模型的目的
代码:kpca.py

import sklearn.datasets as sd
import sklearn.decomposition as dc
import matplotlib.pyplot as mp

x, y = sd.make_circles(n_samples=500, factor=0.2, noise=0.04)

model = dc.KernelPCA(kernel="rbf", gamma=10, fit_inverse_transform=True) # rbf径向基核(高斯);gamma控制核函数的宽度;fit_inverse_transform=True 支持逆变换还原
kpca_x = model.fit_transform(x)

mp.figure("Original", facecolor="lightgray")
mp.title("Original", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(x[:, 0], x[:, 1], s=60, c=y, cmap="brg", alpha=0.5)

mp.figure("KPCA", facecolor="lightgray")
mp.title("KPCA", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid()
mp.scatter(kpca_x[:, 0], kpca_x[:, 1], s=60, c=y, cmap="brg", alpha=0.5)

mp.show()

十五、神经网络

1. 神经元


权重:过滤输入的信息,针对不同的数据提高或者降低其作用和影响。[w1, w2, …, wn]
偏值:当没有任何输入时的输出。b
激活函数:将线性的连续的输入转换为非线性的离散的输出。sigmoid/tanh/relu…
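
下面用 numpy 写一个最小的单神经元前向计算示例(输入、权重、偏值均为假设值),对应上面权重、偏值、激活函数的结构:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # 激活函数:把线性结果压缩到(0, 1)

x = np.array([0.3, 0.2])  # 一个样本的两个输入特征(假设值)
w = np.array([0.8, -0.5])  # 权重:放大或抑制各输入的影响(假设值)
b = 0.1  # 偏值:没有任何输入时的基础输出(假设值)
y = sigmoid(x @ w + b)  # 神经元输出
print(y)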

2. 层


每一层可以由一到多个神经元组成,层中的神经元接收上一层的输出,并为下一层提供输入。数据只能在相邻层之间传递,不能跨层传输。

3. 多层神经网络

输入层:接收输入样本的各个特征,传递给第一个隐藏层,本身不对数据进行运算。
隐藏层:0到多个,通过权重、偏值和激活函数,对所接收到的来自上一层的数据进行运算:$O = f(I \times W + b)$
输出层:功能和隐藏层相同,将计算的结果作为输出的每一个特征。如果隐藏层的层数多于一层,则可以称为深度神经网络;通常使用的深度神经网络其隐藏层数可以多达数十甚至上百层,基于这样结构的学习模型被称为深度学习。
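
按上面的公式 $O = f(I \times W + b)$,下面给出一个两层(隐藏层+输出层)前向传播的示意代码(权重和偏值均为随机假设值,仅演示逐层计算,不涉及训练):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(7)
I = np.random.rand(4, 2)  # 4个样本,每个样本2个输入特征
W1, b1 = np.random.rand(2, 10), np.random.rand(10)  # 输入层 -> 隐藏层(10个神经元)
W2, b2 = np.random.rand(10, 1), np.random.rand(1)  # 隐藏层 -> 输出层(1个输出)
H = sigmoid(I @ W1 + b1)  # 隐藏层输出:O = f(I x W + b)
O = sigmoid(H @ W2 + b2)  # 输出层输出
print(O.shape)  # (4, 1):每个样本一个输出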

4. 最简单的神经网络:感知器

只由输入层和输出层组成的神经网络。
代码:neuron.py

import numpy as np
import neurolab as nl
import matplotlib.pyplot as mp

x = np.array([
    [0.3, 0.2],
    [0.1, 0.4],
    [0.4, 0.6],
    [0.9, 0.5]
])
y = np.array([
    [0],
    [0],
    [0],
    [1]
])

# nl.net.newp(输入范围, 输出个数)
model = nl.net.newp([[0, 1], [0, 1]], 1)
# epochs最大修正批次, show做几个批次就显示修正误差, lr学习率
# error 数组,记录每一个批次的误差
error = model.train(x, y, epochs=50, show=1, lr=0.01)

mp.figure("Neuron", facecolor="lightgray")
mp.title("Neuron", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(x[:, 0], x[:, 1], c=y.ravel(), cmap="brg", label="Training")
mp.legend()

mp.figure("Training Process")
mp.title("Training Process", fontsize=20)
mp.xlabel("Epoch", fontsize=14)
mp.ylabel("Error", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(error, "o-", c="orangered", label="Error")
mp.legend()

mp.show()

5. 单层多输出神经网络

代码:mono.py

import numpy as np
import neurolab as nl
import matplotlib.pyplot as mp

data = np.loadtxt("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/mono.txt") # 默认分隔符为空格
train_x, train_y = data[:, :2], data[:, 2:]
train_labels = []
for train_row in train_y:
    train_row = train_row.astype(int).astype(str)
    train_labels.append(".".join(train_row))
label_set = np.unique(train_labels)
train_codes = []
for train_label in train_labels:
    train_code = np.where(label_set == train_label)[0][0]
    train_codes.append(train_code)
train_codes = np.array(train_codes)

model = nl.net.newp([[train_x[:, 0].min(), train_x[:, 0].max()], [train_x[:, 1].min(), train_x[:, 1].max()]], 2)
error = model.train(train_x, train_y, epochs=10, show=1, lr=0.01)
test_x = np.array([
    [0.3, 4.5],
    [4.5, 0.5],
    [4.3, 8.0],
    [6.5, 3.5]
])
pred_test_y = model.sim(test_x)
pred_test_labels = []
for pred_test_row in pred_test_y:
    pred_test_row = pred_test_row.astype(int).astype(str)
    pred_test_labels.append(".".join(pred_test_row))
pred_test_codes = []
for pred_test_label in pred_test_labels:
    pred_test_code = np.where(label_set == pred_test_label)[0][0]
    pred_test_codes.append(pred_test_code)
pred_test_codes = np.array(pred_test_codes)

mp.figure("MonoLayer Neural Network", facecolor="lightgray")
mp.title("MonoLayer Neural Network", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(train_x[:, 0], train_x[:, 1], c=train_codes, cmap="brg", s=60, label="Training")
mp.scatter(test_x[:, 0], test_x[:, 1], c=pred_test_codes, cmap="brg", s=60, label="Testing", marker="^")
mp.legend()

mp.figure("Training Process", facecolor="lightgray")
mp.title("Training Process", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(error, "o-", c="orangered", label="Error")
mp.legend()

mp.show()

6. 深度(两个隐藏层)神经网络

代码:deep.py

import numpy as np
import neurolab as nl
import matplotlib.pyplot as mp

train_x = np.linspace(-10, 10, 100)
train_y = 2 * np.square(train_x) + 7
train_y /= np.linalg.norm(train_y)
train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)
# nl.net.newff([输入数据值域], [隐藏层神经元数, ..., 输出层输出个数])
model = nl.net.newff([[train_x.min(), train_x.max()]], [10, 10, 1])
# 训练函数, 梯度下降法
model.trainf = nl.train.train_gd
error = model.train(train_x, train_y, epochs=800, show=20, goal=0.01) # goal目标误差
test_x = np.linspace(-10, 10, 1000)
test_x = test_x.reshape(-1, 1)
pred_test_y = model.sim(test_x)

mp.figure("Deep Neural Network", facecolor="lightgray")
mp.title("Deep Neural Network", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(train_x, train_y, c="dodgerblue", label="Training")
mp.plot(test_x, pred_test_y, c="limegreen", label="Testing")
mp.legend()

mp.figure("Training Process", facecolor="lightgray")
mp.title("Training Process", fontsize=20)
mp.xlabel("x", fontsize=14)
mp.ylabel("y", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(error, c="orangered", label="Error")
mp.legend()

mp.show()

7. OCR识别

代码:ocrdb.py、ocr.py

import numpy as np
import cv2 as cv

with open("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/ocrdb.dat", "r") as f:
    for line in f.readlines():
        items = line.split("\t")
        char, image = items[1], items[6:-1]
        image = np.array(image, dtype=np.uint8)
        image *= 255
        image = image.reshape(16, 8)
        image = cv.resize(image, None, fx=25, fy=25) # 放大图片
        cv.imshow(char, image)
        if cv.waitKey(100) == 27:
            break
import numpy as np
import neurolab as nl
import matplotlib.pyplot as mp

charset = "omandig"
x, y = [], []
with open("C:/Users/Hasee/Desktop/blog_data-master/machine_learning_date/ocrdb.dat", "r") as f:
    for line in f.readlines():
        items = line.split("\t")
        char, image = items[1], items[6:-1]
        if char in charset:
            code = np.zeros(len(charset), dtype=int)
            code[charset.index(char)] = 1
            y.append(code)
            x.append(np.array(image, dtype=int))
            if len(x) >= 30:
                break
x = np.array(x)
y = np.array(y)

train_size = int(len(x) * 0.8)
train_x, test_x = x[:train_size], x[train_size:]
train_y, test_y = y[:train_size], y[train_size:]
input_ranges = []
for _ in x.T:
    input_ranges.append([0, 1])

model = nl.net.newff(input_ranges, [128, 16, y.shape[1]])
model.trainf = nl.train.train_gd
error = model.train(train_x, train_y, epochs=10000, show=100, goal=0.01)

pred_test_y = model.sim(test_x)
def decode(codes):
    return "".join(charset[code.argmax()] for code in codes)
true_string = decode(test_y)
pred_string = decode(pred_test_y)
print(true_string, "->", pred_string)
axes = mp.subplots(1, len(test_x), num="OCR", facecolor="lightgray")[1]
for ax, char_image, true_char, pred_char in zip(axes, test_x, true_string, pred_string):
    ax.matshow(char_image.reshape(16, 8), cmap="brg")
    ax.set_title("{}{}{}".format(true_char, "==" if true_char == pred_char else "!=", pred_char), fontsize=16)
    ax.set_xticks(())
    ax.set_yticks(())

mp.figure("Training Process")
mp.title("Training Process", fontsize=20)
mp.xlabel("Epoch", fontsize=14)
mp.ylabel("Error", fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(error, c="orangered", label="Error")
mp.legend()

mp.show()

十六、推荐书目

基础:
scikit-learn机器学习:常用算法原理和编程实践,黄永昌主编,机械工业出版社
机器学习算法原理与编程实践,郑洁,电子工业出版社
进阶:
深度学习,张鹏主编,电子工业出版社
TensorFlow机器学习项目实战,姚鹏鹏译,人民邮电出版社
休闲:
数学之美,吴军,人民邮电出版社
终极算法,黄芳平译,中国电信出版社
深度学习,伊恩古德弗洛著,人民邮电出版社
