Principal Component Analysis (PCA)
An unsupervised learning problem
Use: one of the most commonly used techniques for dimensionality reduction
Goal: extract the most valuable information (as measured by variance)
Question: what do the data mean after dimensionality reduction?
Representing vectors
Inner product: $\left(a_{1}, a_{2}, \cdots, a_{n}\right)^{\top} \cdot\left(b_{1}, b_{2}, \cdots, b_{n}\right)^{\top}=a_{1} b_{1}+a_{2} b_{2}+\cdots+a_{n} b_{n}$
Geometric interpretation: $A \cdot B=|A||B| \cos (a)$, where $a$ is the angle between $A$ and $B$
If the vector $B$ has norm 1, the inner product of $A$ and $B$ equals the signed length of the projection of $A$ onto the line spanned by $B$
A vector written as $(3, 2)$ actually stands for the linear combination $x(1,0)^{\top}+y(0,1)^{\top}$ with $x=3$ and $y=2$
Basis: $(1,0)$ and $(0,1)$ are called a basis of two-dimensional space
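A minimal NumPy sketch (illustrative, not from the original notes) of the inner product and of projecting onto a unit vector:

import numpy as np

a = np.array([3.0, 2.0])
b = np.array([1.0, 1.0])
b_unit = b / np.linalg.norm(b)  # make B a unit vector

print(a @ b)       # inner product: 3*1 + 2*1 = 5
print(a @ b_unit)  # projection length of a onto the line of b: 5/sqrt(2) ≈ 3.54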
Change of basis
The basis vectors are chosen orthogonal (inner product 0, i.e., mutually perpendicular); the minimal requirement is only that they be linearly independent
Transformation: take the inner product of the data with the first basis vector and use the result as the first new coordinate, then take the inner product with the second basis vector and use the result as the second new coordinate
Mapping the data point $(3, 2)$ into this basis: $\left(\begin{array}{cc} 1 / \sqrt{2} & 1 / \sqrt{2} \\ -1 / \sqrt{2} & 1 / \sqrt{2} \end{array}\right)\left(\begin{array}{l} 3 \\ 2 \end{array}\right)=\left(\begin{array}{c} 5 / \sqrt{2} \\ -1 / \sqrt{2} \end{array}\right)$
In general: $\left(\begin{array}{c} p_{1} \\ p_{2} \\ \vdots \\ p_{R} \end{array}\right)\left(\begin{array}{llll} a_{1} & a_{2} & \cdots & a_{M} \end{array}\right)=\left(\begin{array}{cccc} p_{1} a_{1} & p_{1} a_{2} & \cdots & p_{1} a_{M} \\ p_{2} a_{1} & p_{2} a_{2} & \cdots & p_{2} a_{M} \\ \vdots & \vdots & \ddots & \vdots \\ p_{R} a_{1} & p_{R} a_{2} & \cdots & p_{R} a_{M} \end{array}\right)$, where each $p_i$ is a row vector (a basis vector) and each $a_j$ is a column vector (a data point)
The meaning of this matrix product: every column vector of the right-hand matrix is transformed into the space whose basis is given by the row vectors of the left-hand matrix
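A short NumPy sketch (illustrative, not from the original notes) reproducing the change of basis for $(3, 2)$:

import numpy as np

# rows of P are the new orthonormal basis vectors
P = np.array([[ 1/np.sqrt(2), 1/np.sqrt(2)],
              [-1/np.sqrt(2), 1/np.sqrt(2)]])
x = np.array([3.0, 2.0])

print(P @ x)  # [ 3.5355 -0.7071] = (5/sqrt(2), -1/sqrt(2))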
Covariance matrix
Direction: how should the direction (that is, the basis) be chosen so that as much of the original information as possible is retained? An intuitive criterion: the projected values should be as spread out as possible
Variance: $\operatorname{Var}(a)=\frac{1}{m} \sum_{i=1}^{m}\left(a_{i}-\mu\right)^{2}$
Find a one-dimensional basis such that, after all data are expressed as coordinates on this basis, the variance is maximized
Covariance (assuming zero means): $\operatorname{Cov}(a, b)=\frac{1}{m} \sum_{i=1}^{m} a_{i} b_{i}$
If we simply kept choosing the direction of largest variance, each subsequent direction would nearly coincide with the first one
Solution: for two fields to carry as much of the original information as possible, we do not want any (linear) correlation between them
Covariance: the correlation between two fields can be measured by their covariance $\operatorname{Cov}(a, b)=\frac{1}{m} \sum_{i=1}^{m} a_{i} b_{i}$
When the covariance is 0, the two fields are linearly uncorrelated. To make the covariance 0, the second basis vector can only be chosen in a direction orthogonal to the first, so the two directions finally chosen must be orthogonal
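A small NumPy sketch (illustrative, using the same toy data as the worked example below) computing variance and covariance for two zero-mean fields:

import numpy as np

a = np.array([-1.0, -1.0, 0.0, 2.0, 0.0])  # already zero-mean
b = np.array([-2.0,  0.0, 0.0, 1.0, 1.0])  # already zero-mean
m = len(a)

print(np.sum(a * a) / m)  # Var(a)   = 6/5
print(np.sum(a * b) / m)  # Cov(a,b) = 4/5 (nonzero, so the fields are correlated)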
Optimization objective
To reduce a set of N-dimensional vectors to K dimensions (0 < K < N), the goal is to choose K orthonormal basis vectors such that, after the original data are transformed onto this basis, the pairwise covariances between fields are all 0 and the variance of each field is as large as possible
Covariance matrix: with $X=\left(\begin{array}{cccc} a_{1} & a_{2} & \cdots & a_{m} \\ b_{1} & b_{2} & \cdots & b_{m} \end{array}\right)$, we get $\frac{1}{m} X X^{\top}=\left(\begin{array}{cc} \frac{1}{m} \sum_{i=1}^{m} a_{i}^{2} & \frac{1}{m} \sum_{i=1}^{m} a_{i} b_{i} \\ \frac{1}{m} \sum_{i=1}^{m} a_{i} b_{i} & \frac{1}{m} \sum_{i=1}^{m} b_{i}^{2} \end{array}\right)$
The two diagonal entries of this matrix are the variances of the two fields, and the off-diagonal entries are the covariance of a and b
Diagonalizing the covariance matrix: set all off-diagonal entries to 0 and arrange the diagonal entries from largest to smallest, top to bottom
Diagonalization of the covariance matrix: $P C P^{\top}=\Lambda=\left(\begin{array}{llll} \lambda_{1} & & & \\ & \lambda_{2} & & \\ & & \ddots & \\ & & & \lambda_{n} \end{array}\right)$
Real symmetric matrix: an $n \times n$ real symmetric matrix always has $n$ orthonormal eigenvectors
$E=\left(\begin{array}{llll} e_{1} & e_{2} & \cdots & e_{n} \end{array}\right)$
A real symmetric matrix can therefore be diagonalized: $E^{\top} C E=\Lambda=\left(\begin{array}{llll} \lambda_{1} & & & \\ & \lambda_{2} & & \\ & & \ddots & \\ & & & \lambda_{n} \end{array}\right)$
Sort the eigenvectors from top to bottom by decreasing eigenvalue; multiplying the matrix formed by the first K rows with the original data matrix X gives the reduced data matrix Y we need (a generic sketch of this recipe follows)
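A compact NumPy sketch (a generic recipe following the derivation above; not the original notes' code) that centers the data, builds $\frac{1}{m} X X^{\top}$, diagonalizes it, and projects onto the top-K eigenvectors:

import numpy as np

def pca(X, k):
    # X: n_features x m_samples, one field per row; returns k x m_samples
    X = X - X.mean(axis=1, keepdims=True)  # zero-mean each field
    C = X @ X.T / X.shape[1]               # covariance matrix (1/m) X X^T
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh handles real symmetric matrices
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues from high to low
    P = eigvecs[:, order[:k]].T            # top-k eigenvectors as rows
    return P @ X                           # Y = P X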
PCA worked example
Data: $\left(\begin{array}{ccccc} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{array}\right)$ (each row already has zero mean)
Covariance matrix: $C=\frac{1}{5}\left(\begin{array}{ccccc} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{array}\right)\left(\begin{array}{cc} -1 & -2 \\ -1 & 0 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1 \end{array}\right)=\left(\begin{array}{cc} \frac{6}{5} & \frac{4}{5} \\ \frac{4}{5} & \frac{6}{5} \end{array}\right)$
Eigenvalues: $\lambda_{1}=2$, $\lambda_{2}=2 / 5$; eigenvectors: $c_{1}\left(\begin{array}{l} 1 \\ 1 \end{array}\right)$, $c_{2}\left(\begin{array}{c} -1 \\ 1 \end{array}\right)$
Diagonalization: $P C P^{\top}=\left(\begin{array}{cc} 1 / \sqrt{2} & 1 / \sqrt{2} \\ -1 / \sqrt{2} & 1 / \sqrt{2} \end{array}\right)\left(\begin{array}{cc} 6 / 5 & 4 / 5 \\ 4 / 5 & 6 / 5 \end{array}\right)\left(\begin{array}{cc} 1 / \sqrt{2} & -1 / \sqrt{2} \\ 1 / \sqrt{2} & 1 / \sqrt{2} \end{array}\right)=\left(\begin{array}{cc} 2 & 0 \\ 0 & 2 / 5 \end{array}\right)$
Dimensionality reduction: $Y=\left(\begin{array}{ll} 1 / \sqrt{2} & 1 / \sqrt{2} \end{array}\right)\left(\begin{array}{ccccc} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{array}\right)=\left(\begin{array}{lllll} -3 / \sqrt{2} & -1 / \sqrt{2} & 0 & 3 / \sqrt{2} & 1 / \sqrt{2} \end{array}\right)$
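As a quick numerical check (not part of the original notes), the same result, up to the overall sign of the eigenvector, comes out of a few lines of NumPy:

import numpy as np

X = np.array([[-1, -1, 0, 2, 0],
              [-2,  0, 0, 1, 1]], dtype=float)
C = X @ X.T / X.shape[1]              # [[6/5, 4/5], [4/5, 6/5]]
eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues: [2/5, 2]
p1 = eigvecs[:, -1]                   # eigenvector for lambda = 2, i.e. (1, 1)/sqrt(2) up to sign
print(p1 @ X)                         # [-2.121 -0.707  0.     2.121  0.707] = Y (up to sign)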
PCA case study
import numpy as np
import pandas as pd
df = pd.read_csv('iris.data', header=None)  # the raw file has no header row
df.head()
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.head()
# split data table into data X and class labels y
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
from matplotlib import pyplot as plt
import math
label_dict = {1: 'Iris-Setosa',
              2: 'Iris-Versicolor',
              3: 'Iris-Virginica'}
feature_dict = {0: 'sepal length [cm]',
                1: 'sepal width [cm]',
                2: 'petal length [cm]',
                3: 'petal width [cm]'}
plt.figure(figsize=(8, 6))
for cnt in range(4):
    plt.subplot(2, 2, cnt+1)
    for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
        plt.hist(X[y==lab, cnt],
                 label=lab,
                 bins=10,
                 alpha=0.3)
    plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)
plt.tight_layout()
plt.show()
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
print (X_std)
# compute the covariance matrix of the standardized data
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)
Covariance matrix
[[ 1.00675676 -0.10448539 0.87716999 0.82249094]
[-0.10448539 1.00675676 -0.41802325 -0.35310295]
[ 0.87716999 -0.41802325 1.00675676 0.96881642]
[ 0.82249094 -0.35310295 0.96881642 1.00675676]]
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))
NumPy covariance matrix:
[[ 1.00675676 -0.10448539 0.87716999 0.82249094]
[-0.10448539 1.00675676 -0.41802325 -0.35310295]
[ 0.87716999 -0.41802325 1.00675676 0.96881642]
[ 0.82249094 -0.35310295 0.96881642 1.00675676]]
# eigendecomposition: eigenvalues and eigenvectors of the covariance matrix
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
Eigenvectors
[[ 0.52308496 -0.36956962 -0.72154279 0.26301409]
[-0.25956935 -0.92681168 0.2411952 -0.12437342]
[ 0.58184289 -0.01912775 0.13962963 -0.80099722]
[ 0.56609604 -0.06381646 0.63380158 0.52321917]]
Eigenvalues
[2.92442837 0.93215233 0.14946373 0.02098259]
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
print (eig_pairs)
print ('----------')
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
[(2.9244283691111135, array([ 0.52308496, -0.25956935, 0.58184289, 0.56609604])), (0.9321523302535066, array([-0.36956962, -0.92681168, -0.01912775, -0.06381646])), (0.1494637348981336, array([-0.72154279, 0.2411952 , 0.13962963, 0.63380158])), (0.02098259276427038, array([ 0.26301409, -0.12437342, -0.80099722, 0.52321917]))]
----------
Eigenvalues in descending order:
2.9244283691111135
0.9321523302535066
0.1494637348981336
0.02098259276427038
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
print (var_exp)
cum_var_exp = np.cumsum(var_exp)
# each eigenvalue reflects how important (how much variance) its component carries
cum_var_exp
[72.6200333269203, 23.14740685864414, 3.7115155645845284, 0.5210442498510098]
array([ 72.62003333, 95.76744019, 99.47895575, 100. ])
a = np.array([1,2,3,4])
print (a)
print ('-----------')
# cumsum: each entry is the running total of all entries up to and including it
print (np.cumsum(a))
[1 2 3 4]
-----------
[ 1 3 6 10]
plt.figure(figsize=(6, 4))
plt.bar(range(4), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
# take the eigenvectors of the two largest eigenvalues as the projection matrix W
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', matrix_w)
Matrix W:
[[ 0.52308496 -0.36956962]
[-0.25956935 -0.92681168]
[ 0.58184289 -0.01912775]
[ 0.56609604 -0.06381646]]
# project the standardized data onto the new dimensions
Y = X_std.dot(matrix_w)
Y
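As a sanity check (not part of the original walkthrough), the same projection can be obtained with scikit-learn; the component signs may be flipped relative to the manual result:

from sklearn.decomposition import PCA

sklearn_pca = PCA(n_components=2)             # assumes X_std from above
Y_sklearn = sklearn_pca.fit_transform(X_std)
print(sklearn_pca.explained_variance_ratio_)  # roughly [0.73, 0.23]
print(Y_sklearn[:5])                          # matches Y up to per-component sign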
# without the PCA transform (plotting just the first two raw features), the classes overlap heavily
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(X[y==lab, 0],
                X[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('sepal_len')
plt.ylabel('sepal_wid')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
# after the PCA transform
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()
plt.show()