Lecture16 Recommender Systems
Problem formulation
Example:Predicting movie ratings
User rates movies using zero to five stars
\(n_u\) = no. users
\(n_m\) = no.movies
\(r(i,j)\) = 1 if user \(j\) has rated movie \(i\)
\(y(i,j)\) = rating given by user \(j\) to movie \(i\) (defined only if \(r(i,j) = 1\))
Content-based recommendations
Content-based recommender systems
For each user \(j\) ,learn a parameter \(\theta^{(j)}\in\mathbb{R}^3\).Predict user \(j\) as rating movies \((\theta^{(j)})^Tx^{(i)}\)
\(\theta^{(j)}\) = parameter vector for user \(j\) (\(\theta^{(j)}\in\mathbb{R}^{n+1}\))
\(x^{(i)}\) = feature vector for movie \(i\)
For user \(j\) , movie \(i\) ,predicted rating : \((\theta^{(j)})^Tx^{(i)}\)
\(m^{(j)}\) = no. of movies rated by user \(j\)
上图样本中的电影分为两种,前三部是爱情电影,后两部为动作电影。假设电影有两个特征值,第三个电影的特征值为\(x^{(3)}=\left[\begin{matrix}1\\0.9\\0\end{matrix}\right]\),其中\(x^{(3)}_0\)为偏置,0.9表明该电影偏向爱情电影。用户1对两部爱情电影打分5,对两部动作电影打分0,通过学习得到的用户1的\(\theta^{(3)}=\left[\begin{matrix}0\\5\\0\end{matrix}\right]\)。\((\theta^{(1)})^Tx^{(3)}=5*0.99=4.95\)。所以预测用户1给第三部电影打分为4.95分
Optimization objective
To learn \(\theta^{(j)}\) (parameter for user \(j\)):
\[\underset{\theta^{(j)}}{min}\frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(\theta^{(j)}_k)^2 \]To learn \(\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}\)
\[\underset{\theta^{(1)},\dots,\theta^{(n_u)}}{min}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2 \]Gradient descent update:
\[\theta_k^{(j)}=\theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}\ \ \ \ (for\ \ k=0) \] \[\theta_k^{(j)}=\theta_k^{(j)} - \alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)})\ \ \ \ (for\ \ k\neq0) \]Collaborative filtering
协同过滤。该算法有一个有趣的特征,叫做特征学习,即这种算法能够自行学习所需要使用的特征。
problem motivation
对于该数据集,我们不知道每个电影的特征值,但是知道每个用户的喜好,即图中的\(\theta\)表示。我们可以由此推测出每个电影的特征值。
\[(\theta^{(1)})^Tx^{(1)} \approx 5 \\ (\theta^{(2)})^Tx^{(1)} \approx 5 \\ (\theta^{(3)})^Tx^{(1)} \approx 0 \\ (\theta^{(4)})^Tx^{(1)} \approx 0 \]由此可知\(x^{(1)}=\left[\begin{matrix}1\\1\\0\end{matrix}\right]\)。其中\(x_0^{(1)}\)是截距项。
Optimization algorithm
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\) , to learn \(x^{(i)}\)
\[\underset{x^{(i)}}{min}\frac{1}{2}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(x^{(i)}_k)^2 \]Given \(\theta^{(1)},\dots,\theta^{(n_u)}\) , to learn \(x^{(1)},\dots,x^{(n_m)}\)
\[\underset{x^{(i)},\dots,x^{(n_m)}}{min}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2 \]Collaborative filtering
Given \(x^{(1)},\dots,x^{(n_m)}\) (and movies ratings),can estimate \(\theta^{(1)},\dots,\theta^{(n_m)}\)
Given \(\theta^{(1)},\dots,\theta^{(n_m)}\) (and movies ratings),can estimate \(x^{(1)},\dots,x^{(n_m)}\)
随机猜一些\(\theta\),学习出不同电影的特征,再用通过特征来获得一个更好的对参数\(\theta\)的估计,再通过\(\theta\)来获得更好的特征,不断进行迭代,最后收敛到合理的电影特征和合理的对不同用户的参数的估计。
这是基础的协同过滤算法。对于该问题,对于推荐系统来说,该算法得以实现的基础是每个用户都对数个电影进行了评价,并且每个电影都被数位用户评价过,才能重复这个迭代过程来估计 \(x\) 和 \(\theta\)。
当你运行算法时,需要观察大量的用户,观察这些用户的实际动作来协同获得更佳的每个人对电影的评分。
每个用户都对电影做出了评价,每个用户都在帮助算法学习出更合适的特征。学习出来的特征又可以被用来更好地预测其他用户的评分。这是协同的另一个意思,是说每个用户都在帮助算法更好地进行特征学习。
Collaborative filtering algorithm
Collaborative filtering optimization objective
Minimizing \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously
\[J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})=\frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2 \]Collaborative filtering algorithm
-
Initialize \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values
-
Minimize \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) using gradient descent (or an advanced optimization algorithm) E.g. for every \(j = 1,\dots,n_u,i = 1,\dots,n_m\)
\[x_k^{(j)}=x_k^{(j)} - \alpha(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(j)})\\ \theta_k^{(j)}=\theta_k^{(j)} - \alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}) \] -
for a user with parameters \(\theta\) and movie with (learned) features \(x\) ,predict a star rating of \(\theta^Tx\)
Vectorization:Low rank matrix factorization
Collaborative filtering
用户对电影的评分用矩阵Y进行表示。
预测评分矩阵:
\[X\Theta^T=\left[ \begin{matrix} --(X^{(1)})^T-- \\ --(X^{(2)})^T-- \\ \vdots \\ --(X^{(n_m)})^T-- \\ \end{matrix} \right] \left[ \begin{matrix} --(\theta^{(1)})^T-- \\ --(\theta^{(2)})^T-- \\ \vdots \\ --(\theta^{(n_u)})^T-- \\ \end{matrix} \right]^T =\left[ \begin{matrix} (\theta^{(1)})^Tx^{(1)} & (\theta^{(2)})^Tx^{(1)} & \cdots & (\theta^{(n_u)})^Tx^{(1)} \\ (\theta^{(1)})^Tx^{(2)} & (\theta^{(2)})^Tx^{(2)} & \cdots & (\theta^{(n_u)})^Tx^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{(1)})^Tx^{(n_m)} & (\theta^{(2)})^Tx^{(n_m)} & \cdots & (\theta^{(n_u)})^Tx^{(n_m)} \end{matrix} \right] \]finding related movies
For each product \(i\),we learn a feature vector \(x^{(i)} \in \mathbb{R}^n\)
How to find movies related to movie \(i\)?
small \(\left \|x^{(i)} - x^{(j)} \right \|\) means that movie \(j\) and movie \(i\) are "similar"
5 most similar movies to movie \(i\) : Find the 5 movies j with the smallest \(\left \|x^{(i)} - x^{(j)} \right \|\)
Implementational detail : Mean normalization
Users who have not rated any movies
学习5号用户Eve的参数向量\(\theta^{(5)}\)。用户Eve没有评价过任何电影,所以只有\(\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2\)这一项对损失函数有影响。为了最小化损失函数,\(\theta^{(5)}=\left[\begin{matrix}0\\0\\0\end{matrix}\right]\)。所以\((\theta^{(5)})^Tx^{(1)} = 0\),即Eve对每个电影打的分数都是0分。这种结果是存在问题的。
Mean normalization
计算出每部电影的平均值,Y减去均值。用新的Y来训练\(\theta^{(i)},x^{(i)}\)。因为用户5仍然没有评价过任何电影,所以问题仍然存在
对于用户 \(j\) 评价电影 \(i\) 时,预测值再加上均值 \(u_i\)。即使预测值为0,得到就是电影的平均分。