Lecture 16: Recommender Systems

Problem formulation

Example: Predicting movie ratings

Users rate movies using zero to five stars

\(n_u\) = number of users

\(n_m\) = number of movies

\(r(i,j)\) = 1 if user \(j\) has rated movie \(i\)

\(y^{(i,j)}\) = rating given by user \(j\) to movie \(i\) (defined only if \(r(i,j) = 1\))

Content-based recommendations

Content-based recommender systems

For each user \(j\), learn a parameter vector \(\theta^{(j)}\in\mathbb{R}^3\) (two movie features plus a bias term in this example). Predict that user \(j\) rates movie \(i\) with \((\theta^{(j)})^Tx^{(i)}\) stars.

\(\theta^{(j)}\) = parameter vector for user \(j\) (\(\theta^{(j)}\in\mathbb{R}^{n+1}\))
\(x^{(i)}\) = feature vector for movie \(i\)

For user \(j\), movie \(i\), predicted rating: \((\theta^{(j)})^Tx^{(i)}\)

\(m^{(j)}\) = no. of movies rated by user \(j\)

In the example, the movies fall into two kinds: the first three are romance films and the last two are action films. Suppose each movie has two feature values, and the third movie's feature vector is \(x^{(3)}=\left[\begin{matrix}1\\0.99\\0\end{matrix}\right]\), where \(x^{(3)}_0\) is the bias term and 0.99 indicates that the movie leans strongly toward romance. User 1 gave the romance films 5 stars and the action films 0 stars, so the parameter vector learned for user 1 is \(\theta^{(1)}=\left[\begin{matrix}0\\5\\0\end{matrix}\right]\). Then \((\theta^{(1)})^Tx^{(3)}=5\times0.99=4.95\), so we predict that user 1 would rate the third movie 4.95 stars.
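A quick numeric check of this prediction, sketched in NumPy (the vectors are the ones from the example above; variable names are my own):

```python
import numpy as np

# Feature vector of movie 3: [bias, romance, action] (values from the example).
x3 = np.array([1.0, 0.99, 0.0])
# Parameter vector learned for user 1.
theta1 = np.array([0.0, 5.0, 0.0])

# Predicted rating: (theta^(1))^T x^(3)
prediction = theta1 @ x3
print(prediction)  # 4.95
```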

Optimization objective

To learn \(\theta^{(j)}\) (parameter for user \(j\)):

\[\underset{\theta^{(j)}}{\min}\ \frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(\theta^{(j)}_k)^2 \]

To learn \(\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}\):

\[\underset{\theta^{(1)},\dots,\theta^{(n_u)}}{\min}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2 \]

Gradient descent update:

\[\theta_k^{(j)}=\theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}\quad (\text{for } k=0) \]

\[\theta_k^{(j)}=\theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}\right)\quad (\text{for } k\neq0) \]
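These two update rules can be sketched in NumPy as follows. This is only an illustration under assumed conventions (movies as rows of `X`, a boolean mask `rated` marking the movies with \(r(i,j)=1\)), not code from the lecture:

```python
import numpy as np

def update_user_theta(theta_j, X, y_j, rated, alpha, lam):
    """One gradient-descent step for a single user's parameters theta^(j).

    theta_j : (n+1,) current parameters; theta_j[0] is the bias term (k = 0)
    X       : (n_m, n+1) movie feature matrix, one row per movie, with x_0 = 1
    y_j     : (n_m,) ratings given by user j
    rated   : (n_m,) boolean mask, rated[i] == True iff r(i, j) = 1
    """
    err = X[rated] @ theta_j - y_j[rated]   # prediction errors on rated movies
    grad = X[rated].T @ err                 # sum of error * x_k over rated movies
    grad[1:] += lam * theta_j[1:]           # regularize every k except k = 0
    return theta_j - alpha * grad
```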

Collaborative filtering

Collaborative filtering. An interesting property of this algorithm is feature learning: the algorithm can learn, on its own, which features to use.

Problem motivation

For this data set we do not know each movie's feature values, but we do know each user's preferences, i.e. their parameter vectors \(\theta^{(j)}\). From these we can infer each movie's feature values.

\[(\theta^{(1)})^Tx^{(1)} \approx 5 \\ (\theta^{(2)})^Tx^{(1)} \approx 5 \\ (\theta^{(3)})^Tx^{(1)} \approx 0 \\ (\theta^{(4)})^Tx^{(1)} \approx 0 \]

From this we can infer \(x^{(1)}=\left[\begin{matrix}1\\1\\0\end{matrix}\right]\), where \(x_0^{(1)}\) is the intercept term.
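One way to check this numerically is to treat each user's rating of movie 1 as a linear equation in \(x^{(1)}\) and solve by least squares. The sketch below is my own illustration, not part of the lecture:

```python
import numpy as np

# Parameter vectors of the four users from the example, one per row.
Theta = np.array([[0.0, 5.0, 0.0],
                  [0.0, 5.0, 0.0],
                  [0.0, 0.0, 5.0],
                  [0.0, 0.0, 5.0]])
y1 = np.array([5.0, 5.0, 0.0, 0.0])  # ratings the four users gave movie 1

# Solve (theta^(j))^T x^(1) ~= y^(1,j) for x^(1), keeping the intercept feature fixed at 1.
rhs = y1 - Theta[:, 0] * 1.0          # subtract the bias contribution theta_0 * x_0
x1_rest, *_ = np.linalg.lstsq(Theta[:, 1:], rhs, rcond=None)
x1 = np.concatenate(([1.0], x1_rest))
print(x1)  # approximately [1, 1, 0]
```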

Optimization algorithm

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(i)}\):

\[\underset{x^{(i)}}{min}\frac{1}{2}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(x^{(i)}_k)^2 \]

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):

\[\underset{x^{(1)},\dots,x^{(n_m)}}{\min}\ \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2 \]

Collaborative filtering

Given \(x^{(1)},\dots,x^{(n_m)}\) (and movie ratings), can estimate \(\theta^{(1)},\dots,\theta^{(n_u)}\)

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\) (and movie ratings), can estimate \(x^{(1)},\dots,x^{(n_m)}\)

Randomly guess some \(\theta\), learn features for the different movies from them, then use those features to get a better estimate of the parameters \(\theta\), then use \(\theta\) to get better features, and keep iterating. Eventually this converges to reasonable movie features and reasonable parameter estimates for the different users.

This is the basic collaborative filtering algorithm. For a recommender system, what makes it work is that every user has rated several movies and every movie has been rated by several users, so the iteration that estimates \(x\) and \(\theta\) can be repeated.

When you run the algorithm, you need to observe a large number of users; it is their actual behaviour that collaboratively produces better rating estimates for everyone.

Every user who rates movies helps the algorithm learn better features, and the learned features can in turn be used to predict other users' ratings more accurately. That is the other sense of "collaborative": every user is helping the algorithm do a better job of feature learning.
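A rough sketch of this alternating procedure is shown below. It is only illustrative: the intercept feature \(x_0 = 1\) is dropped and all features are learned (which matches the full collaborative filtering algorithm introduced next), and all names and defaults are my own:

```python
import numpy as np

def alternating_cf(Y, R, n_features, alpha=0.01, lam=0.1, n_rounds=50, n_steps=100):
    """Alternate between fitting X (movie features) and Theta (user parameters).

    Y : (n_m, n_u) ratings matrix
    R : (n_m, n_u) indicator, R[i, j] = 1 if movie i was rated by user j
    """
    n_m, n_u = Y.shape
    rng = np.random.default_rng(0)
    X = rng.normal(scale=0.1, size=(n_m, n_features))
    Theta = rng.normal(scale=0.1, size=(n_u, n_features))

    for _ in range(n_rounds):
        for _ in range(n_steps):                      # learn X with Theta fixed
            err = (X @ Theta.T - Y) * R               # errors only on rated entries
            X -= alpha * (err @ Theta + lam * X)
        for _ in range(n_steps):                      # learn Theta with X fixed
            err = (X @ Theta.T - Y) * R
            Theta -= alpha * (err.T @ X + lam * Theta)
    return X, Theta
```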

Collaborative filtering algorithm

Collaborative filtering optimization objective

Minimizing with respect to \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:

\[J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})=\frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2 \]

Collaborative filtering algorithm

  1. Initialize \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values

  2. Minimize \(J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})\) using gradient descent (or an advanced optimization algorithm). E.g. for every \(j = 1,\dots,n_u\), \(i = 1,\dots,n_m\) (a vectorized sketch is given after this list):

    \[x_k^{(i)}=x_k^{(i)} - \alpha\left(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(i)}\right)\\ \theta_k^{(j)}=\theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}\right) \]

  3. For a user with parameters \(\theta\) and a movie with (learned) features \(x\), predict a star rating of \(\theta^Tx\)
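The cost \(J\) and both gradients can be computed in vectorized form. The sketch below is one possible implementation under assumed array shapes (movies as rows of `X`, users as rows of `Theta`); it is not the course's reference code:

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost J and its gradients with respect to X and Theta.

    X     : (n_m, n) movie features        Theta : (n_u, n) user parameters
    Y     : (n_m, n_u) ratings             R     : (n_m, n_u) indicator, R[i, j] = 1 if rated
    """
    err = (X @ Theta.T - Y) * R                      # errors only where r(i, j) = 1
    J = 0.5 * np.sum(err ** 2) \
        + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lam * X                   # partial derivatives w.r.t. x_k^(i)
    Theta_grad = err.T @ X + lam * Theta             # partial derivatives w.r.t. theta_k^(j)
    return J, X_grad, Theta_grad

# One simultaneous gradient-descent step (step 2 above):
#   J, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y, R, lam)
#   X, Theta = X - alpha * X_grad, Theta - alpha * Theta_grad
```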

Vectorization: Low rank matrix factorization

Collaborative filtering

The users' ratings of the movies are collected in a matrix \(Y\).

Predicted rating matrix:

\[X\Theta^T=\left[ \begin{matrix} --(x^{(1)})^T-- \\ --(x^{(2)})^T-- \\ \vdots \\ --(x^{(n_m)})^T-- \\ \end{matrix} \right] \left[ \begin{matrix} --(\theta^{(1)})^T-- \\ --(\theta^{(2)})^T-- \\ \vdots \\ --(\theta^{(n_u)})^T-- \\ \end{matrix} \right]^T =\left[ \begin{matrix} (\theta^{(1)})^Tx^{(1)} & (\theta^{(2)})^Tx^{(1)} & \cdots & (\theta^{(n_u)})^Tx^{(1)} \\ (\theta^{(1)})^Tx^{(2)} & (\theta^{(2)})^Tx^{(2)} & \cdots & (\theta^{(n_u)})^Tx^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{(1)})^Tx^{(n_m)} & (\theta^{(2)})^Tx^{(n_m)} & \cdots & (\theta^{(n_u)})^Tx^{(n_m)} \end{matrix} \right] \]

For each product \(i\),we learn a feature vector \(x^{(i)} \in \mathbb{R}^n\)

How to find movies related to movie \(i\)?

A small \(\left\|x^{(i)} - x^{(j)}\right\|\) means that movie \(j\) and movie \(i\) are "similar"

5 most similar movies to movie \(i\): find the 5 movies \(j\) with the smallest \(\left\|x^{(i)} - x^{(j)}\right\|\)
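A minimal sketch of both operations, assuming `X` holds one learned feature vector per movie (row) and `Theta` one parameter vector per user (row); names are my own:

```python
import numpy as np

def predict_all(X, Theta):
    """Full predicted-rating matrix: entry (i, j) is (theta^(j))^T x^(i)."""
    return X @ Theta.T

def most_similar(X, i, k=5):
    """Indices of the k movies whose learned features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    dists[i] = np.inf                          # exclude movie i itself
    return np.argsort(dists)[:k]
```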

Implementational detail: Mean normalization

Users who have not rated any movies

Consider learning the parameter vector \(\theta^{(5)}\) for user 5, Eve. Eve has not rated any movies, so the only term of the cost function that involves \(\theta^{(5)}\) is the regularization term \(\frac{\lambda}{2}\sum_{k=1}^n(\theta^{(5)}_k)^2\). Minimizing the cost therefore gives \(\theta^{(5)}=\left[\begin{matrix}0\\0\\0\end{matrix}\right]\), so \((\theta^{(5)})^Tx^{(i)} = 0\) for every movie: Eve is predicted to give every movie 0 stars. This result is clearly problematic.

Mean normalization

Compute each movie's mean rating \(\mu_i\) and subtract it from \(Y\); then train \(\theta^{(j)}\) and \(x^{(i)}\) on the new \(Y\). Since user 5 still has not rated any movies, learning on the normalized data still gives \(\theta^{(5)} = 0\).

When predicting user \(j\)'s rating of movie \(i\), add the mean back: predict \((\theta^{(j)})^Tx^{(i)} + \mu_i\). Even when the learned part of the prediction is 0, the result is the movie's average rating.
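A small sketch of mean normalization under these conventions (rated entries are marked by an indicator matrix `R`; names and layout are my own):

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over its rated entries only."""
    mu = (Y * R).sum(axis=1) / np.maximum(R.sum(axis=1), 1)   # per-movie mean
    Y_norm = (Y - mu[:, None]) * R                            # normalize rated entries only
    return Y_norm, mu

# After training on Y_norm, predict user j's rating of movie i as
#   (theta^(j))^T x^(i) + mu[i]
# so a user with no ratings (theta^(j) = 0) falls back to the movie's average rating.
```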
