Learning Theory
Assumptions
- data in the training set and test set are drawn from the same distribution \(\mathcal{D}\)
- all samples are drawn independently (i.i.d.)
- the learning algorithm is a deterministic function of the data, but because the data are random, the output parameter is a random variable with a sampling distribution; there is a "true parameter" \(\theta^{*}\) that is fixed but unknown, and we wish to get close to it
Parameter view of bias and variance
- bias is how far the estimated parameter is, on average, from the true parameter \(\theta^{*}\)
- variance is how much the estimate fluctuates when the learning process is repeated on fresh training sets
- these are first-moment and second-moment properties of the sampling distribution
We hope that as the number of samples grows the variance tends to 0, so that \(\hat{\theta} \rightarrow \theta^{*}\); if \(\mathbb{E}[\hat{\theta}]=\theta^{*}\), the estimator is called unbiased (see the sketch below).
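A minimal sketch of this sampling-distribution view (not from the notes: the Gaussian data, the true parameter value, and the sample sizes are invented for illustration). It repeats the learning process on fresh training sets and measures the bias and variance of the sample-mean estimator.

```python
# A minimal sketch: bias and variance of an estimator, measured by simulating
# its sampling distribution. All constants here are illustration values.
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0            # the fixed but unknown "true parameter"
n_repeats = 5000            # number of independently drawn training sets

for m in (10, 100, 1000):   # training-set sizes
    # one estimate per simulated training set -> the sampling distribution
    estimates = rng.normal(theta_star, 1.0, size=(n_repeats, m)).mean(axis=1)
    bias = estimates.mean() - theta_star    # first-moment property
    variance = estimates.var()              # second-moment property
    print(f"m={m:5d}  bias={bias:+.4f}  variance={variance:.5f}")
# bias stays near 0 (the sample mean is unbiased); variance shrinks like 1/m
```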
Hypotheses
- \(g\): best possible hypothesis
- \(h^{*}\): best hypothesis in the class \(\mathcal{H}\)
- \(\hat{h}\): the hypothesis learned from finite data
empirical risk
- \(\hat{\varepsilon}(h)=\frac{1}{m} \sum_{i=1}^{m} 1\left\{h\left(x^{(i)}\right) \neq y^{(i)}\right\}\)
- the misclassification rate on the training set
generalization error
- \(\varepsilon(h)=P_{(x, y) \sim \mathcal{D}}(h(x) \neq y)\)
- probability of making a mistake on a new example drawn from \(\mathcal{D}\) (contrasted with the empirical risk in the sketch below)
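A minimal sketch contrasting the two quantities above (the distribution \(\mathcal{D}\), the 10% label noise, and the fixed threshold hypothesis are all invented for illustration): the empirical risk is computed on a small training set, while the generalization error is approximated by evaluating the same hypothesis on a large fresh sample from the same distribution.

```python
# A minimal sketch: empirical risk vs. a Monte Carlo estimate of the
# generalization error, for one fixed hypothesis h. Everything is made up.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n examples (x, y) ~ D: x uniform on [0, 1], y = 1{x > 0.5} flipped 10% of the time."""
    x = rng.uniform(0, 1, n)
    y = ((x > 0.5) ^ (rng.uniform(0, 1, n) < 0.1)).astype(int)
    return x, y

def h(x):
    """A fixed hypothesis: predict 1 whenever x > 0.45 (a slightly-off threshold)."""
    return (x > 0.45).astype(int)

x_tr, y_tr = sample(50)                 # small training set
emp_risk = np.mean(h(x_tr) != y_tr)     # \hat{eps}(h): error rate on the training set
x_new, y_new = sample(1_000_000)        # large fresh sample approximates D
gen_err = np.mean(h(x_new) != y_new)    # eps(h) ~ P(h(x) != y) on new examples
print(f"empirical risk = {emp_risk:.3f}, generalization error ≈ {gen_err:.3f}")
```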
Bayes error \(\varepsilon(g)\): the irreducible error rate
Approximation error \(\varepsilon(h^{*}) - \varepsilon(g)\)
- a property of the hypothesis class \(\mathcal{H}\) (how well the best member of the class can possibly do)
Estimation error \(\varepsilon(\hat{h}) - \varepsilon(h^{*})\)
Decomposition of the generalization error
- = irreducible (Bayes) error + approximation error + estimation error
- = irreducible error + bias + variance (approximation error plays the role of bias, estimation error the role of variance); the identity is written out below
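Written out, the decomposition is just the telescoping identity over the three error terms defined above:

\[
\varepsilon(\hat{h}) \;=\; \underbrace{\varepsilon(g)}_{\text{irreducible (Bayes) error}} \;+\; \underbrace{\varepsilon(h^{*})-\varepsilon(g)}_{\text{approximation error}} \;+\; \underbrace{\varepsilon(\hat{h})-\varepsilon(h^{*})}_{\text{estimation error}}
\]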
Fight variance
- increase the number of samples
- regularization (may introduce some bias but can reduce variance significantly); see the sketch below
Fight high bias
- make the class \(\mathcal{H}\) bigger
- this reduces bias but increases variance
- the opposite of regularization
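A minimal sketch of the regularization trade-off (everything here is invented for illustration: the linear data-generating model, the noise level, and the two ridge penalties \(\lambda\)). Repeating the learning process on fresh training sets, the ridge-regularized fit shows slightly higher bias but noticeably lower variance than the unregularized one.

```python
# A minimal sketch: ridge vs. plain least squares, bias^2 and variance of the
# prediction at one fixed test point, estimated over many fresh training sets.
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([0.3, -0.2, 0.8, 0.1, -0.5])   # invented "true parameter"
d, m, n_repeats = len(true_w), 20, 2000
x_test = rng.normal(size=d)                      # a fixed test input
true_y = x_test @ true_w                         # noiseless target at x_test

for lam in (0.0, 10.0):                          # 0.0 = no regularization
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        X = rng.normal(size=(m, d))              # fresh training set each repeat
        y = X @ true_w + rng.normal(scale=1.0, size=m)
        # ridge solution: w_hat = (X^T X + lam I)^{-1} X^T y
        w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        preds[r] = x_test @ w_hat
    bias2 = (preds.mean() - true_y) ** 2
    variance = preds.var()
    print(f"lambda={lam:4.1f}  bias^2={bias2:.4f}  variance={variance:.4f}")
```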
ERM: Empirical Risk Minimizer
- \(\hat{h}=\arg \min _{h \in \mathcal{H}} \hat{\varepsilon}(h)\) (a toy example on a finite class is sketched below)
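A minimal sketch of ERM over a finite class (the data distribution, the label noise, and the class of 21 threshold rules \(h_t(x)=1\{x>t\}\) are all invented for illustration): \(\hat{h}\) is simply the hypothesis in the class with the lowest training error.

```python
# A minimal sketch: empirical risk minimization over a finite class of
# threshold classifiers. All data and class choices are illustration values.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)                                    # training inputs
y = ((x > 0.6) ^ (rng.uniform(0, 1, 200) < 0.1)).astype(int)  # noisy labels

# finite class H = {h_t(x) = 1{x > t}} for k = 21 candidate thresholds t
thresholds = np.linspace(0, 1, 21)
emp_risks = [np.mean((x > t).astype(int) != y) for t in thresholds]

best = int(np.argmin(emp_risks))     # \hat{h} = argmin of the empirical risk
print(f"ERM picks threshold {thresholds[best]:.2f} "
      f"with empirical risk {emp_risks[best]:.3f}")
```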
Uniform convergence
- \(\hat{\varepsilon}(h)\) vs \(\varepsilon(h)\): the relationship between empirical risk and generalization error
- \(\varepsilon(h)\) vs \(\varepsilon(h^{*})\): how far our result is from the best result in the class
- tools
- Union bound
- Hoeffding's Inequality
- \(P(|\phi-\hat{\phi}|>\gamma) \leq 2 \exp \left(-2 \gamma^{2} m\right)\), where \(\hat{\phi}\) is the average of \(m\) i.i.d. Bernoulli\((\phi)\) random variables (in our setting, \(\hat{\varepsilon}(h)\) is such an average of indicators with mean \(\varepsilon(h)\))
- Hoeffding's Inequality bounds \(|\hat{\varepsilon}(h)-\varepsilon(h)|\) (the first question above) for a single fixed hypothesis; applying it together with the union bound to every hypothesis in \(\mathcal{H}\) at once is what is called uniform convergence (checked empirically in the sketch below)
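A minimal sketch checking Hoeffding's inequality empirically for a single Bernoulli mean (the values of \(\phi\), \(\gamma\), \(m\), and the number of trials are invented): the observed deviation probability should sit below the bound \(2\exp(-2\gamma^{2}m)\).

```python
# A minimal sketch: Monte Carlo check of Hoeffding's inequality for one
# Bernoulli parameter. All numeric values are illustration choices.
import numpy as np

rng = np.random.default_rng(4)
phi, gamma, m, trials = 0.3, 0.05, 500, 20_000

# each row: m i.i.d. Bernoulli(phi) draws; phi_hat is the row-wise mean
phi_hat = (rng.uniform(size=(trials, m)) < phi).mean(axis=1)
deviation_prob = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma ** 2 * m)
print(f"P(|phi_hat - phi| > gamma) ≈ {deviation_prob:.4f}  <=  bound {bound:.4f}")
```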
Finite hypothesis class
- Assume that \(|\mathcal{H}| = k\)
- with \(A_{i}\) denoting the event \(\left|\varepsilon\left(h_{i}\right)-\hat{\varepsilon}\left(h_{i}\right)\right|>\gamma\): \(\begin{aligned} P\left(\exists h \in \mathcal{H}:|\varepsilon(h)-\hat{\varepsilon}(h)|>\gamma\right) &=P\left(A_{1} \cup \cdots \cup A_{k}\right) \\ & \leq \sum_{i=1}^{k} P\left(A_{i}\right) \\ & \leq \sum_{i=1}^{k} 2 \exp \left(-2 \gamma^{2} m\right) \\ &=2 k \exp \left(-2 \gamma^{2} m\right) \end{aligned}\)
- \(m\) is the size of the training set; set the right-hand side equal to \(\delta\), the probability that some hypothesis deviates by more than \(\gamma\). Any one of \(m\), \(\gamma\), \(\delta\) can then be solved for with the other two fixed, e.g. \(m \geq \frac{1}{2 \gamma^{2}} \log \frac{2 k}{\delta}\) (a small calculator is sketched at the end of this section)
- compare our performance to that of \(h^{*}\) (the first and third inequalities use uniform convergence; the second uses that \(\hat{h}\) minimizes the empirical risk):
- \(\begin{aligned} \varepsilon(\hat{h}) & \leq \hat{\varepsilon}(\hat{h})+\gamma \\ & \leq \hat{\varepsilon}\left(h^{*}\right)+\gamma \\ & \leq \varepsilon\left(h^{*}\right)+2 \gamma \end{aligned}\)
- so with probability at least \(1-\delta\): \(\varepsilon(\hat{h}) \leq\left(\min _{h \in \mathcal{H}} \varepsilon(h)\right)+2 \sqrt{\frac{1}{2 m} \log \frac{2 k}{\delta}}\)
- for an infinite hypothesis class, use the VC dimension \(d=\mathrm{VC}(\mathcal{H})\) to get a similar result (with probability at least \(1-\delta\)):
- \(\varepsilon(\hat{h}) \leq \varepsilon\left(h^{*}\right)+O\left(\sqrt{\frac{d}{m} \log \frac{m}{d}+\frac{1}{m} \log \frac{1}{\delta}}\right)\)
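A minimal sketch (the values of \(k\), \(\gamma\), \(\delta\), \(m\) are invented) that turns the finite-class bound above into the sample-size and accuracy calculations, i.e. solving \(2k\exp(-2\gamma^{2}m) \leq \delta\) for \(m\) or for \(\gamma\).

```python
# A minimal sketch: sample complexity and accuracy from the finite-class bound.
import math

def sample_size(k, gamma, delta):
    """Smallest m with 2k*exp(-2*gamma^2*m) <= delta, i.e. m >= log(2k/delta) / (2*gamma^2)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

def accuracy(k, m, delta):
    """The gamma guaranteed by m samples: sqrt(log(2k/delta) / (2m))."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))

print(sample_size(k=10_000, gamma=0.05, delta=0.01))  # roughly 2.9e3 samples
print(accuracy(k=10_000, m=5_000, delta=0.01))        # gamma ≈ 0.038
```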