Contents
1 Introduction
- design, choice, high-dim, hyperparam
- \(x^* = \arg\max_{x\in \mathcal X}f(x)\)
- compact subset of \(\mathbb R^d\), or ...
- stochastic output \(\mathbb E[y|f(x)]=f(x)\)
- unbiased noisy point-wise observations
- data efficient, evaluations are costly
- prior, refine
- best choice? acquisition function \(\alpha_n: \mathcal X\to \mathbb R\)
- mean, confidence interval
- myopic heuristics
- uncertainty is large (exploration), or prediction is high (exploitation)
- acquisition function: easy to find the optimum, analytic?
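The exploration/exploitation trade-off above can be illustrated with one common acquisition function, the upper confidence bound (UCB). This is a minimal sketch, not something these notes prescribe: the function name `ucb`, the weight `beta`, and the posterior means and standard deviations on five candidate points are all made-up assumptions.

```python
import numpy as np

def ucb(mean, std, beta=2.0):
    """Upper confidence bound: large where the prediction is high
    (exploitation) or the uncertainty is large (exploration)."""
    return mean + beta * std

# Hypothetical posterior summaries at 5 candidate points (illustrative only).
mean = np.array([0.1, 0.5, 0.4, 0.2, 0.0])
std = np.array([0.05, 0.05, 0.30, 0.60, 0.10])

alpha = ucb(mean, std, beta=2.0)         # acquisition values alpha_n(x)
x_next = int(np.argmax(alpha))           # index 3: the most uncertain point wins
x_greedy = int(np.argmax(ucb(mean, std, beta=0.0)))  # index 1: pure exploitation
```

With `beta = 0` the rule simply picks the highest predicted mean; increasing `beta` shifts the choice toward uncertain points.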
2 Bayesian Optimization with Parametric Models
- parametrized by \(w\)
- \(\mathcal D\): data
- Bayes' rule: \(p(w|\mathcal D)=p(\mathcal D|w)p(w)/p(\mathcal D)\)
- beliefs about \(w\) after observing data \(\mathcal D\)
- \(p(\mathcal D)\) is intractable in general, but it is only a normalizing constant
- prior: conjugacy, posterior available analytically
- \(K\) drugs, independent
- to optimize \(f\), on \(K\) indices, fully parametrized
- beta, conjugacy
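- Thompson sampling (TS): simplest strategy; pick each arm with its posterior probability of optimality, estimated by Monte Carlo (MC)
- \(a_{n+1}=\arg\max_a f_{\bar w}(a)\), with \(\bar w\) a sample from the posterior
- no tuning parameters other than the prior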
- linear model, feature, vector, \(f_w(a)=x_a^T w\)
- \(X\): input vectors, \(y\): outputs
- nonlinear basis functions
- radial
- Fourier
- learned from data
- feature map, regardless, weights can be computed analytically
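The \(K\)-drug example with a Beta prior and Thompson sampling can be sketched as follows. This is a minimal sketch under assumptions: rewards are taken to be Bernoulli, \(\bar w\) is read as a single posterior sample, and the function name `thompson_bernoulli`, the success probabilities, and the round count are hypothetical.

```python
import numpy as np

def thompson_bernoulli(true_probs, n_rounds=2000, alpha0=1.0, beta0=1.0, seed=0):
    """Thompson sampling for K independent Bernoulli arms with Beta priors."""
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    alpha = np.full(K, alpha0)   # Beta posterior parameter: prior + successes
    beta = np.full(K, beta0)     # Beta posterior parameter: prior + failures
    pulls = np.zeros(K, dtype=int)
    for _ in range(n_rounds):
        w_sample = rng.beta(alpha, beta)     # one posterior draw per arm
        a = int(np.argmax(w_sample))         # a_{n+1} = argmax_a f_w(a)
        reward = int(rng.random() < true_probs[a])  # simulated outcome
        alpha[a] += reward                   # conjugate update, no integrals
        beta[a] += 1 - reward
        pulls[a] += 1
    return alpha, beta, pulls

# Hypothetical success probabilities for K = 3 drugs.
alpha, beta, pulls = thompson_bernoulli([0.2, 0.5, 0.8])
best_arm = int(np.argmax(pulls))
```

Conjugacy is what keeps the update to two counter increments per round; the only quantities to choose are the prior parameters `alpha0` and `beta0`.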
3 Nonparametric Models
- start from Bayesian linear regression: observation variance \(\sigma^2\), zero-mean Gaussian prior on the weights with covariance \(V_0\); the posterior stays Gaussian
- basis functions, linear regression, symmetric positive-semidefinite, kernel
- intuitive similarity between pairs of points, rather than a feature map \(\Phi\)
- tractable, linear algebra, unnecessary to explicitly define \(\Phi\)
- GP, nonparametric model, prior mean, covariance
- \(f|X\sim \mathcal N(m,K)\)
- \(y|f, \sigma^2\sim \mathcal N(f,\sigma^2 I)\)
- posterior: use \(x\) and previous data (not "abstracted by parameters")
- kernel, structure, periodic, stationary
- Matérn, diagonal, parametrized
- kernel, smoothness and amplitude
- prior, possible offset, constant, expert knowledge
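The GP model above, \(f|X\sim \mathcal N(m,K)\) with Gaussian noise of variance \(\sigma^2\), gives a closed-form posterior at new points. The sketch below assumes a Matérn 5/2 kernel with a single lengthscale and amplitude and a constant prior mean; the function names `matern52` and `gp_posterior`, the parameter names, and the toy data are illustrative choices, not taken from these notes.

```python
import numpy as np

def matern52(A, B, lengthscale=1.0, amplitude=1.0):
    """Matern 5/2 kernel between rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    r = np.sqrt(np.sum(diff ** 2, axis=-1)) / lengthscale
    s = np.sqrt(5.0) * r
    return amplitude ** 2 * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def gp_posterior(X, y, X_star, noise_var=1e-2, prior_mean=0.0, **kern):
    """Posterior mean and covariance of f at X_star given noisy data (X, y)."""
    K = matern52(X, X, **kern) + noise_var * np.eye(len(X))
    K_s = matern52(X, X_star, **kern)        # cross-covariance, shape (n, m)
    K_ss = matern52(X_star, X_star, **kern)  # prior covariance at X_star
    L = np.linalg.cholesky(K)                # K = L L^T
    coef = np.linalg.solve(L.T, np.linalg.solve(L, y - prior_mean))
    mean = prior_mean + K_s.T @ coef
    V = np.linalg.solve(L, K_s)
    cov = K_ss - V.T @ V
    return mean, cov

# Toy 1-D data (assumed): three observations of sin(x), nearly noise-free.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X[:, 0])
X_star = np.array([[1.0], [10.0]])           # one observed input, one far away
mean, cov = gp_posterior(X, y, X_star, noise_var=1e-6)
```

The prediction uses the stored data `X` and `y` directly rather than a finite weight vector. At the observed input the posterior variance collapses toward zero, while far from the data the posterior reverts to the prior mean and prior variance.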