No Free Lunch定理
定理(No Free Lunch): 假定 A \mathcal{A} A是一个在域 X \mathcal{X} X的二分类任务中任意一个机器学习算法,其损失函数为 0 - 1 0\text{-}1 0-1损失。令 n n n是一个大小为 ∣ X ∣ / 2 |\mathcal{X}|/2 ∣X∣/2的训练集,存在域 X \mathcal{X} X中的分布 D \mathcal{D} D,则有
(1)存在一个函数 f : X → { 0 , 1 } f:\mathcal{X}\rightarrow \{0,1\} f:X→{0,1},且有 L D ( f ) = 0 L_\mathcal{D}(f)=0 LD(f)=0。
(2)对于子列 S ∼ D n \mathcal{S\sim D}^n S∼Dn,则概率不等式 P ( L D ( A ( S ) ) ≥ 1 / 8 : S ∼ D n ) ≥ 1 / 7 P(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8: \mathcal{S}\sim\mathcal{D}^n)\ge 1/7 P(LD(A(S))≥1/8:S∼Dn)≥1/7成立。
证明:
(1) 令
C
\mathcal{C}
C表示域
X
\mathcal{X}
X中大小为
2
n
2n
2n的一个子集。主要的证明思路是只利用数据集
C
\mathcal{C}
C一半的数据样本点并不能给出剩下一半数据点的信息。 假定
H
\mathcal{H}
H表示数据集
C
\mathcal{C}
C到标签集合
{
0
,
1
}
\{0,1\}
{0,1}所有可能的函数集合,且
T
T
T表示的是函数集合的基数,其中
H
=
{
f
1
,
⋯
,
f
T
}
\mathcal{H}=\{f_1,\cdots,f_T\}
H={f1,⋯,fT},
T
=
2
2
n
T=2^{2n}
T=22n。对于
H
\mathcal{H}
H中每一个函数假设,令
D
i
\mathcal{D}_i
Di是
C
×
{
0
,
1
}
\mathcal{C}\times\{0,1\}
C×{0,1}中的分布
D
i
(
{
(
x
,
y
)
}
)
=
{
1
/
2
m
i
f
y
=
f
i
(
x
)
0
o
t
h
e
r
w
i
s
e
\mathcal{D}_i(\{(x,y)\})=\left\{\begin{array}{ll}1/2m & \mathrm{if}\text{ } y=f_i(x)\\0& \mathrm{otherwise}\end{array}\right.
Di({(x,y)})={1/2m0if y=fi(x)otherwise进而可知存在函数
f
i
f_i
fi,在数据分布
D
i
\mathcal{D}_i
Di上则有
L
D
i
(
f
i
)
=
0
L_{\mathcal{D}_i}(f_i)=0
LDi(fi)=0。
(2)主要证明的关键在于即对任意的学习算法
A
\mathcal{A}
A有
max
i
∈
[
T
]
E
S
∼
D
i
n
[
L
D
i
(
A
(
S
)
)
]
≥
1
/
4
\max\limits_{i \in [T]}\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]\ge 1 / 4
i∈[T]maxES∼Din[LDi(A(S))]≥1/4首先从
C
×
{
0
,
1
}
\mathcal{C}\times \{0,1\}
C×{0,1}中采样出
n
n
n个样本构造一个训练集,其中采样出的样本可以重复,进而可知有
k
=
(
2
n
)
n
k=(2n)^n
k=(2n)n中可能的样本序列。令这些样本序列分别表示为
S
1
,
S
2
,
⋯
,
S
k
\mathcal{S}_1,\mathcal{S}_2,\cdots,\mathcal{S}_k
S1,S2,⋯,Sk。
S
j
i
=
(
(
x
1
,
f
i
(
x
1
)
)
,
⋯
,
(
x
n
,
f
i
(
x
n
)
)
)
\mathcal{S}_j^i=((x_1,f_i(x_1)),\cdots,(x_n,f_i(x_n)))
Sji=((x1,fi(x1)),⋯,(xn,fi(xn)))表示的是函数
f
j
f_{j}
fj在样本序列
S
j
S_j
Sj中的数据集合,则有
E
S
∼
D
i
n
[
L
D
i
(
A
(
S
)
)
]
=
1
k
∑
j
=
1
k
L
D
i
(
A
(
S
j
i
)
)
\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]=\frac{1}{k}\sum\limits_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}^i_j))
ES∼Din[LDi(A(S))]=k1j=1∑kLDi(A(Sji))又因为
"
m
a
x
i
m
u
m
"
≥
"
a
v
e
r
a
g
e
"
≥
"
m
i
n
i
m
u
m
"
\mathrm{"maximum"}\ge \mathrm{"average"}\ge \mathrm{"minimum"}
"maximum"≥"average"≥"minimum",所以则有
max
i
∈
[
T
]
1
T
∑
j
=
1
k
L
D
i
(
A
(
S
j
i
)
)
≥
1
T
∑
i
=
1
T
1
k
∑
j
=
1
k
L
D
i
(
A
(
S
j
i
)
)
=
1
k
∑
j
=
1
k
1
T
∑
i
=
1
T
L
D
i
(
A
(
S
j
i
)
)
≥
min
j
∈
[
k
]
1
T
∑
i
=
1
T
L
D
i
(
A
(
S
j
i
)
)
\begin{aligned}\max\limits_{i\in [T]}\frac{1}{T}\sum\limits_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))&\ge \frac{1}{T}\sum\limits_{i=1}^T\frac{1}{k}\sum\limits_{j=1}^kL_{\mathcal{D}_i}(\mathcal{A(\mathcal{S}_j^i)})\\&=\frac{1}{k}\sum\limits_{j=1}^k\frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}^i_j))\\& \ge \min\limits_{j \in [k]}\frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\end{aligned}
i∈[T]maxT1j=1∑kLDi(A(Sji))≥T1i=1∑Tk1j=1∑kLDi(A(Sji))=k1j=1∑kT1i=1∑TLDi(A(Sji))≥j∈[k]minT1i=1∑TLDi(A(Sji))现固定
j
∈
[
k
]
j \in [k]
j∈[k],令
S
j
=
{
x
1
,
⋯
,
x
n
}
\mathcal{S}_j=\{x_1,\cdots,x_n\}
Sj={x1,⋯,xn},
C
\
S
j
=
{
v
1
,
⋯
,
v
p
}
C \backslash \mathcal{S}_j=\{v_1,\cdots,v_p\}
C\Sj={v1,⋯,vp},其中
p
≥
n
p\ge n
p≥n是剩余没有采样的样本数。对于每一个函数
h
:
C
→
{
0
,
1
}
h:C \rightarrow\{0,1\}
h:C→{0,1},
i
∈
[
T
]
i\in[T]
i∈[T]有
L
D
i
(
h
)
=
1
2
n
∑
x
∈
C
I
(
h
(
x
)
≠
f
i
(
x
)
)
=
1
2
n
(
∑
l
=
1
n
I
(
h
(
x
l
)
≠
f
i
(
x
l
)
)
+
∑
r
=
1
p
I
(
h
(
v
r
)
≠
f
i
(
v
r
)
)
)
≥
1
2
n
∑
r
=
1
p
I
(
h
(
v
r
)
≠
f
i
(
v
r
)
)
≥
1
2
p
∑
r
=
1
p
I
(
h
(
v
r
)
≠
f
i
(
v
r
)
)
\begin{aligned}L_{\mathcal{D}_i}(h)&=\frac{1}{2n}\sum\limits_{x \in C}\mathbb{I}(h(x)\ne f_i(x))\\&=\frac{1}{2n}\left(\sum\limits_{l=1}^n \mathbb{I}(h(x_l)\ne f_i(x_l))+\sum\limits_{r=1}^p \mathbb{I}(h(v_r)\ne f_i(v_r))\right)\\&\ge \frac{1}{2n}\sum\limits_{r=1}^p \mathbb{I}(h(v_r)\ne f_i(v_r))\\&\ge \frac{1}{2p}\sum\limits_{r=1}^{p}\mathbb{I}(h(v_r)\ne f_i(v_r))\end{aligned}
LDi(h)=2n1x∈C∑I(h(x)=fi(x))=2n1(l=1∑nI(h(xl)=fi(xl))+r=1∑pI(h(vr)=fi(vr)))≥2n1r=1∑pI(h(vr)=fi(vr))≥2p1r=1∑pI(h(vr)=fi(vr))所以
1
T
∑
i
=
1
T
L
D
i
(
A
(
S
j
i
)
)
≥
1
T
∑
i
=
1
T
1
2
p
∑
r
=
1
p
I
(
A
(
S
j
i
)
(
v
r
)
≠
f
i
(
v
r
)
)
=
1
2
p
∑
r
=
1
p
1
T
∑
i
=
1
T
I
(
A
(
S
j
i
)
(
v
r
)
≠
f
i
(
v
r
)
)
≥
1
2
min
r
∈
[
p
]
1
T
∑
i
=
1
T
I
(
A
(
S
j
i
)
(
v
r
)
≠
f
i
(
v
r
)
)
\begin{aligned}\frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))& \ge \frac{1}{T}\sum\limits_{i=1}^T\frac{1}{2p}\sum\limits_{r=1}^p \mathbb{I}(\mathcal{A(\mathcal{S}^i_j)(v_r)\ne f_i(v_r)})\\&=\frac{1}{2p}\sum\limits_{r=1}^p\frac{1}{T}\sum\limits_{i=1}^T \mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_i(v_r))\\&\ge \frac{1}{2}\min\limits_{r \in [p]}\frac{1}{T}\sum\limits_{i=1}^T\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\end{aligned}
T1i=1∑TLDi(A(Sji))≥T1i=1∑T2p1r=1∑pI(A(Sji)(vr)=fi(vr))=2p1r=1∑pT1i=1∑TI(A(Sji)(vr)=fi(vr))≥21r∈[p]minT1i=1∑TI(A(Sji)(vr)=fi(vr))对于给定的
r
∈
[
p
]
r\in [p]
r∈[p],因为
T
T
T是所有可能函数映射的基数,所以总有成对存在的
a
,
b
∈
[
T
]
a,b\in [T]
a,b∈[T]有
I
(
A
(
S
j
i
)
(
v
r
)
≠
f
a
(
v
r
)
)
+
I
(
A
(
S
j
i
)
(
v
r
)
≠
f
b
(
v
r
)
)
=
1
\mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_a(v_r))+\mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_b(v_r))=1
I(A(Sji)(vr)=fa(vr))+I(A(Sji)(vr)=fb(vr))=1进而则有
E
S
∼
D
i
n
[
L
D
i
(
A
(
S
)
)
]
=
e
1
T
∑
i
=
1
T
L
D
i
(
A
(
S
j
i
)
)
≥
1
2
min
r
∈
[
p
]
1
T
∑
i
=
1
T
I
(
A
(
S
j
i
)
(
v
r
)
≠
f
i
(
v
r
)
)
=
1
2
⋅
1
T
⋅
T
2
=
1
4
\begin{aligned}\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))] &= e \frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}^i_j))\\&\ge \frac{1}{2}\min\limits_{r \in [p]}\frac{1}{T}\sum\limits_{i=1}^T \mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_i(v_r))\\&=\frac{1}{2}\cdot \frac{1}{T}\cdot \frac{T}{2}=\frac{1}{4}\end{aligned}
ES∼Din[LDi(A(S))]=eT1i=1∑TLDi(A(Sji))≥21r∈[p]minT1i=1∑TI(A(Sji)(vr)=fi(vr))=21⋅T1⋅2T=41根据马尔可夫不等式的修改版可知,给定一个随机变量
X
∈
[
0
,
1
]
X \in[0,1]
X∈[0,1],给定一个常数
a
∈
[
0
,
1
]
a\in [0,1]
a∈[0,1],进而则有
E
[
X
]
=
∫
0
1
x
p
(
x
)
d
x
=
∫
0
a
x
p
(
x
)
d
x
+
∫
a
1
x
p
(
x
)
d
x
≤
a
∫
0
a
p
(
x
)
d
x
+
∫
a
1
p
(
x
)
d
x
=
a
(
1
−
p
{
X
≥
a
}
)
+
p
{
X
≥
a
}
=
a
+
(
1
−
a
)
P
{
X
≥
a
}
\begin{aligned}\mathbb{E}[X]&=\int_0^1 x p(x)dx\\&= \int_0^a x p(x)dx + \int_a^1xp(x)dx\\&\le a \int_0^a p(x)dx + \int^1_a p(x)dx\\&=a(1-p\{X\ge a\})+p\{X\ge a\}\\&=a+(1-a)P\{X\ge a\}\end{aligned}
E[X]=∫01xp(x)dx=∫0axp(x)dx+∫a1xp(x)dx≤a∫0ap(x)dx+∫a1p(x)dx=a(1−p{X≥a})+p{X≥a}=a+(1−a)P{X≥a}马尔可夫不等式为
P
{
X
≥
a
}
≥
E
[
X
]
−
a
1
−
a
P\{X\ge a\}\ge \frac{\mathbb{E}[X]-a}{1-a}
P{X≥a}≥1−aE[X]−a利用马尔可夫不等可知
P
(
L
D
(
A
(
S
)
)
≥
1
/
8
:
S
∼
D
n
)
=
E
S
∼
D
i
n
[
L
D
i
(
A
(
S
)
)
]
−
1
/
8
1
−
1
/
8
≥
1
/
4
−
1
/
8
1
−
1
/
8
=
1
7
\begin{aligned}P(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8: \mathcal{S}\sim\mathcal{D}^n)&=\frac{\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]-1/8}{1-1/8}\\& \ge \frac{1/4-1/8}{1-1/8}=\frac{1}{7}\end{aligned}
P(LD(A(S))≥1/8:S∼Dn)=1−1/8ES∼Din[LDi(A(S))]−1/8≥1−1/81/4−1/8=71