Very important edit: all Ai are unique.
Question
I have a list A of n unique objects. Each object Ai has a variable percentage Pi.
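I want to create an algorithm that generates a new list B of k objects (k …). The probability that an object Ai appears in B is Pi.

What I have tried
(These snippets are in PHP for testing purposes only.)
At first I tried the two algorithms listed below. The only difference between them is that the first tries again when it encounters a duplicate, while the second removes an object from list A as soon as it is picked. As it turns out, the two algorithms produce the same output probabilities.
I ran the second algorithm 100,000 times and kept track of how many times each letter was picked. The first results array below lists, based on those 100,000 tests, the percentage chance that a letter is picked in any list B.
Looking back at the algorithm, this makes sense. It incorrectly interprets the original percentages as the chance that an object is picked for any given position, rather than for any list B. So, for example, the chance that Z is picked somewhere in list B is about 93%, whereas the chance that Z is picked for a given index Bn is 20%. This is NOT what I want. I want the chance that Z is picked in list B to be 20%.
Is this even possible? How can it be done?

Edit 1
I tried simply making the sum of all Pi equal to k. This works if all Pi are equal, but once their values are modified the results become more and more wrong, as the initial probabilities and the results after 10,000 runs (the second list and the second results array below) show.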
First I made a list A:
$list = [
    "A" => 2.5,
    "B" => 2.5,
    "C" => 2.5,
    "D" => 2.5,
    "E" => 2.5,
    "F" => 2.5,
    "G" => 2.5,
    "H" => 2.5,
    "I" => 5,
    "J" => 5,
    "K" => 2.5,
    "L" => 2.5,
    "M" => 2.5,
    "N" => 2.5,
    "O" => 2.5,
    "P" => 2.5,
    "Q" => 2.5,
    "R" => 2.5,
    "S" => 2.5,
    "T" => 2.5,
    "U" => 5,
    "V" => 5,
    "W" => 5,
    "X" => 5,
    "Y" => 5,
    "Z" => 20
];
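// Algorithm 1: pick a letter by cumulative weight; on a duplicate, discard the draw and try again.
$result = [];
while (count($result) < 10) {
    $rnd = rand(0, 10000000) / 100000; // random number in [0, 100]
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            if (in_array($key, $result)) {
                break; // duplicate: retry with a new random number
            } else {
                $result[] = $key;
                break;
            }
        }
    }
}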
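// Algorithm 2: pick a letter by cumulative weight, then remove it from the list.
$result = [];
while (count($result) < 10) {
    // Total weight of the letters that are still available.
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
    }
    $rnd = rand(0, $sum * 100000) / 100000; // random number in [0, $sum]
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            $result[] = $key;
            unset($list[$key]); // a picked letter cannot be picked again
            break;
        }
    }
}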
Results of the second algorithm (percentage of the 100,000 lists B that contain each letter):
[A] => 30.213
[B] => 29.865
[C] => 30.357
[D] => 30.198
[E] => 30.152
[F] => 30.472
[G] => 30.343
[H] => 30.011
[I] => 51.367
[J] => 51.683
[K] => 30.271
[L] => 30.197
[M] => 30.341
[N] => 30.15
[O] => 30.225
[P] => 30.135
[Q] => 30.406
[R] => 30.083
[S] => 30.251
[T] => 30.369
[U] => 51.671
[V] => 52.098
[W] => 51.772
[X] => 51.739
[Y] => 51.891
[Z] => 93.74
Initial probabilities:
$list = [
    "A" => 8.4615,
    "B" => 68.4615,
    "C" => 13.4615,
    "D" => 63.4615,
    "E" => 18.4615,
    "F" => 58.4615,
    "G" => 23.4615,
    "H" => 53.4615,
    "I" => 28.4615,
    "J" => 48.4615,
    "K" => 33.4615,
    "L" => 43.4615,
    "M" => 38.4615,
    "N" => 38.4615,
    "O" => 38.4615,
    "P" => 38.4615,
    "Q" => 38.4615,
    "R" => 38.4615,
    "S" => 38.4615,
    "T" => 38.4615,
    "U" => 38.4615,
    "V" => 38.4615,
    "W" => 38.4615,
    "X" => 38.4615,
    "Y" => 38.4615,
    "Z" => 38.4615
];
Results after 10,000 runs:
Array
(
[A] => 10.324
[B] => 59.298
[C] => 15.902
[D] => 56.299
[E] => 21.16
[F] => 53.621
[G] => 25.907
[H] => 50.163
[I] => 30.932
[J] => 47.114
[K] => 35.344
[L] => 43.175
[M] => 39.141
[N] => 39.127
[O] => 39.346
[P] => 39.364
[Q] => 39.501
[R] => 39.05
[S] => 39.555
[T] => 39.239
[U] => 39.283
[V] => 39.408
[W] => 39.317
[X] => 39.339
[Y] => 39.569
[Z] => 39.522
)
Answer:
We must have sum_i P_i = k, or else we cannot succeed.
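One way to see why this condition is necessary, sketched here with indicator variables (the notation $\mathbf{1}[\cdot]$ is introduced for illustration and is not part of the original post): B always contains exactly k elements, so by linearity of expectation

```latex
k = \mathbb{E}\bigl[|B|\bigr]
  = \mathbb{E}\Bigl[\sum_{i} \mathbf{1}[A_i \in B]\Bigr]
  = \sum_{i} \Pr[A_i \in B]
  = \sum_{i} P_i .
```

In the question's first list the percentages sum to 100%, that is sum_i P_i = 1, while k = 10, so no algorithm could satisfy the requirement for that input.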
As stated, the problem is somewhat easy, but you may not like this answer, on the grounds that it is "not random enough".
Sample a uniform random permutation Perm on the integers [0, n)
Sample X uniformly at random from [0, 1)
For i in Perm
    If X < P_i, then append A_i to B and update X := X + (1 - P_i)
    Else, update X := X - P_i
End
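You will want to do the calculations involving real numbers with fixed-point arithmetic rather than floating-point approximations.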
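The pseudocode above could be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's own code: the function name `systematic_sample`, the `rng` parameter, and the use of `fractions.Fraction` as the exact arithmetic are all assumptions made for illustration. X is drawn from a grid of 2**53 points, which only approximates a uniform draw from [0, 1).

```python
import random
from fractions import Fraction


def systematic_sample(items, probs, rng=random):
    """Return a list B in which items[i] appears with probability probs[i].

    Hypothetical helper: probs should be exact values (Fraction, int, or
    strings such as "1/4") that lie in [0, 1] and sum to an integer k.
    """
    probs = [Fraction(p) for p in probs]
    k = sum(probs)
    if k.denominator != 1 or any(not 0 <= p <= 1 for p in probs):
        raise ValueError("probabilities must lie in [0, 1] and sum to an integer")

    # Uniform random permutation of the indices [0, n).
    order = list(range(len(items)))
    rng.shuffle(order)

    # X on a fine grid in [0, 1); exact arithmetic keeps it in [0, 1) throughout.
    x = Fraction(rng.randrange(2**53), 2**53)

    chosen = []
    for i in order:
        if x < probs[i]:
            chosen.append(items[i])
            x += 1 - probs[i]
        else:
            x -= probs[i]
    return chosen
```

Because every update is exact, X returns to its starting value after the loop, which forces the number of selected items to equal k. With floating-point probabilities that do not sum exactly to an integer, the sketch raises `ValueError` instead of silently returning a list of the wrong length.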
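The missing condition is that the distribution should have a technical property called "maximum entropy". Like amit, I cannot think of a good way to do this. Here is a clumsy way.
My first (and wrong) instinct for solving this problem was to include each A_i in B independently with probability P_i and retry until B is the right length (there won't be too many retries, for reasons that you can ask math.SE about). The problem is that the conditioning messes up the probabilities. If P_1 = 1/3 and P_2 = 2/3 and k = 1, then the outcomes are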
{}: probability 2/9
{A_1}: probability 1/9
{A_2}: probability 4/9
{A_1, A_2}: probability 2/9,
and the conditional probabilities, given that B has exactly one element, are actually 1/5 for A_1 and 4/5 for A_2.
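The numbers in this small example can be checked by brute force. The sketch below enumerates every subset with exact fractions; `conditional_inclusion` is a hypothetical helper name, and the approach is only practical for very small n because it visits all 2^n subsets.

```python
from fractions import Fraction
from itertools import product


def conditional_inclusion(probs, k):
    """Exact P(A_i in B | len(B) == k) under independent inclusion."""
    probs = [Fraction(p) for p in probs]
    total = Fraction(0)                   # probability that len(B) == k
    incl = [Fraction(0)] * len(probs)     # probability that len(B) == k and A_i in B
    for mask in product([0, 1], repeat=len(probs)):
        if sum(mask) != k:
            continue
        # Probability of this exact subset under independent inclusion.
        weight = Fraction(1)
        for bit, p in zip(mask, probs):
            weight *= p if bit else 1 - p
        total += weight
        for i, bit in enumerate(mask):
            if bit:
                incl[i] += weight
    return [x / total for x in incl]


print(conditional_inclusion([Fraction(1, 3), Fraction(2, 3)], 1))
# [Fraction(1, 5), Fraction(4, 5)]
```

The two subsets of size one have probabilities 1/9 and 4/9, so conditioning on size one divides both by 5/9.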
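Instead, we should substitute new probabilities Q_i that yield the proper conditional distribution. I don't know of a closed form for the Q_i, so I propose to find them using a numerical optimization algorithm like gradient descent. Initialize Q_i = P_i (why not?). Using dynamic programming, it is possible to find, for the current setting of the Q_i, the probability that, given an outcome with l elements, A_i is one of those elements. (We only care about the l = k entry, but we need the others to make the recurrences work.) With a little more work, we can get the whole gradient. Sorry this is so sketchy.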
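The dynamic-programming step described above might look like the following minimal sketch. The names `subset_size_distribution` and `inclusion_given_size` are hypothetical, and for clarity the sketch recomputes the size distribution of the other elements from scratch for every i, which costs O(n^3) overall rather than reusing one table.

```python
def subset_size_distribution(qs):
    """dist[l] = probability that exactly l elements are included,
    when element i is included independently with probability qs[i]."""
    dist = [1]
    for q in qs:
        nxt = [0] * (len(dist) + 1)
        for size, pr in enumerate(dist):
            nxt[size] += pr * (1 - q)   # this element is left out
            nxt[size + 1] += pr * q     # this element is included
        dist = nxt
    return dist


def inclusion_given_size(qs, k):
    """P(A_i in B | len(B) == k) for every i, assuming 1 <= k <= len(qs)."""
    qs = list(qs)
    full = subset_size_distribution(qs)
    result = []
    for i, q in enumerate(qs):
        # A_i is in a size-k outcome exactly when A_i is included and
        # k - 1 of the remaining elements are included.
        others = subset_size_distribution(qs[:i] + qs[i + 1:])
        result.append(q * others[k - 1] / full[k])
    return result
```

For example, `inclusion_given_size([1/3, 2/3], 1)` gives approximately `[0.2, 0.8]`, matching the conditional probabilities worked out by hand above. The goal of the numerical search is then to adjust the Q_i until this function returns the target P_i.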
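In Python, using a nonlinear solving method that seems always to converge (simultaneously update each q_i to the value that would make its marginal correct, then normalize):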
#!/usr/bin/env python3
import collections
import operator
import random


def constrained_sample(qs):
    # Include each index independently with probability q and retry
    # until exactly k indices are selected.
    k = round(sum(qs))
    while True:
        sample = [i for i, q in enumerate(qs) if random.random() < q]
        if len(sample) == k:
            return sample


def size_distribution(qs):
    # size_dist[j] = probability that exactly j elements are included.
    size_dist = [1]
    for q in qs:
        size_dist.append(0)
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] += size_dist[j - 1] * q
            size_dist[j - 1] *= 1 - q
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def size_distribution_without(size_dist, q):
    # Undo the effect of one element with probability q on size_dist.
    # The branch is chosen so that the division is by a number >= 0.5.
    size_dist = size_dist[:]
    if q >= 0.5:
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] /= q
            size_dist[j - 1] -= size_dist[j] * (1 - q)
        del size_dist[0]
    else:
        for j in range(1, len(size_dist)):
            size_dist[j - 1] /= 1 - q
            size_dist[j] -= size_dist[j - 1] * q
        del size_dist[-1]
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def test_size_distribution(qs):
    # Removing an element must agree with recomputing from scratch.
    d = size_distribution(qs)
    for i, q in enumerate(qs):
        d1a = size_distribution_without(d, q)
        d1b = size_distribution(qs[:i] + qs[i + 1:])
        assert len(d1a) == len(d1b)
        assert max(map(abs, map(operator.sub, d1a, d1b))) <= 1e-10


def normalized(qs, k):
    # Rescale qs so that they sum to k.
    sum_qs = sum(qs)
    qs = [q * k / sum_qs for q in qs]
    assert abs(sum(qs) / k - 1) <= 1e-10
    return qs


def approximate_qs(ps, reps=100):
    k = round(sum(ps))
    qs = list(ps)  # copy, so that the first pass does not overwrite the targets in ps
    for j in range(reps):
        size_dist = size_distribution(qs)
        for i, p in enumerate(ps):
            d = size_distribution_without(size_dist, qs[i])
            d.append(0)
            # Value of q_i that makes P(A_i in B | len(B) == k) equal to p.
            qs[i] = p * d[k] / ((1 - p) * d[k - 1] + p * d[k])
        qs = normalized(qs, k)
    print(qs)
    return qs


def test(ps, reps=100000):
    qs = approximate_qs(ps)
    counter = collections.Counter()
    for j in range(reps):
        counter.update(constrained_sample(qs))
    test_size_distribution(qs)
    print(size_distribution(qs))
    print('p', 'Actual', sep='\t')
    for i, p in enumerate(ps):
        print(p, counter[i] / reps, sep='\t')


if __name__ == '__main__':
    test([i / 25 for i in range(26)])