[IR] Compression

2023-01-14 07:50:28

关系：Vocabulary vs. collection size

Heaps’ law: M = kT^b
M
is
the
size
of
the
vocabulary,
T
is
the
number
of tokens
in
the
collec*on

Typical
values:
30
≤
k
≤
100
and
b
≈
0.5σ

log M = log K - b*log T

关系：Vocabulary中每个term的量 vs. 该term的次序

Zipf’s law: cf_i = K/i

i.e. the most frequent term (the) occurs cf₁ times

The i th most frequent term has frequency proportional
to
1/i
.

log cf_i = log K - log i

1). Term's data单独拿出成为String形式, Terms里变为了指针，size:4B
　　11.2 → 7.6

2). Blocking。If k = 4, then 省了3个terms的空间，即3B*3-4(结束符1B)=5B
　　7.6 → 7.1

3). Front coding, 前缀冗余。
　　7.1 → 5.9

如下：

1). Seq1 + 1000 = Seq3

小链表表示大链表

2). Simple9

0110（ID）， 3（三段）， 9（每段的bit数）， 1（最后的waste位的个数）。

那么，4+3*9+1 = 32byte = 4 Bit

3). Gap ( If the ave gap of a term is G)

log₂G bits/gap, 当然会用到之后的Variabe Byte codes.

4). Variable Byte codes.

增加Control Bit，那么完整的一个数据表示：（0数据，0数据，……，1最后一个数据）

5). Elias-γ code

6). Elias-δ code

7). Golomb code

暂略

码农公寓