redis6.0.5之HyperLogLog阅读笔记1-基数估算前言翻译

/* The Redis HyperLogLog implementation is based on the following ideas:
Redis的超对数实现是基于以下的想法:
 * * The use of a 64 bit hash function as proposed in [1], in order to don't
 *   limited to cardinalities up to 10^9, at the cost of just 1 additional
 *   bit per register.
使用文1推荐的64位的哈希函数,仅用对每个寄存器增加一个比特的代价,就可以突破基数个数10的9次方的限制。
 * * The use of 16384 6-bit registers for a great level of accuracy, using
 *   a total of 12k per key.
为了获得一个比较高的精度,使用16384个6比特的寄存器,每个键值使用了12k(2的14次方 * 6 = 12k)
 * * The use of the Redis string data type. No new type is introduced.
 * * No attempt is made to compress the data structure as in [1]. Also the
 *   algorithm used is the original HyperLogLog Algorithm as in [2], with
 *   the only difference that a 64 bit hash function is used, so no correction
 *   is performed for values near 2^32 as in [1].
使用的是redis的字符串类型,没有引入新的数据类型。
没有采用文1中压缩数据结构的算法。同时使用了文2中原始的超对数算法,唯一不同的是使用了64位的哈希函数,
所以不用对类似文1中接近2^32的值做修正。
 * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
 *     Engineering of a State of The Art Cardinality Estimation Algorithm.
 *
 * [2] P. Flajolet, Eric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
 *     analysis of a near-optimal cardinality estimation algorithm.
(这两篇文章留到后面再慢慢翻译)
 * Redis uses two representations:
redis使用两种表示方式:
 * 1) A "dense" representation where every entry is represented by
 *    a 6-bit integer.
密集表示方式 每个实体用6比特整数来表示
 * 2) A "sparse" representation using run length compression suitable
 *    for representing HyperLogLogs with many registers set to 0 in
 *    a memory efficient way.
稀疏表示方式  使用行程长的压缩方式表示具有很多个0值寄存器的超级日志,这是一中非常有效的内存表示方法
 *
 * HLL header  HLL的头部
 * ===
 *
 * Both the dense and sparse representation have a 16 byte header as follows:
密集和稀疏的表示都拥有一个16个字节的如下头部:
 * +------+---+-----+----------+
 * | HYLL | E | N/U | Cardin.  |
 * +------+---+-----+----------+
 *
 * The first 4 bytes are a magic string set to the bytes "HYLL".
 * "E" is one byte encoding, currently set to HLL_DENSE or
 * HLL_SPARSE. N/U are three not used bytes.
开始的4个字节是一个魔法数,被设置为HYLL.
E表示一个字节的编码,当前用来设置HLL_DENSE或者HLL_SPARSE。N/U是3个还么有被使用的字节。
 * The "Cardin." field is a 64 bit integer stored in little endian format
 * with the latest cardinality computed that can be reused if the data
 * structure was not modified since the last computation (this is useful
 * because there are high probabilities that HLLADD operations don't
 * modify the actual data structure and hence the approximated cardinality).
Cardin.字段 是一个小端格式保存的最近计算得到64位的集合中元素个数。
如果自从上次计算之后数据结构没有发生变化,那么可以这个数值可以重复使用。
(这个是很有用的,因为HLLADD操作大概率不会修改实际的数据结构,从而不需要修改近似的集合中元素个数)

 * When the most significant bit in the most significant byte of the cached
 * cardinality is set, it means that the data structure was modified and
 * we can't reuse the cached value that must be recomputed.
一旦缓存基数的最重要字节的最重要的比特位被设置,
那就意味着数据结构被修改并且我们不能使用缓存的值,必须要重新计算
 * Dense representation  密集表示
 * ===
 *
 * The dense representation used by Redis is the following:
redis使用的密集表示如下
 * +--------+--------+--------+------//      //--+
 * |11000000|22221111|33333322|55444444 ....     |
 * +--------+--------+--------+------//      //--+
 *
 * The 6 bits counters are encoded one after the other starting from the
 * LSB to the MSB, and using the next bytes as needed.
6个比特位从lsb到msb依次编码,根据情况使用下一个字节的比特位。

 * Sparse representation 稀疏表示
 * ===
 *
 * The sparse representation encodes registers using a run length
 * encoding composed of three opcodes, two using one byte, and one using
 * of two bytes. The opcodes are called ZERO, XZERO and VAL.
稀疏表示编码寄存器使用了包含3种操作符的行程长编码,其中两个操作符使用1个字节,一个操作符使用两个字节。
这三个操作符分别为 ZERO, XZERO and VAL.
 * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
 * by the six bits 'xxxxxx', plus 1, means that there are N registers set
 * to 0. This opcode can represent from 1 to 64 contiguous registers set
 * to the value of 0.
ZERO操作符表示为00xxxxxx。6比特的整数用六个比特的'xxxxxx'表示,还要加上1,即意味着有N个寄存器被设置为0.
这个操作符可以表示从1到64个连续为0的寄存器。
 * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
 * integer represented by the bits 'xxxxxx' as most significant bits and
 * 'yyyyyyyy' as least significant bits, plus 1, means that there are N
 * registers set to 0. This opcode can represent from 0 to 16384 contiguous
 * registers set to the value of 0.
XZERO操作符用两个字节01xxxxxx yyyyyyyy来表示。14个比特的整数用高6位的比特'xxxxxx'和低8位的比特'yyyyyyyy'表示。
同时还需要加1,即意味着N个寄存器设置为0.这个操作符能够代表从0(加1的话,这里应该为1)到16384个连续为0的寄存器。
 * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
 * representing the value of a register, and a 2-bit integer representing
 * the number of contiguous registers set to that value 'vvvvv'.
 * To obtain the value and run length, the integers vvvvv and xx must be
 * incremented by one. This opcode can represent values from 1 to 32,
 * repeated from 1 to 4 times.
VAL操作符由1vvvvvxx表示,其中'vvvvv'5比特的整数表示寄存器的值,'xx'2比特的整数表示连续的值为'vvvvv'的寄存器个数。
为了获取值和行程长,整数 vvvvv 和 xx 必须加1.这个操作符表示从1到32的值,和1到4的重复次数。
 * The sparse representation can't represent registers with a value greater
 * than 32, however it is very unlikely that we find such a register in an
 * HLL with a cardinality where the sparse representation is still more
 * memory efficient than the dense representation. When this happens the
 * HLL is converted to the dense representation.
稀疏表示不能表示值超过值为32的寄存器,然而,不可能在HLL算法中找到这样一个寄存器,稀疏表示任然比密集表示更加内存有效。
如果这种情况发生,那么HLL算法就改用密集表示。
 * The sparse representation is purely positional. For example a sparse
 * representation of an empty HLL is just: XZERO:16384.
稀疏表示纯粹是位置表示。举一个稀疏表示的例子,一个空的HLL表示,就是 XZERO:16384
 * An HLL having only 3 non-zero registers at position 1000, 1020, 1021
 * respectively set to 2, 3, 3, is represented by the following three
 * opcodes:
一个HLL只有3个非零的寄存器,分别在位置1000, 1020, 1021,对应的值为2,3,3, 可以用如下的操作符来表示
 * XZERO:1000 (Registers 0-999 are set to 0)   从0到999的寄存器都是0
 * VAL:2,1    (1 register set to value 2, that is register 1000)   寄存器1000的值是2
 * ZERO:19    (Registers 1001-1019 set to 0)  寄存器1001到1019的寄存器都是0
 * VAL:3,2    (2 registers set to value 3, that is registers 1020,1021) 寄存器1020,1021的值都是3
 * XZERO:15362 (Registers 1022-16383 set to 0) 寄存器1022到16383的值都是0
 *
 * In the example the sparse representation used just 7 bytes instead
 * of 12k in order to represent the HLL registers. In general for low
 * cardinality there is a big win in terms of space efficiency, traded
 * with CPU time since the sparse representation is slower to access:
在上面的例子中,稀疏表示值使用了7个字节而不是12K表示了HLL寄存器。一般来说,对于低基数,在空间效率方面有很大优势,
因为稀疏表示访问速度比较慢,可以用CPU的时间来交换。

 * The following table shows average cardinality vs bytes used, 100
 * samples per cardinality (when the set was not representable because
 * of registers with too big value, the dense representation size was used
 * as a sample).
下面的表格展示了平均基数 和 使用字节的对比,
每基数100个样本(当集合由于值太大而无法表示时,使用密集表示大小作为样本)

 * 100 267
 * 200 485
 * 300 678
 * 400 859
 * 500 1033
 * 600 1205
 * 700 1375
 * 800 1544
 * 900 1713
 * 1000 1882
 * 2000 3480
 * 3000 4879
 * 4000 6089
 * 5000 7138
 * 6000 8042
 * 7000 8823
 * 8000 9500
 * 9000 10088
 * 10000 10591
 *
 * The dense representation uses 12288 bytes, so there is a big win up to
 * a cardinality of ~2000-3000. For bigger cardinalities the constant times
 * involved in updating the sparse representation is not justified by the
 * memory savings. The exact maximum length of the sparse representation
 * when this implementation switches to the dense representation is
 * configured via the define server.hll_sparse_max_bytes.
 */
密集表示法使用12288字节,因此在基数为~2000-3000的情况下有很大的优势。
对更大的基数,更新稀疏表示的常量时间不是简单的内存节省。
当实现切换到密集表示的时,准确的稀疏表示的最大长度被配置在变量 server.hll_sparse_max_bytes中

config.c中  默认的初始化配置为3000
createSizeTConfig("hll-sparse-max-bytes", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.hll_sparse_max_bytes, 3000, MEMORY_CONFIG, NULL, NULL)







 

 

上一篇:南大《软件分析》——Intermediate Representation


下一篇:End-to-end representation learning for Correlation Filter based tracking