[Paper 1] Adaptive Learned Bloom Filters under Incremental Workloads
ABSTRACT
The recently proposed paradigm of learned Bloom filters (LBF) seems to offer significant advantages over traditional Bloom filters in terms of low memory footprint and overall performance, as evidenced by empirical evaluations over static data. Its behavior in the presence of updates to the set of keys being stored in Bloom filters is not very well understood. At the same time, maintaining the false positive rates (FPR) of traditional Bloom filters in the presence of dynamics has been studied, and extensions that carefully expand the memory footprint of the filters without sacrificing FPR have been proposed. Building on these, we propose two distinct approaches for handling data updates encountered in practical uses of LBF: (i) CA-LBF, where we adjust the learned model (e.g., by retraining) to accommodate the new "unseen" data, resulting in classifier adaptive methods, and (ii) IA-LBF, where we replace the traditional Bloom filter with its adaptive version while keeping the learned model unchanged, leading to an index adaptive method. In this paper, we explore these two approaches in detail under incremental workloads, evaluating them in terms of their adaptability, memory footprint and false positive rates. Our empirical results using a variety of datasets and learned models of varying complexity show that our proposed methods' ability to handle incremental updates is quite robust.
Translation: The recently proposed paradigm of learned Bloom filters (LBF) appears to offer major advantages over traditional Bloom filters, with lower memory footprint and better overall performance in evaluations on static data. Its behavior under updates to the set of keys stored in the filter is not well understood. Meanwhile, maintaining the false positive rate (FPR) of traditional Bloom filters under dynamics has been studied, and extensions that carefully grow the filter's memory footprint without sacrificing FPR have been proposed. Building on these, two distinct approaches are proposed for handling the data updates that arise in practical uses of LBF: (1) CA-LBF, which adjusts the learned model to accommodate new unseen data, giving a classifier-adaptive method; (2) IA-LBF, which replaces the Bloom filter with an adaptive version while keeping the learned model unchanged, giving an index-adaptive method. The paper explores both approaches in detail under incremental workloads, evaluating their adaptability, memory footprint, and false positive rate. Empirical results on a variety of datasets and learned models of varying complexity show that the proposed methods handle incremental updates robustly.
Summary: The paper improves the handling of incrementally updated data through two approaches: (1) adapting the learned model to obtain a classifier-adaptive method; (2) keeping the learned model fixed and making the Bloom filter adaptive to obtain an index-adaptive method.
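To make the setup concrete, here is a minimal sketch (not the paper's implementation) of the baseline structure that both CA-LBF and IA-LBF start from: a score model answers high-confidence queries, and the keys it would misclassify go into a backup Bloom filter, which guarantees no false negatives. The even/odd `score` function is a toy stand-in for a real learned model.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter used as the LBF's backup filter."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class LearnedBloomFilter:
    """A score model answers high-confidence queries; keys the model
    scores below the threshold go into the backup filter, so there
    are no false negatives."""
    def __init__(self, score_fn, threshold, backup):
        self.score_fn, self.threshold, self.backup = score_fn, threshold, backup

    def build(self, keys):
        for k in keys:
            if self.score_fn(k) < self.threshold:  # model would miss this key
                self.backup.add(k)

    def __contains__(self, item):
        return self.score_fn(item) >= self.threshold or item in self.backup

# Toy stand-in for a learned model: "keys are usually even".
score = lambda x: 0.9 if x % 2 == 0 else 0.1
lbf = LearnedBloomFilter(score, 0.5, BloomFilter(1024, 3))
keys = [2, 4, 6, 7]          # 7 is a key the model misclassifies
lbf.build(keys)
assert all(k in lbf for k in keys)   # backup filter catches 7
```

Under an incremental workload, new low-scoring keys either force model retraining (the CA-LBF route) or accumulate in the backup filter, which is why the paper makes that filter adaptive (the IA-LBF route).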
[Paper 2] Stable Learned Bloom Filters for Data Streams
ABSTRACT
Bloom filters and their variants are elegant space-efficient probabilistic data structures for approximate set membership queries. It has been recently shown that the space cost of Bloom filters can be significantly reduced via a combination with pre-trained machine learning models, named Learned Bloom filters (LBF). LBF eases the space requirement of a Bloom filter by undertaking part of the queries using a classifier. However, current LBF structures generally target a static member set. Their performance would inevitably decay when there is a member update on the set, while this update requirement is not uncommon for real-world data streaming applications such as duplicate item detection, malicious URL checking, and web caching. To adapt LBF to data streams, we propose the Stable Learned Bloom Filters (SLBF), which address the performance decay issue on intensive insertion workloads by combining a classifier with updatable backup filters. Specifically, we propose two SLBF structures, Single SLBF (s-SLBF) and Grouping SLBF (g-SLBF). The theoretical analysis of these two structures shows that the expected false positive rate (FPR) of SLBF is asymptotically a constant over the insertion of new members. Extensive experiments on real-world datasets show that SLBF introduces a similar level of false negative rate (FNR) but yields a better FPR/storage trade-off compared with state-of-the-art (non-learned) Bloom filters optimized on data streams.
Translation: Bloom filters and their variants are elegant, space-efficient probabilistic data structures for approximate set membership queries. Recent work has shown that the space cost of a Bloom filter can be significantly reduced by combining it with a pre-trained machine learning model. The LBF eases the space requirement by letting a classifier answer part of the queries. However, current LBF structures generally target a static member set, and their performance inevitably decays when new data arrive, even though such updates are common in real-world data streaming applications like duplicate item detection, malicious URL checking, and web caching. To adapt LBF to data streams, the Stable Learned Bloom Filter (SLBF) is proposed; it solves the performance decay under intensive insertion workloads by combining a classifier with updatable backup filters. Specifically, two SLBF structures are proposed: single SLBF and grouping SLBF. Theoretical analysis of both structures shows that the expected false positive rate of SLBF tends to a constant as new members are inserted. Extensive experiments on real-world datasets show that SLBF introduces a similar level of false negative rate while yielding a better FPR/storage trade-off than state-of-the-art (non-learned) Bloom filters optimized for data streams.
Summary: Proposes the Stable Learned Bloom Filter, which solves the performance decay under intensive insertion workloads (data streams) by combining a classifier with updatable backup filters.
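As a rough sketch of the ingredients (the exact s-SLBF/g-SLBF constructions are not reproduced here), the following combines a classifier with a stable-Bloom-filter-style updatable backup: before each insertion a few random counters are decremented, the standard mechanism for keeping the expected FPR bounded on an unbounded stream. The class names, parameters, and score function are illustrative, not the paper's.

```python
import random

class StableBloomFilter:
    """Sketch of a stable Bloom filter: before each insertion, a few
    randomly chosen counters are decremented, so stale members fade
    out and the expected FPR stays bounded on an unbounded stream."""
    def __init__(self, num_cells, num_hashes, max_val=3, decrements=10, seed=0):
        self.cells = [0] * num_cells
        self.num_hashes, self.max_val = num_hashes, max_val
        self.decrements = decrements
        self.rng = random.Random(seed)

    def _positions(self, item):
        return [hash((i, item)) % len(self.cells) for i in range(self.num_hashes)]

    def add(self, item):
        for _ in range(self.decrements):        # age out old entries
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        for p in self._positions(item):
            self.cells[p] = self.max_val

    def __contains__(self, item):
        return all(self.cells[p] > 0 for p in self._positions(item))

class StableLBF:
    """Hypothetical single-backup structure in the spirit of s-SLBF:
    the classifier answers high-score queries, the stable filter
    absorbs the insertion stream for everything else."""
    def __init__(self, score_fn, threshold, backup):
        self.score_fn, self.threshold, self.backup = score_fn, threshold, backup

    def insert(self, item):
        if self.score_fn(item) < self.threshold:
            self.backup.add(item)

    def __contains__(self, item):
        return self.score_fn(item) >= self.threshold or item in self.backup

f = StableLBF(lambda x: 0.9 if x % 2 == 0 else 0.1, 0.5,
              StableBloomFilter(num_cells=256, num_hashes=3))
for k in [1, 3, 5, 7, 9]:
    f.insert(k)                  # odd keys go to the stable backup
assert 9 in f                    # most recent insertion is present
assert 100 in f                  # classifier-positive key
```

Unlike a plain Bloom filter, the aging step can eventually evict older members, which is exactly the bounded false negative rate the abstract trades for a stable FPR on streams.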
[Paper 3] Vacuum Filters: More Space-Efficient and Faster Replacement for Bloom and Cuckoo Filters
ABSTRACT
We present vacuum filters, a type of data structure to support approximate membership queries. Vacuum filters cost the smallest space among all known AMQ data structures and provide higher insertion and lookup throughput in most situations. Hence they can be used as a replacement for the widely used Bloom filters and cuckoo filters. Similar to cuckoo filters, vacuum filters also store item fingerprints in a table. The memory-efficiency and throughput improvements come from the innovation of a table insertion and fingerprint eviction strategy that achieves both high load factor and data locality without any restriction on the table size. In addition, we propose a new update framework to resolve two difficult problems for AMQ structures under dynamics, namely duplicate insertions and set resizing. The experiments show that vacuum filters can achieve 25% less space on average and similar throughput compared to cuckoo filters, and 15% less space and >10x throughput compared to Bloom filters, with the same false positive rates. AMQ data structures are widely used in various layers of computer systems and networks and are usually hosted on platforms where memory is limited and precious. Hence the improvements brought by vacuum filters can be considered significant.
Translation: We present the vacuum filter, a data structure supporting approximate membership queries. Vacuum filters cost the least space among all known AMQ data structures and in most situations provide higher insertion and lookup throughput, so they can replace the widely used Bloom and cuckoo filters. Like cuckoo filters, vacuum filters store item fingerprints in a table. The memory-efficiency and throughput improvements come from a new table-insertion and fingerprint-eviction strategy, which achieves both a high load factor and data locality without any restriction on the table size. In addition, a new update framework is proposed to resolve two difficult problems for AMQ structures under dynamic data, namely duplicate insertions and set resizing. Experiments show that, at the same false positive rate, vacuum filters use on average 25% less space than cuckoo filters with comparable throughput, and less space and higher throughput than Bloom filters (15% less space, >10x throughput).
Summary: Proposes the vacuum filter, which achieves minimal space with higher throughput, mainly via a new table-insertion and fingerprint-eviction strategy that attains a high load factor and data locality; it also introduces a new update framework that handles duplicate insertions and set resizing under dynamic data.
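Vacuum filters build on cuckoo-filter-style fingerprint tables. The sketch below shows that shared base (partial-key hashing with fingerprint eviction), not the vacuum-specific alternate-bucket rule; in particular it keeps the power-of-two table-size restriction that vacuum filters remove. All names and constants are illustrative.

```python
import random

BUCKET_SIZE = 4

def fingerprint(item):
    return (hash((2, item)) & 0xFF) or 1   # nonzero 8-bit fingerprint

class FingerprintTable:
    """Cuckoo-filter-style fingerprint table: each item stores a short
    fingerprint in one of two candidate buckets; on overflow, a
    resident fingerprint is evicted to its alternate bucket.
    Assumes num_buckets is a power of two."""
    def __init__(self, num_buckets, seed=0):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _alt(self, i, fp):
        # XOR rule: an involution for power-of-two table sizes
        return (i ^ hash((1, fp))) % self.num_buckets

    def insert(self, item, max_kicks=500):
        fp = fingerprint(item)
        i1 = hash((0, item)) % self.num_buckets
        for i in (i1, self._alt(i1, fp)):
            if len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, self._alt(i1, fp)))
        for _ in range(max_kicks):             # evict until something fits
            j = self.rng.randrange(BUCKET_SIZE)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(fp)
                return True
        return False                           # table effectively full

    def __contains__(self, item):
        fp = fingerprint(item)
        i1 = hash((0, item)) % self.num_buckets
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

t = FingerprintTable(num_buckets=64)           # 64 buckets x 4 slots
assert all(t.insert(x) for x in range(100))
assert all(x in t for x in range(100))
```

The vacuum filter's contribution sits exactly at `_alt` and the eviction loop: a multi-range alternate function that preserves locality and works for arbitrary table sizes, which is where the higher load factor comes from.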
[Paper 4] The Potential of Learned Index Structures for Index Compression
ABSTRACT
Inverted indexes are vital in providing fast keyword-based search. For every term in the document collection, a list of identifiers of documents in which the term appears is stored, along with auxiliary information such as term frequency and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, where machine-learned models replace common index structures such as B-tree indexes, hash indexes, and Bloom filters. These learned index structures require less memory and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models may be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace document postings of an inverted index, and then evaluate the compromises such an approach might have. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research.
Translation: Inverted indexes are vital for keyword-based search. For every term in a document collection, a list of identifiers of the documents containing the term is stored, along with auxiliary information such as term frequency and position offsets. Although very effective, inverted indexes require a lot of memory for web-sized collections. Recently the concept of learned index structures was introduced, in which machine-learned models replace traditional index structures such as B-tree indexes, hash indexes, and Bloom filters; these learned structures need less memory and can be computationally much faster. This paper considers whether such models can be applied to conjunctive Boolean querying. First, it studies how a learned model can replace the document postings of an inverted index, then evaluates the trade-offs such an approach may bring. Second, it evaluates the potential gains in memory requirements. The work shows that learned indexes also have great potential for inverted indexing, which seems to be a promising direction for future research.
Idea: a learned inverted index??
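One way to picture "a learned model replacing document postings" is the learned-index trick: fit a model from docID to its position in the sorted postings list, record the worst-case prediction error, and answer membership by searching only the window the model points at. The sketch below uses a plain least-squares line as the "model"; it illustrates the idea, not the paper's method.

```python
from bisect import bisect_left

class LearnedPostings:
    """Sketch: a model of docID -> position plus a worst-case error
    bound replaces binary search over a sorted postings list, so a
    lookup only scans the small window the model predicts."""
    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)
        n = len(self.doc_ids)
        # least-squares fit of position as a linear function of docID
        mean_x = sum(self.doc_ids) / n
        mean_y = (n - 1) / 2
        var = sum((x - mean_x) ** 2 for x in self.doc_ids)
        self.slope = sum((x - mean_x) * (i - mean_y)
                         for i, x in enumerate(self.doc_ids)) / var
        self.intercept = mean_y - self.slope * mean_x
        # the recorded max error turns the approximation into an exact index
        self.err = int(max(abs(self._predict(x) - i)
                           for i, x in enumerate(self.doc_ids))) + 1

    def _predict(self, doc_id):
        return self.slope * doc_id + self.intercept

    def __contains__(self, doc_id):
        p = round(self._predict(doc_id))
        lo = max(0, p - self.err)
        hi = max(lo, min(len(self.doc_ids), p + self.err + 1))
        window = self.doc_ids[lo:hi]
        j = bisect_left(window, doc_id)
        return j < len(window) and window[j] == doc_id

ids = [i * i + 7 for i in range(200)]   # a skewed, non-uniform postings list
lp = LearnedPostings(ids)
assert all(d in lp for d in ids)        # the error bound guarantees exactness
assert 9 not in lp and 10**9 not in lp
```

The compression angle from the abstract: a few model parameters plus one error bound can stand in for large portions of the postings data, at the cost of a bounded window scan per lookup.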
[Paper 5] Pavo: A RNN-Based Learned Inverted Index, Supervised or Unsupervised?
ABSTRACT:
The booms of big data and graphic processing unit technologies have allowed us to explore more appropriate data structures and algorithms with smaller time complexity. However, the application of machine learning as a potential alternative to traditional data structures, especially using deep learning, is a relatively new and largely uncharted territory. In this paper, we propose a novel recurrent neural network-based learned inverted index, called Pavo, to efficiently and flexibly organize inverted data. The basic hash function in the traditional inverted index is replaced by a hierarchical neural network, which makes Pavo able to adapt well to various data distributions while showing a lower collision rate as well as a higher space utilization rate. A particular feature of our approach is a novel unsupervised learning strategy for constructing the hash function. To the best of our knowledge, there are no similar results in the existing literature in which an unsupervised learning strategy is employed to design hash functions. Extensive experimental results show that the unsupervised model has some advantages over the supervised one. Our approaches not only demonstrate the feasibility of deep learning-based data structures for index purposes but also help developers make more accurate decisions on the design and configuration of data organization and operations, and on parameter tuning of the neural network, so as to improve the performance of information searching.
Translation: The rapid development of big data and GPU technology lets us pursue more appropriate data structures and algorithms with smaller time complexity. However, applying machine learning, and deep learning in particular, as an alternative to traditional data structures is a new and largely uncharted territory. This paper proposes Pavo, a novel recurrent-neural-network-based learned inverted index that organizes inverted data efficiently and flexibly. The basic hash function of the traditional inverted index is replaced by a hierarchical neural network, which lets Pavo adapt well to different data distributions, achieving a lower collision rate and higher space utilization. A distinctive feature of the approach is a novel unsupervised learning strategy for constructing the hash function; to the authors' knowledge, applying unsupervised learning to design hash functions has not appeared in the existing literature. Extensive experiments show that the unsupervised model has some advantages over the supervised one. The approach not only demonstrates the feasibility of deep-learning-based data structures for indexing, but also helps developers make more accurate decisions on the design and configuration of data organization and operations, and on neural-network parameter tuning, improving the performance of information retrieval.
Summary: Applies unsupervised learning to construct hash functions, producing a novel RNN-based learned inverted index.
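Pavo's core move is to replace the inverted index's hash function with a model of the key distribution. The sketch below substitutes an empirical CDF for the hierarchical RNN: it shows why a distribution-aware hash lowers the collision rate on skewed keys, but reproduces none of Pavo's architecture or training; the class name and comparison are illustrative.

```python
from bisect import bisect_right

class LearnedHash:
    """Sketch of the 'model as hash function' idea: the empirical CDF
    of a key sample stands in for Pavo's hierarchical RNN.  Mapping a
    key through its own distribution spreads keys evenly over the
    buckets, which is what lowers collisions on skewed data."""
    def __init__(self, sample_keys, num_buckets):
        self.sample = sorted(sample_keys)
        self.num_buckets = num_buckets

    def __call__(self, key):
        # empirical CDF in (0, 1], then scaled to a bucket index
        cdf = bisect_right(self.sample, key) / len(self.sample)
        return min(int(cdf * self.num_buckets), self.num_buckets - 1)

keys = [i * i for i in range(256)]      # heavily skewed key distribution
h = LearnedHash(keys, num_buckets=64)
learned = [0] * 64
naive = [0] * 64
for k in keys:
    learned[h(k)] += 1
    naive[k % 64] += 1                  # fixed modulo hash for comparison
assert max(learned) <= 8                # near-uniform bucket loads
assert max(naive) > max(learned)        # modulo hash collides badly on squares
```

A real learned hash must also generalize to keys outside the training sample, which is where Pavo's supervised-versus-unsupervised training question comes in.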