[CdL16] AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures
International peer-reviewed journal:
Journal of Computer Science and Technology,
vol. 31(1),
pp. 147-166,
2016
Keywords: Full text indexing, Large scale indexing, Algebraic signatures
Abstract:
We present the AS-Index, a new index structure for exact string search in disk-resident databases. The AS-Index
relies on a classical inverted file structure, its main innovation being a probabilistic search based on the properties of algebraic
signatures, which are used both for n-gram hashing and for pattern search. Specifically, the properties of our signatures allow a search
to be carried out by inspecting only two of the posting lists. The algorithm thus enjoys the unique feature of requiring a constant
number of disk accesses, independent of both the pattern size and the database size. We conduct extensive experiments
on large datasets to evaluate the behavior of our index. They confirm that it consistently provides a search performance proportional
to the two disk accesses necessary to obtain the posting lists. This makes our structure a choice of interest for the class of
applications that require very fast lookups in large textual databases.
We describe the index structure, our use of algebraic signatures and the search algorithm. We discuss the operational
trade-offs based on the parameters that affect the behavior of our structure, and present the theoretical and experimental
performance analysis. We next compare the AS-Index to the state-of-the-art alternatives and show that (i) the construction
time matches that of the competitors, due to the similarity of structures, and (ii) the search time consistently outperforms the
standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of
our search method.
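
The two-posting-list idea in the abstract can be illustrated with a small in-memory sketch. This is not the paper's implementation. It assumes a polynomial hash modulo a prime as a stand-in for algebraic signatures over a Galois field, a Python dictionary in place of the disk-resident inverted file, and posting entries that carry cumulative signatures. All names (`prefix_signatures`, `build_index`, `search`, `MOD`, `BASE`) are hypothetical.

```python
from collections import defaultdict

MOD = (1 << 61) - 1  # assumption: prime modulus standing in for a Galois field
BASE = 257           # assumption: stand-in for the primitive element


def prefix_signatures(text):
    """Return cum, where cum[i] is the signature of text[:i]."""
    cum = [0]
    power = 1
    for ch in text:
        cum.append((cum[-1] + ord(ch) * power) % MOD)
        power = power * BASE % MOD
    return cum


def signature(s):
    """Signature of a whole string (also used to hash n-grams)."""
    return prefix_signatures(s)[-1]


def build_index(text, n):
    """Map each n-gram signature to a posting list of
    (position, signature of text before the n-gram, signature of text through it)."""
    cum = prefix_signatures(text)
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[signature(text[i:i + n])].append((i, cum[i], cum[i + n]))
    return index


def search(index, pattern, n):
    """Probabilistic exact search that reads only two posting lists:
    those of the first and the last n-gram of the pattern."""
    m = len(pattern)
    if m < n:
        raise ValueError("pattern must be at least n characters long")
    sig_p = signature(pattern)
    first = index.get(signature(pattern[:n]), [])
    last = {pos: cum_after for pos, _, cum_after in index.get(signature(pattern[-n:]), [])}
    hits = []
    for pos, cum_before, _ in first:
        cum_after = last.get(pos + m - n)  # last n-gram must sit at the right offset
        if cum_after is None:
            continue
        # Signature of text[pos:pos+m], derived from the two cumulative values,
        # must equal the pattern signature shifted to position pos.
        if (cum_after - cum_before) % MOD == sig_p * pow(BASE, pos, MOD) % MOD:
            hits.append(pos)
    return sorted(hits)


index = build_index("abracadabra", 3)
print(search(index, "abra", 3))
print(search(index, "acad", 3))
```

Whatever the pattern length, the sketch consults only the two posting lists of the pattern's first and last n-grams. The middle of the pattern is never read from the text: it is checked through the signature comparison alone, which is why a match is probabilistic rather than certain.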