Approximative data structures for natural language processing



Some say software developers draw their motivation from minimizing or maximizing numbers in any given problem. That's a smug innuendo. From my experience, developers are always on the lookout for beautiful solutions, of which numbers are but a symptom. The usage of approximative data structures for language processing is one such example of a beautiful idea with nice numbers.


When dealing with natural language we're often confronted with the need of storing and retrieving language statistics, as e.g. n-Gram distributions, word counts, document frequency counts, etc. Inherent to the discrete nature of language, most of these statistics fall into a key-value schema, i.e. for a particular key (say, a word) we want to know a number. Let's focus on one particularly well known statistics, the document frequency (DF). The DF of a word is the number of documents in a corpus in which this word appears. The DF is used in the computation of the TFIDF, a simple yet proven heuristic for determining relevance.


To put things into perspective, the English Wikipedia has around 7Mio articles and the same order of magnitude of distinct words. If we were to use a hash map for storing the DFs, we would need approximately 100MB of memory just to store the words and counters (add at least another 100MB for data structure overhead). Thats for one language and one kind of statistics. One way to achieve lower memory requirements is to abdicate precision. This is a small price to pay, since we're dealing with corpus statistics, an intrinsically noisy affair.


Count-Min Sketch [C&M 2005] is a sub-linear space data structure for computing approximate frequencies. It allows us to trade-off exactitude for memory. By tuning its two only parameters ( ϵ and δ ) we can answer the question of how many times we have seen a word within ϵ ⋅ | w | (where | w | is the number of distinct words in a corpus) of the actual value with probability δ . The memory needed for this will be proportional to 1 ϵ ⋅ - log 2 ( 1 - δ ) . Count-Min Sketches (CMS) share some similarities with bloom filters by using pairwise independent hash functions and sharing memory among keys.


Figure 1

A CMS holds a two dimensional array of counters. The depth of the array determines the confidence probability ( δ ) and also the amount of hash functions we will use. The width determines the mean error we accept for count estimates. The two operations defined on a CMS are: add(key, value) and estimate(key). On add(key, value) we add value to the counters in the buckets corresponding to the computed hash values for each row using the independent hash functions (fig. 1). On estimate(key) we return the minimum value out of all corresponding buckets (fig. 2). As a side note: an easy improvement on the error expectation in detriment of a little runtime overhead is to change add(key, value) to only update the minimal buckets.


Figure 2

One easy observation is that a CMS suffers under saturation, meaning that the precision guarantees depend on the number of distinct entries (words in our case); i.e. the more distinct words are added, the larger the chance of hash collisions distorting the counts. In a CMS errors are always positive, since there is no operation decreasing counter values. Yet another observation is that the errors have bigger impact on rare entries. Following the latter observation, CMS errors have a smoothing effect on word counts, which in NLP is something often required. As a cherry on the cake: CMSs have associative properties and can be elegantly distributed.


At IntraFind we have been using CMSs and other approximation techniques (e.g. sampling) in a few nlp applications with great success. We see approximative data structures as a vital technology block in a big data realm. At the present moment we are working on a novel approximative data structure, which we call a converging map, to store mappings from normal forms of terms or term compounds to their most frequent surface form appearing in a corpus, without the need of storing neither the normal forms nor the less frequent surface forms.




  • [C&M 2005]: Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.


Der Autor

Breno Faria

Breno Faria, Head of Development, ist seit 2012 für die IntraFind Software AG tätig. Seit den späten 2000er Jahren beschäftigt er sich intensiv mit den Themen Content Analytics und Information Retrieval. 2015 übernahm er die Rolle des Entwicklungsleiters bei IntraFind.


Im Rahmen von Veranstaltungen, z.B. Berlin Buzzwords 2014 oder "IntraFind Enterprise Search Day 2015", referiert er regelmäßig über neue Technologien oder präsentiert innovative Lösungen aus IntraFind Kundenprojekten.