Counting counts –
arguments for using statistics to process language
In my early student years I had a casual conversation with a computer science researcher. The researcher worked on home care robotics for the elderly. I got interested and wanted to know more about it. The objective was to have a robot learn how to cook, clean a toilet and wash the dishes. All by itself. And: "at some point a machine will be able to understand text and learn stuff from the internet, so we concentrate on what happens after that".
This struck me as a very bold assumption. Here is an analogy: thinking about what kind of special stretching exercises I should do once I beat Usain Bolt in the 100m sprint. I'm not arguing that it's impossible (you don't know me!), and I won't deny the possibility of language understanding becoming a fact, but it was the hardest part of that system. And it got my attention.
Very soon I got my hands on a book on precisely this topic called "Foundations of Statistical Natural Language Processing" by Christopher Manning and Hinrich Schütze. It is a remarkably well-written book, by authors who don't shy away from looking beyond the borders of their field. The introductory chapter puts the approach of a quantitative, non-symbolic and non-logical Natural Language Processing (the authors' definition of statistical NLP) into its historical and language-philosophical context.
In this post I want to revisit some aspects of this introduction and go a little deeper into what is only one short section on page 17, namely Ludwig Wittgenstein's argument of "meaning is use" (Philosophical Investigations, 1953), and how it can be seen as a philosophical justification for statistical NLP, including machine learning.
Empirical arguments – the statistical modelling of language works
The successful use of statistical models in language is naturally a killer argument for using these methods. Naive Bayes classifiers, Latent Dirichlet Allocation models, Support Vector Machines, Deep Neural Nets and Hidden Markov Models have all achieved remarkable results on several NLP tasks in recent years. But how did this all begin?
The American linguist George K. Zipf was one of the first (in the early 20th century) to study the statistical properties of language. He found that if one orders the words of a language by their frequency – giving the most frequent word rank 1, the second rank 2, etc. – the product frequency * rank remains approximately constant across words. In other words, the most common word occurs approximately twice as often as the second most common word, three times as often as the third most frequent word, and so on. This remarkable property of natural languages is called Zipf's Law. Several well-known best practices in statistical NLP go back to this discovery, e.g. ignoring the words in stopword lists. These are lists of the most common words in a corpus, words that occur so often that they don't carry any statistical importance for a particular task. Zipf also discovered a negative correlation between a word's length and its frequency, as well as a positive correlation between a word's age and its frequency.
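The rank-frequency relationship is easy to check for oneself. The following sketch (a toy corpus stands in here for the large texts Zipf actually studied, so the effect is only hinted at) ranks words by frequency and prints the product rank * frequency:

```python
from collections import Counter

def zipf_table(text):
    """Rank words by frequency; return (word, rank, freq, rank*freq) rows."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()  # sorted by frequency, descending
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy corpus -- on a real corpus of millions of words, the rank*freq
# products cluster around a roughly constant value, as Zipf observed.
corpus = "the cat sat on the mat and the dog sat on the cat"
for word, rank, freq, product in zipf_table(corpus):
    print(f"{rank:>2}  {word:<5} freq={freq}  rank*freq={product}")
```

On this tiny input the law cannot really show itself; the point of the sketch is only the procedure: count, rank, multiply.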
These simple, yet remarkable findings show that there is something intrinsically statistical in natural languages. It's not clear what explains these phenomena – maybe the principle of least effort that Zipf believed to be a key characteristic of humans; maybe a greater mathematical law governing many things, language included. But we don't need to settle the question of causality; it suffices to accept that these properties are measurable.
In 1948, an engineering breakthrough happened in the area of communication – the development of Information Theory by Claude Shannon. Shannon worked at the famous Bell Labs on communication systems and devised a mathematical theory of communication over noisy channels. He showed that regardless of the noise in a communication channel, it is possible to communicate through it with an arbitrarily small error rate. This works up to a maximum rate, the channel capacity, which is determined by the channel's bandwidth and its signal-to-noise ratio.
Information Theory provides the pillars of all modern communication systems and models communication as a stochastic process (a statistical model). Key concepts of Information Theory such as "entropy" or "mutual information" have been found to relate closely to, and be useful in, the processing of natural languages.
Information Theory provides, for example, the theoretical lower limit achievable by compressing a text in a particular language.
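To make the entropy concept concrete, here is a minimal sketch that computes the Shannon entropy of a text's character distribution. Under a memoryless, character-level model, this is a lower bound on the average number of bits per character any compression scheme can achieve (models that exploit context push the bound much lower; Shannon himself estimated printed English at roughly one bit per character):

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy in bits per character of the unigram
    character distribution of `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

msg = "abracadabra"
h = char_entropy(msg)
print(f"entropy: {h:.3f} bits/char -> at least {h * len(msg):.1f} bits total")
```

A uniform distribution over four characters gives exactly 2 bits per character; skewed distributions, like those of natural language, give less, which is precisely what makes text compressible.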
The language-philosophical argument – Wittgenstein and “meaning is use”
All these empirical findings, which show that the statistical modeling of language is actually not a far-fetched idea, give us a more or less quantitative assurance that these methods are justified. Wittgenstein provides us with what could be called a qualitative argument, based on how he understands the inner workings of natural languages.
Wittgenstein argues that for a very large class of words there are no strictly definable meanings, in a mathematical sense. The question "what is the meaning of areté?", for example, is a question one cannot answer in a satisfying way (ask Meno! [https://en.wikipedia.org/wiki/Meno]). This argument seems strange at first, since we're used to asking precisely this question – even more so when learning a new language. Wittgenstein claims that the meaning of a word is given by its use in the language.
"Man kann für eine große Klasse von Fällen der Benützung des Wortes "Bedeutung" – wenn auch nicht für alle Fälle seiner Benützung – dieses Wort so erklären: Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache."

("For a large class of cases of the use of the word "meaning" – though not for all cases of its use – this word can be explained thus: the meaning of a word is its use in the language.")
Take as an example the German word "Spiel" (§66-§71), a noun very close to the English "game", but also used for the leisure activities of small children, such as playing with a ball or playing catch. If one is asked what the meaning of "Spiel" is, one idea could be to try to find one single common feature permeating all kinds of "Spiel". But what would the single common feature be between a professional game of chess and the kicking of a ball against a garage door? Are they all entertaining? Think about the game of Russian roulette! Oh, maybe all "Spiele" have a winner and a loser? Who wins and who loses in a game of solitaire? Maybe it's all about agility? But compare the agility necessary in chess with the one needed in soccer; and what agility do you need for a game of Bingo? One quickly realizes that all "Spiele" are part of an incredibly complex web of relations, but I defy anyone to find one single common characteristic shared by all "Spiele".
Wittgenstein calls this "Familienähnlichkeiten" (family resemblances). Just as the members of a family share several characteristics, e.g. facial expressions, height, eye color or temper, but never all of them at once, the uses of a word also share several features. Another very nice analogy is that of a thread. The thread is composed of thousands of fibers that interleave, some touching each other, most not touching at all and lying far apart from one another – but together composing one single object we call a thread. "Spiel" is one such thread, composed of many fibers (instances of a "Spiel"), some sharing portions of what makes them a "Spiel", some not.
The only way to explain the "meaning" of "Spiel" is to actually enumerate all known uses of "Spiel" in the language. One could say that "Spiel" has a fuzzy definition – but does this make the word useless? I believe not. I even believe that this fuzziness is more informative and meaningful than a clear-cut definition of "Spiel" would be.
Of course, this exercise can be extended to many other words. The first time I read these aphorisms, they resonated with my experience and quickly convinced me ("Yes! This is how I understand language!"). The beauty of this idea is that it provides a very convincing philosophical theory justifying the approach of collecting examples of text (corpora) and trying to learn patterns from them, which is how statistical NLP and machine learning work. Furthermore, Wittgenstein's argument that "meaning is use" implies that there is no other way to understand a large part of language; this is a fundamental property of language, and this is how humans understand and use it.
And this is a damned good argument for doing our job the way we do it.
Breno Faria, Head of Development, has been with IntraFind Software AG since 2012. Since the late 2000s he has worked intensively on content analytics and information retrieval. In 2015 he took over the role of Head of Development at IntraFind.
At events such as Berlin Buzzwords 2014 or the "IntraFind Enterprise Search Day 2015", he regularly speaks about new technologies and presents innovative solutions from IntraFind customer projects.