Monday, April 22, 2013

Simple statistical language detection

Given one sample news article in English, create a statistical analyzer in 30 minutes that can reliably recognize English text. Before we begin, let’s consider that even though not everyone speaks French, almost all of us can recognize it.

The rules of a language are visible from the very lowest levels all the way up to the culture itself, so creating a simple language detector that deals with the lower levels of the language shouldn’t be too complex – at least in theory.



The problem

One of the projects I’m currently working on requires me to detect whether user input is in English, to make sure that user-generated content is placed in the right category. Unluckily for me, Google has shut down its language detection service, so I had to find something on my own.

After considering a couple of libraries I decided to investigate a bit: in theory, every language has very specific phonemes, so I should be able to find the typical English sounds and match the incoming user content against them.

For instance, everyone knows that the “th” sound is quite common in English, so a lot of words should contain it. Similarly, “he”, “er”, and “in” are frequent phonemes too. I wanted to go a bit higher and try to catch 3-grams as well, and found that “the”, “ear” and “ion” are quite heavily used too.

The process

To collect the n-grams, all I needed to do was feed a couple of CNN articles to my processor as English samples. The processor split the samples into individual lowercase words and inferred 2-grams and 3-grams from them. Breaking down the word “theory”, it emitted the following chunks:

theory
th
 he
  eo
   or
    ry
the
 heo
  eor
   ory
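This slicing step can be sketched with a small helper (a hypothetical `ngrams` function of my own; the article’s actual implementation appears in the code sample at the end):

```javascript
// Extract all n-grams of length n from a word by sliding a window over it.
function ngrams(word, n) {
    var out = [];
    for (var i = 0; i + n <= word.length; i++) {
        out.push(word.substr(i, n));
    }
    return out;
}

ngrams("theory", 2); // ["th", "he", "eo", "or", "ry"]
ngrams("theory", 3); // ["the", "heo", "eor", "ory"]
```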

As a first step, I did this parsing for every word in the sample and counted how frequent each n-gram was.

Deciding on a sentence

As there are many words that are used across different languages (like “must” in English and Hungarian, or “dies” in English and German), it’s not sufficient to decide the language based on a single word. However, a whole sentence gives us enough confidence to say whether it must be English or possibly something else.

To decide on a single word, the analyzer infers all the 2-grams and 3-grams from the word and tries to match them against the dictionary. If an n-gram is found, it adds a weighted value to the word’s sum; if it is not found, or its frequency falls below a threshold (a very unlikely combination), it adds a negative value instead. Words that are too short (like “a” and “I”) are simply ignored. A sentence’s result is the average of its words’ sums:

Finally I'm doing something I'm interested in.
52      NaN -2    78        NaN 104        4
AVG: 47.2
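As a quick sketch of the averaging step: the two “I’m”s are too short to yield usable n-grams and are dropped, so only the five remaining word scores count.

```javascript
// Word scores from the example above; the NaN entries (too-short words)
// have already been dropped.
var scores = [52, -2, 78, 104, 4];
var sum = scores.reduce(function (a, b) { return a + b; }, 0);
var avg = sum / scores.length; // 236 / 5 = 47.2
```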

In the simplified sample code attached below, anything scoring above ~10 can quite confidently be considered English.

Multiple languages

If we need to detect multiple languages, a more sophisticated model is required, but the basic idea is pretty much the same: we check the input against every model and return the most probable result. It still might be off, though, as any simplistic language detector will fail to recognize a couple of words, especially if the sentence is short. This approach also does not take into account that a phoneme like “is” may be quite frequent in many languages, so it does not really determine which language we are seeing.
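A minimal sketch of that idea (the names here are mine, not the article’s; the `scorer` parameter stands in for something like the `analyzeAll` routine from the code sample at the end):

```javascript
// Run the same scorer against each language's dictionary and
// return the name of the language with the highest average score.
function detectLanguage(sentence, dicts, scorer) {
    var best = null;
    var bestScore = -Infinity;
    for (var lang in dicts) {
        var score = scorer(sentence, dicts[lang]);
        if (score > bestScore) {
            bestScore = score;
            best = lang;
        }
    }
    return best;
}

// Toy usage with a dummy scorer that just returns the stored value:
detectLanguage("bonjour", { en: -4, fr: 31 }, function (s, d) { return d; }); // "fr"
```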

Multiple languages with Bayes’ theorem

Back in the 18th century, Thomas Bayes noticed that a conditional probability of

P(A|B) 

which reads as “the probability of A given that B happened”, can be rewritten (since both P(A|B) * P(B) and P(B|A) * P(A) equal the probability of A and B happening together) as:

P(A|B) = P(B|A) * P(A) / P(B)

With languages, this can be read as follows: the probability that a phoneme we see comes from a specific language depends on how likely the language itself is, how likely the phoneme is overall, and how frequent that phoneme is in that specific language:

P(English|“is”) = P(“is”|English) * P(English) / P(“is”)

If we create a larger collection of phonemes for different languages, we can easily infer P(“is”) by the law of total probability: it is the sum of P(“is”|language) * P(language) across all languages. P(English) is a little trickier, as it really depends on the context – in my case I set it to 80% (0.8), since it’s somewhat unlikely that a user would contribute non-English content to an English category. For P(“is”|English) we simply check how frequent “is” is in the English dictionary.

To substitute actual numbers in a strictly bilingual analyzer with an English and a French dictionary: the phoneme “ex” was found 20 times out of 7159 n-grams in the French collection and 6 times out of 9754 in the English one – obviously it’s more frequent in French.

P(English|“ex”) = P(“ex”|English) * P(English) / P(“ex”)
(6/9754 * 0.8) / (6/9754 * 0.8 + 20/7159 * 0.2)
≈ 46.8%

If we only look at the likelihoods separately, we find significantly different values:

P(“ex”|English) = 0.061%
P(“ex”|French) = 0.279%

However, because English is much more likely as user input (80% in our case), the probability that a word is English even though it contains “ex” is still 46.8% – much closer to even than the raw likelihoods would suggest.

On the other hand, let’s see how likely it is that the word is a French one:

P(French|“ex”) = P(“ex”|French) * P(French) / P(“ex”)
(20/7159 * 0.2) / (6/9754 * 0.8 + 20/7159 * 0.2)
≈ 53.2%
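The two calculations above can be wrapped in a small helper (a sketch of my own, not the article’s code; the denominator is expanded via the law of total probability, and the counts are the “ex” frequencies quoted above):

```javascript
// P(language|ngram) for every language, via Bayes' theorem:
// posterior = likelihood * prior / sum over all languages of (likelihood * prior)
function posterior(likelihoods, priors) {
    var evidence = 0, lang;
    for (lang in likelihoods) {
        evidence += likelihoods[lang] * priors[lang];
    }
    var result = {};
    for (lang in likelihoods) {
        result[lang] = likelihoods[lang] * priors[lang] / evidence;
    }
    return result;
}

var p = posterior({ english: 6 / 9754, french: 20 / 7159 },
                  { english: 0.8, french: 0.2 });
// p.english ≈ 0.468, p.french ≈ 0.532
```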

In conclusion, seeing the “ex” phoneme in a word is not too significant in our case, as it gives roughly the same chance to both languages. However, seeing “de” or “le” is a much better sign of French text, and “th” is a very strong sign that we are looking at an English word (about 97.8% vs 2.2% with the same priors).

Code sample

The following code is an overly simple implementation of a language detector, without the Bayes probabilities. The results are surprisingly good nonetheless, even though the training sample is only a single article from the New York Times or France24.

var detect = {};

/**
 * Slices a sentence into a word array.
 */
detect.chunk = function (s) {
    s = s.toLowerCase();
    var sarr = s.split(/[ ,.;"'!?]/g);
    return sarr;
}

/**
 * Creates n-grams from a word and stores them in the dict.
 */
detect.ngram = function (w, n, dict) {
    for (var pos = 0; pos <= w.length - n; pos++) {
        var ng = w.substr(pos, n);
        dict.total++;
        if (dict.dict[ng]) {
            dict.dict[ng]++;
        } else {
            dict.dict[ng] = 1;
        }
    }
}

/**
 * Builds the probabilistic dictionary for a language based on sample text.
 */
detect.buildDict = function (s) {
    var sarr = detect.chunk(s);
    var d = {
        total: 0,
        dict: {}
    };
    var dict = d.dict;

    for (var i = 0; i < sarr.length; i++) {
        var word = sarr[i];
        if (!word) {
            continue;
        }

        // Skip words containing digits or punctuation – accented characters still pass.
        if (word.match(/[\s\d,./\<\>?;':"\[\]{}\\|=\+\-_`~!@#$%&*\(\)]/)) {
            continue;
        }

        detect.ngram(word, 2, d);
        detect.ngram(word, 3, d);
    }

    return d;
}

/**
 * Analyzes a single word with the given dictionary.
 */
detect.analyze = function (s, dict) {
    var ngrams = {
        total: 0,
        dict: {}
    };
    detect.ngram(s, 2, ngrams);
    detect.ngram(s, 3, ngrams);
    var sum = 0;
    for (var chunk in ngrams.dict) {
        var val = dict.dict[chunk];

        // Empirical values - Bayes' approach is much better.
        if (val > 2) {
            sum += Math.pow(chunk.length, 2);
        } else {
            sum -= Math.pow(2 + (2 / chunk.length), 2);
        }
    }
    return Math.round(sum);
}

/**
 * Analyzes the sentence with the given dictionary.
 */
detect.analyzeAll = function (s, dict) {
    var sarr = detect.chunk(s);
    var res = [];
    var sum = 0;
    for (var i = 0; i < sarr.length; i++) {
        var word = sarr[i];
        if (!word) {
            continue;
        }
        var r = detect.analyze(word, dict);
        if (r) {
            res.push(r);
            sum += r;
        }
    }

    return [res.length ? sum / res.length : 0, res];
}

var sampleEN = "##YOUR ENGLISH SAMPLE TEXT HERE##";

var dict = detect.buildDict(sampleEN);

console.log(detect.analyzeAll("This is a typical sentence with bells and whistles", dict));
