Simple statistical language detection
Given one sample news article in English, create a statistical analyzer in 30 minutes that is able to reliably recognize English text.

Before we begin, let’s consider that even though not everyone speaks French, almost all of us are able to recognize it. The rules of a language are visible at every level, from the very lowest up to the culture itself, so creating a simple language detector that deals only with the lower levels of the language shouldn’t be too complex. Well, at least in theory.

The problem

One of the projects I’m currently working on requires me to detect whether user input is in English, to make sure that user-generated content is placed in the right category. Unfortunately for me, Google has shut down its language detection service, so I had to find something on my own. After considering a couple of libraries I decided to investigate a bit: in theory, every language has very specific phonemes, so I should be able to find the typical English sounds and match my incoming user conte