Showing posts from August, 2017

Building statistical machine translation based on sample bilingual text corpus

Training a weighted graph between the source and destination words and phrases in a sample bilingual text corpus yields an acceptable statistical machine translation dictionary. Following up on my previous post about language features, I felt that the simple sentence-to-sentence translation yielded too few translations, even though they were correct. I wanted to build something that creates a much bigger dictionary, even if it is noisier: false positives are still better than no translation at all.

Larger graph

Using the same sample, Don Juan from Project Gutenberg, and mapping the Spanish source to the English translation, I used a different approach this time: map each word and phrase in the source sentence to every word and phrase in the target sentence, adding more and more weight to transitions that keep appearing in the corpus. To quickly illustrate the idea, let's consider the following input and output sentences, where each letter represents a word: a b c
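The mapping idea described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code used for the post: the helper names (`phrases`, `train`, `best_translation`), the `max_len` phrase limit and the toy sentence pairs are all hypothetical, and simple co-occurrence counts stand in for whatever weighting the real implementation uses.

```python
from collections import defaultdict


def phrases(words, max_len=2):
    """Return every contiguous phrase of 1..max_len words (hypothetical helper)."""
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def train(pairs, max_len=2):
    """Count how often each source phrase co-occurs with each target phrase."""
    weights = defaultdict(lambda: defaultdict(int))
    for source, target in pairs:
        for s in phrases(source.split(), max_len):
            for t in phrases(target.split(), max_len):
                # Every repeated co-occurrence adds weight to the transition.
                weights[s][t] += 1
    return weights


def best_translation(weights, phrase):
    """Pick the heaviest transition for a phrase, or None if it was never seen."""
    candidates = weights.get(phrase)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy data (assumption): each letter stands for a word, as in the text above.
pairs = [("a b", "x y"), ("a c", "x z")]
weights = train(pairs)
print(best_translation(weights, "a"))
```

Because "a" shows up next to "x" in both toy pairs, that transition ends up heavier than the noisy ones such as "a" to "y". With a real corpus the same effect is what should let correct translations rise above the false positives.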

Detecting language features and building machine translation based on sample bilingual corpus

Representing a language as a directed graph can give us insights into how the language is structured and what its typical phrases are. Potentially, if this graph can be matched to another graph in a different language, it could also be used to build up a machine translation graph automatically.

Language as a graph

Let's say we analyse the following two simple sentences: "The cat is brown." "Don't play with the cat." These would yield the following directed graph:

As we can see, the transition from the "the" node to the "cat" node has a weight of 2, because that transition occurs twice in this sample, making "cat" the most likely word to follow "the". Given a large enough text corpus, the typical phrases in a language would become obvious, and if we give correct weights to the transitions while removing the noise, the graph would actually be meaningful.

Analysing Don Juan

The first task was to obtain a large enough homogeneous corpus in
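A minimal sketch of how such a graph could be built from the two example sentences is shown here. The function names and the tokenisation rule (lowercase, letters and apostrophes only) are my assumptions for illustration; the post does not state how the text was actually tokenised.

```python
import re
from collections import defaultdict


def tokenize(sentence):
    # Assumption: lowercase everything and keep only letters and apostrophes.
    return re.findall(r"[a-z']+", sentence.lower())


def build_graph(sentences):
    """Build a directed graph where graph[a][b] counts transitions a -> b."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = tokenize(sentence)
        # Pair each word with the word that follows it in the same sentence.
        for current, following in zip(words, words[1:]):
            graph[current][following] += 1
    return graph


graph = build_graph(["The cat is brown.", "Don't play with the cat."])
print(dict(graph["the"]))  # {'cat': 2}
```

The nested dictionary keeps the structure simple: the outer key is the node, the inner key is the node it points to, and the value is the weight of that edge.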