## Wednesday, August 16, 2017

### Building statistical machine translation based on sample bilingual text corpus

Training a weighted graph between the source and destination words and phrases in a sample bilingual text corpus yields an acceptable statistical machine translation dictionary.

Following up on my previous post about language features, I felt that the simple sentence-to-sentence translation yielded too few translations, even though the ones it found were correct. I wanted to build something that creates a much bigger dictionary, even if it's noisier: false positives are still better than no translation at all.

## Larger graph

I used the same sample as before, Don Juan from Project Gutenberg, mapping the Spanish source to its English translation, but with a different approach this time: map each word and phrase in the source sentence to every word and phrase in the target sentence, adding more and more weight to transitions that keep appearing in the corpus.

To quickly illustrate the idea, consider the following input and output sentences, where each letter represents a word:

a b c -> w x y z

Without knowing what maps to what, simply create all possible mappings with the following rules:

- 1:1 - map every single word in the input to every single word in the output
- 1:2 - map every single word in the input to every two-word phrase in the output
- 2:1 - map every two-word phrase in the input to every single word in the output
- apply the same logic to 2:2, 3:2, 2:3, and 3:3

These steps are necessary to cater for differences in fertility between the languages, that is, one word or phrase in the source often corresponds to a different number of words in the target.
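
The rules above can be sketched in a few lines of Python. This is only an illustration, not the original code: it assumes phrases are contiguous n-grams of a whitespace-tokenized sentence, and the names `ngrams`, `candidate_pairs` and `LENGTH_PAIRS` are made up for the example. Read literally, these rules give 44 pairs for the toy sentence rather than the 39 reported in this post, so the original implementation presumably pruned some combinations in a way not described here.

```python
from itertools import product

def ngrams(words, n):
    """Return all contiguous n-word phrases of a tokenized sentence."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Phrase-length combinations (source:target) named in the rules.
LENGTH_PAIRS = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (2, 3), (3, 3)]

def candidate_pairs(source, target, length_pairs=LENGTH_PAIRS):
    """Pair every source phrase with every target phrase for each allowed length combination."""
    pairs = []
    for src_len, dst_len in length_pairs:
        pairs.extend(product(ngrams(source, src_len), ngrams(target, dst_len)))
    return pairs

pairs = candidate_pairs("a b c".split(), "w x y z".split())
print(len(pairs))   # 44 under this literal reading
print(pairs[:3])    # [('a', 'w'), ('a', 'x'), ('a', 'y')]
```

Keeping the allowed length combinations in a plain list makes it easy to experiment with other sets, for example adding 1:3 or dropping 3:3.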

Applying the above steps to the sample sentence would yield a total of 39 transitions, so I was expecting the Don Juan graph to be reasonably large. And it was: the graph had a total of 1,349,175 unique edges from 1,576,325 training transitions, meaning that 227,150 transitions landed on an edge that already existed and reinforced it.
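
To make the relationship between these numbers concrete, here is a small sketch that counts unique edges and repeat hits with a `Counter`. The function name `edge_statistics` and the toy transitions are invented for illustration; the point is only that the repeat-hit figure is the total number of transitions minus the number of unique edges (1,576,325 - 1,349,175 = 227,150).

```python
from collections import Counter

def edge_statistics(transitions):
    """Summarize a stream of (source phrase, target phrase) training transitions."""
    edge_counts = Counter(transitions)
    total = sum(edge_counts.values())
    unique = len(edge_counts)
    return {
        "transitions": total,
        "unique_edges": unique,
        # Transitions that landed on an edge that already existed.
        "repeat_hits": total - unique,
        # Edges that were seen at least twice.
        "reinforced_edges": sum(1 for count in edge_counts.values() if count > 1),
    }

# Made-up transitions, only to show the bookkeeping.
toy = [
    ("la", "the"), ("la", "the"), ("la", "the"),
    ("mesa", "table"), ("la mesa", "the table"),
]
stats = edge_statistics(toy)
print(stats)
```

Note that repeat hits and reinforced edges are different quantities: an edge seen three times contributes two repeat hits but is only one reinforced edge.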

## Weighting the edges

Adding the edges without any further consideration, that is, simply increasing the weight of an edge by a fixed value every time it is seen in training, didn't give usable results: the graph was too skewed and too noisy.

To get more sensible results, I had to factor the following into the weight added to an edge:
- the shorter the sentence is, the more relevant the transition is
- the closer the lengths of the phrases are, the more likely the transition is; for example, a 1:1 mapping is more likely than a 1:2
- the closer the positions of the two phrases are within their respective sentences, the more likely the transition is

More formally:

weight += 1 / (1 + len(source sentence)^2 + diff(sourcePos, destinationPos) + diff(source words count, dest words count))
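
As a sketch, the formula could be coded as below. Several details are assumptions on top of what the formula states: `diff` is taken to be the absolute difference, sentence length is counted in words, and positions are the 0-based index of the first word of each phrase. The names `transition_weight` and `reinforce` are illustrative only.

```python
from collections import defaultdict

def transition_weight(sentence_len, src_pos, dst_pos, src_words, dst_words):
    """Weight increment for one transition (diff assumed to be absolute difference)."""
    return 1.0 / (
        1
        + sentence_len ** 2            # shorter sentences give stronger evidence
        + abs(src_pos - dst_pos)       # phrases at similar positions score higher
        + abs(src_words - dst_words)   # 1:1 beats 1:2, 2:2 beats 2:3, and so on
    )

def reinforce(graph, src_phrase, dst_phrase, sentence_len, src_pos, dst_pos):
    """Add the weight of one observed transition to its edge in the graph."""
    graph[(src_phrase, dst_phrase)] += transition_weight(
        sentence_len, src_pos, dst_pos,
        len(src_phrase.split()), len(dst_phrase.split()),
    )

graph = defaultdict(float)
# Toy observation: a 3-word source sentence, both phrases starting at position 0.
reinforce(graph, "la", "the", sentence_len=3, src_pos=0, dst_pos=0)
reinforce(graph, "la", "the", sentence_len=3, src_pos=0, dst_pos=0)
print(graph[("la", "the")])  # two increments of 1 / (1 + 9 + 0 + 0)
```

Because the sentence length is squared, it dominates the denominator for anything but very short sentences, so the position and length terms mostly act as tie-breakers between transitions coming from the same sentence.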

## Results

While there is definitely room for improvement in the weights, the results were more than adequate. After generating the sample translation with a weight cut-off, I ended up with 20,223 translations. A lot of these were obviously just noise, but the more obvious translations were there and (at least somewhat) correct. For instance:

qué -> ['what', 'what's', 'you']
sí -> ['yes', 'the', 'no']
no -> ['no', 'i', 'don't']
la -> ['the', 'you', 'i']
el -> ['the', 'and', 'i']
vamos -> ['go', 'let's', 'let's go']
yo -> ['i', 'i am', 'am']
yo soy -> ['that's me', 'that's', 'i am']
estoy -> ['i', 'am', 'am i']
estoy yo -> ['am i', 'am i not', 'am']
qué pasa -> ['what's happening', 'what's', 'happening']
extraño -> ['strange', 'a', 'man']
es su casa -> ['is his house', 'is his', 'this is his']
quiero hablar con -> ['i wish to', 'i wish', 'wish to speak']
vete -> ['go', 'go in', 'in']
bebamos -> ['lets', 'lets drink', 'drink']
por qué -> ['why', 'for what', 'for']
la mesa -> ['the', 'table', 'the table']
en la mesa -> ['at the table', 'at the', 'part at the']
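
A listing like the one above could be produced from the weighted graph roughly as follows. This is a sketch under assumptions: the graph is a dict keyed by (source, target) pairs, candidates are ranked by edge weight, and the top three are kept. The function name `best_translations`, the cut-off value and the toy weights are all made up for illustration.

```python
from collections import defaultdict

def best_translations(graph, cutoff, top_k=3):
    """Drop edges below the cut-off and return the top_k targets per source phrase."""
    by_source = defaultdict(list)
    for (src, dst), weight in graph.items():
        if weight >= cutoff:
            by_source[src].append((weight, dst))
    return {
        src: [dst for _, dst in sorted(cands, key=lambda c: c[0], reverse=True)[:top_k]]
        for src, cands in by_source.items()
    }

# Invented weights, only to show the mechanics.
toy_graph = {
    ("vamos", "go"): 0.9,
    ("vamos", "let's"): 0.6,
    ("vamos", "let's go"): 0.5,
    ("vamos", "the"): 0.4,
    ("vamos", "and"): 0.01,
}
result = best_translations(toy_graph, cutoff=0.05)
print(result)  # {'vamos': ['go', "let's", "let's go"]}
```

Raising the cut-off trades dictionary size for precision: fewer source phrases get any translation at all, but the ones that remain are backed by more evidence.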

Full Spanish - English translation available here.

(A very small subset of the new translation graph, highlighting the most confident edges in red.)

## Limitations

The sample used was very small, which resulted in a very noisy output with a lot of incorrect translations. This is a general problem with statistical machine translation: it will translate practically anything, but frequently incorrectly, and without any indication that it is wrong.
The language used in Don Juan is somewhat archaic, so a lot of modern phrases were not present at all.
Also, the mapping didn't take into account anything longer than 3 words, so certain phrases were incorrectly mapped because of this chunking.

(Dedicated to P. Thanks!)