Posts

Showing posts from 2017

Building statistical machine translation based on sample bilingual text corpus

Training a weighted graph between the source and destination words and phrases in a sample bilingual text corpus yields an acceptable statistical machine translation dictionary. Following up on my previous post about language features, I felt that the simple sentence-to-sentence translation yielded too few translations, even though they were correct. I wanted to build something that creates a much bigger dictionary, even if it's noisier: false positives are still better than no translation at all.

Larger graph

Using the same sample, Don Juan from Project Gutenberg, and mapping the Spanish source to the English translation, I used a different approach this time: map each word and phrase in the source sentence to every word and phrase in the target sentence, adding more and more weight to transitions that keep appearing in the corpus. To quickly illustrate the idea, let's consider the following input and output sentence where each letter represents a word: a b c
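A rough Python sketch of the weighting scheme described above, under my own assumptions (whitespace tokenisation, phrases as contiguous n-grams up to max_n; the post's actual parameters and code are not quoted here):

    from collections import defaultdict
    from itertools import chain

    def phrases(words, max_n=3):
        # All contiguous n-grams of length 1..max_n: the "words and phrases".
        return chain.from_iterable(
            (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
            for n in range(1, max_n + 1)
        )

    def train(corpus, max_n=3):
        # corpus: iterable of (source_sentence, target_sentence) pairs.
        weights = defaultdict(lambda: defaultdict(int))
        for src, tgt in corpus:
            tgt_phrases = list(phrases(tgt.lower().split(), max_n))
            for s in phrases(src.lower().split(), max_n):
                for t in tgt_phrases:
                    weights[s][t] += 1  # recurring pairings gain weight
        return weights

    def lookup(weights, src_phrase):
        # The heaviest edge wins; spurious pairings stay flat over a large corpus.
        candidates = weights.get(src_phrase)
        return max(candidates, key=candidates.get) if candidates else None

Because every source phrase is paired with every target phrase, wrong pairings do get counted, but they are spread thinly while true translations keep recurring, which is exactly the trade-off accepted here: a bigger, noisier dictionary.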

Detecting language features and building machine translation based on sample bilingual corpus

Representing a language as a directed graph can give us insights into how the language is structured and what its typical phrases are, and potentially, if this graph can be matched to the graph of a different language, it can be used to automatically build up a machine translation graph.

Language as a graph

Let's say we analyse the following two simple sentences: "The cat is brown." "Don't play with the cat." These would yield the following directed graph. As we can see, the transition from the "the" node to the "cat" node has a weight of 2, because in this sample the most likely transition from the word "the" is to the word "cat". Given a large enough text corpus, the typical phrases in a language would become obvious, and if we give correct weights to the transitions while removing the noise, the graph would actually be meaningful.

Analysing Don Juan

The first task was to obtain a large enough homogeneous corpus in
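The transition graph described under "Language as a graph" above takes only a few lines of Python. This is a minimal sketch, assuming a simple lowercasing tokeniser of my own; it reproduces the weight-2 edge from "the" to "cat":

    import re
    from collections import defaultdict

    def build_graph(sentences):
        # Weighted directed graph: graph[a][b] = how often word b follows word a.
        graph = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            words = re.findall(r"[a-z']+", sentence.lower())
            for a, b in zip(words, words[1:]):
                graph[a][b] += 1
        return graph

    graph = build_graph(["The cat is brown.", "Don't play with the cat."])
    print(graph["the"]["cat"])  # 2: both sentences follow "the" with "cat"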

Apache Spark - crash course

Working with NoSQL databases can be very inconvenient, as we lose even the basic tools to get insights into our data. For instance, answering very simple questions like “How many customers bought a specific product?” can be nearly impossible, depending on the data structure we built. Simply put, removing the GROUP BY, JOIN, and WHERE operators from a database would render it useless for ad-hoc queries. NoSQL databases never had these operators in the first place, making them a non-trivial choice for ad-hoc data analytics. Of course, if engineering had known up front what kind of queries would hit the system, they could have denormalized the data to the point where these very specific questions could be answered easily, but ad-hoc queries would still be impossible.

Spark to help

Apache Spark is a very interesting concept: it is a distributed data access engine that can work on multiple underlying data stores, providing a consistent API for developers. For instance, it can run on
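To make that concrete, here is a minimal PySpark sketch answering the “How many customers bought a specific product?” question; the dataset and column names are hypothetical, and in practice the DataFrame would be loaded from the underlying store through the appropriate Spark connector:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adhoc-analytics").getOrCreate()

    # Hypothetical purchase records; real data would come from e.g. a
    # Cassandra or HDFS connector instead of an in-memory list.
    purchases = spark.createDataFrame(
        [("alice", "laptop"), ("bob", "laptop"), ("alice", "phone")],
        ["customer", "product"],
    )

    # The WHERE + DISTINCT the NoSQL store itself cannot express:
    buyers = (purchases
              .where(purchases.product == "laptop")
              .select("customer")
              .distinct()
              .count())
    print(buyers)  # 2

    # Or, per product, with an explicit GROUP BY:
    purchases.groupBy("product").count().show()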