Posts

Showing posts from April, 2017

Apache Spark - crash course

Working with NoSQL databases can be very inconvenient, as we lose even the basic tools for getting insights from our data. For instance, answering a very simple question like “How many customers bought a specific product?” can be nearly impossible, depending on the data structure we built. Simply put, removing the GROUP BY, JOIN, and WHERE operators from a database would render it useless for ad-hoc queries. Many NoSQL databases never had these operators in the first place, which makes them a non-trivial choice for ad-hoc data analytics. Of course, if engineering knew up front what kind of queries would hit the system, they could denormalize the data to the point that these very specific questions can be answered easily, but any other ad-hoc query would still be impossible.

Spark to help

Apache Spark is a very interesting concept: it is a distributed data access engine that can work on multiple underlying data stores, providing a consistent API for developers. For instance, it can run on