Hive was the first tool to enable SQL on Hadoop and, to date, has used MapReduce as its primary execution engine. Szehon explained the motivation behind the Hive on Spark effort and the benefits the contributing team is seeing from Spark.
While a variety of newer SQL-on-Hadoop engines (Impala, Spark SQL, Presto, Drill) offer improved performance, many organizations have large investments in Hive. Hive on Spark is an effort to modernize the execution engine underneath Hive while retaining full HiveQL and metastore compatibility: the goal is to improve Hive's execution speed while preserving a smooth upgrade path for existing Hive users.
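That upgrade path is worth emphasizing: in current Hive releases, switching engines is a per-session configuration change rather than a query rewrite. A minimal sketch, assuming a Hive build with Spark support available (the table and column names here are made up for illustration):

```sql
-- Run the same HiveQL against Spark instead of MapReduce
SET hive.execution.engine=spark;

-- An unchanged, ordinary Hive query
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
```

Setting the property back to `mr` returns the session to the MapReduce engine, so existing workloads can be migrated incrementally.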
Typically, Hive executes its queries on Hadoop’s MapReduce framework, but a single SQL statement often translates into multiple Map and Reduce stages. At the end of each stage, the reducer writes its output to disk (HDFS), where later map stages must reload it, adding significant latency. With Spark, by contrast, the in-memory DAG execution model allows multiple transformations on the data without writing to disk between stages.
Another benefit is that the Hive query planner now has a more expressive execution engine on which it can run queries. At the core of Spark is the Resilient Distributed Dataset (RDD). RDDs support a much broader set of transformations than just Map and Reduce.
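One way to picture that broader transformation set is a plain-Python analogy. This is not Spark's actual API, and the data is invented; each step simply produces a new in-memory collection, the way chained RDD transformations avoid writing intermediate results to HDFS:

```python
# Plain-Python sketch of an RDD-style transformation chain.
# Each step yields a new in-memory collection; nothing is
# persisted to disk between "stages".
from collections import defaultdict

lines = ["a b", "b c", "a a"]

# flatMap-style step: split each line into words
words = [w for line in lines for w in line.split()]

# filter-style step: a transformation with no direct analogue
# in a bare Map/Reduce pair
kept = [w for w in words if w != "c"]

# reduceByKey-style step: count occurrences per word
counts = defaultdict(int)
for w in kept:
    counts[w] += 1

print(dict(counts))  # → {'a': 3, 'b': 2}
```

In real Spark code the same pipeline would be `flatMap`, `filter`, and `reduceByKey` calls on an RDD, and the planner can fuse such steps into a single in-memory stage.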
Check out the slides for more details.
Many thanks to Szehon for coming out to Bronto to speak with TriHUG!