February TriHUG: Hive on Spark

Bronto hosted February’s Triangle Hadoop User Group (TriHUG), featuring Szehon Ho from Cloudera talking about Hive on Spark.

Hive was the first tool to enable SQL on Hadoop and, until now, has primarily used MapReduce as its execution engine. Szehon explained the motivation behind the Hive on Spark effort and the benefits its contributors are seeing from Spark.

While a variety of newer SQL-on-Hadoop engines (Impala, SparkSQL, Presto, Drill) offer improved performance, many organizations have large investments in Hive. Hive on Spark is an effort to modernize the execution engine underneath Hive while retaining full HiveQL and metastore compatibility. The team's goal was to improve Hive's execution speed while keeping a smooth upgrade path for existing Hive users.

Typically, Hive executes its queries on top of Hadoop’s MapReduce framework, but SQL statements often translate into multiple Map and Reduce stages. At the end of each stage, the reducer “spills” the data down to disk (HDFS) to be reloaded by later map stages, resulting in much higher latency. With Spark, by contrast, the in-memory DAG execution model allows multiple transformations on the data without spilling to disk between stages.
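The difference between the two execution styles can be sketched with a deliberately simplified toy in plain Python. It is not how Hive, MapReduce, or Spark are implemented: the rows, the two steps (`sum_by_user`, `keep_large`), and the file names are all made up for illustration, and a local temp directory stands in for HDFS. The staged version writes each step's output to a file and reads it back, while the pipelined version hands results straight to the next step in memory.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical input rows, standing in for a Hive table.
ROWS = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 7},
    {"user": "a", "amount": 3},
    {"user": "c", "amount": 1},
]


def sum_by_user(rows):
    """Step 1: total the amounts per user."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return [{"user": user, "total": total} for user, total in totals.items()]


def keep_large(rows, threshold=5):
    """Step 2: keep only users whose total reaches the threshold."""
    return [row for row in rows if row["total"] >= threshold]


def run_staged(rows, workdir):
    """Write every step's output to disk and reload it for the next step."""
    writes = 0
    stage1_path = os.path.join(workdir, "stage1.json")
    with open(stage1_path, "w") as f:
        json.dump(sum_by_user(rows), f)
    writes += 1
    with open(stage1_path) as f:
        stage1 = json.load(f)

    stage2_path = os.path.join(workdir, "stage2.json")
    with open(stage2_path, "w") as f:
        json.dump(keep_large(stage1), f)
    writes += 1
    with open(stage2_path) as f:
        return json.load(f), writes


def run_pipelined(rows):
    """Chain both steps in memory with no intermediate files."""
    return keep_large(sum_by_user(rows)), 0


with tempfile.TemporaryDirectory() as workdir:
    staged_result, staged_writes = run_staged(ROWS, workdir)
pipelined_result, pipelined_writes = run_pipelined(ROWS)

print(staged_result, staged_writes)
print(pipelined_result, pipelined_writes)
```

Both versions produce the same rows; only the number of intermediate writes differs. On a real cluster each of those writes and reloads goes through a distributed file system, which is where the added latency comes from.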

Another benefit is that the Hive query planner now has a more expressive execution engine on which it can run queries. At the core of Spark is the Resilient Distributed Dataset (RDD). RDDs support a much broader set of transformations than just Map and Reduce.
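To give a feel for what a broader set of transformations means, here is a minimal sketch of a lazily evaluated dataset in plain Python. `ToyDataset` and its methods are hypothetical and are not Spark's API; they only mimic the general shape of chaining operations such as filter, join, and map, with nothing computed until `collect()` is called.

```python
from collections import defaultdict


class ToyDataset:
    """Hypothetical stand-in for an RDD-like dataset; evaluation is lazy."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    def map(self, fn):
        return ToyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return ToyDataset(lambda: (x for x in self._source() if pred(x)))

    def join(self, other):
        """Inner join of two datasets of (key, value) pairs."""
        def run():
            right = defaultdict(list)
            for key, value in other._source():
                right[key].append(value)
            for key, value in self._source():
                for match in right.get(key, []):
                    yield (key, (value, match))
        return ToyDataset(run)

    def collect(self):
        # The only method that actually runs the chained transformations.
        return list(self._source())


# Made-up data: order totals and user names, both keyed by user id.
orders = ToyDataset.from_list([("a", 5), ("b", 7), ("c", 1)])
names = ToyDataset.from_list([("a", "Ann"), ("b", "Bob")])

result = (
    orders.filter(lambda kv: kv[1] >= 5)       # keep large orders
    .join(names)                               # attach the user name
    .map(lambda kv: (kv[1][1], kv[1][0]))      # reshape to (name, amount)
    .collect()
)
print(result)
```

The whole chain is described as one plan and run in a single pass, rather than being forced into separate Map and Reduce steps.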

Check out the slides for more details.

Many thanks to Szehon for coming out to Bronto to speak with TriHUG!

Meetups @ Bronto

What is a meetup, you ask? The general idea is one of community. A meetup gives people with common interests a way to come together, share ideas, network, and have fun doing it. These principles represent core values here at Bronto. With continued growth in our space here at American Tobacco Campus, we have welcomed the greater community to share in it.

This post highlights many of those groups that are working to help make a difference in the Triangle through education.


Apache Drill with Keys Botzum

Bronto is a proud sponsor and supporter of the Triangle Hadoop Users Group (TriHUG). Now with over 400 members, TriHUG regularly hosts speakers from across the country, with topics related to the Hadoop ecosystem and large-scale distributed systems in general. Talks are held at Bronto HQ in Durham, NC.

Our July TriHUG meetup featured Keys Botzum from MapR Technologies talking shop about Apache Drill. Based on the Google Dremel paper, Drill is an incubating Apache project, led by contributors from MapR, Microsoft, Hortonworks and Oracle, with TriHUG’s own Grant Ingersoll as a mentor.