Posted to user@impala.apache.org by Devadutta Ghat <de...@cloudera.com> on 2016/04/29 01:36:46 UTC

SQL-on-Apache Hadoop – Choosing the right tool for the right job

Hi All,

We have put together a simple guide to help you choose between different
SQL-on-Apache Hadoop tools. You can find it here
https://vision.cloudera.com/sql-on-apache-hadoop-choosing-the-right-tool-for-the-right-job/

I have included the blog content here as well. Hope this helps.

Thanks
Devadutta

SQL-on-Apache Hadoop – Choosing the right tool for the right job
April 26, 2016
<https://vision.cloudera.com/sql-on-apache-hadoop-choosing-the-right-tool-for-the-right-job/>
By Devadutta Ghat <https://vision.cloudera.com/author/dghat/>
Categories: Product <https://vision.cloudera.com/category/product/>
A simple guide to understanding SQL-on-Hadoop choices

When it comes to SQL-on-Hadoop, it is easy to feel overwhelmed by the
number of choices available in the ecosystem. In reality, though,
choosing the right tool for the job at hand can be simple if you know
what you are trying to achieve, which in most cases is apparent from
your job title or team function. There are three main classes of
SQL-on-Hadoop tools available: ETL and Data Preparation Tools, Analytic
Databases, and Data Engineering Tools. Within the Hadoop ecosystem, the
prominent choices that emerge for each of these are Apache Hive, Apache
Impala (incubating), and Spark SQL, respectively.

Here is a quick summary:

[image: SQL-On-Hadoop]
<http://vision.cloudera.com/wp-content/uploads/2016/04/Screen-Shot-2016-04-25-at-10.22.08-AM.png>

Now, let’s look at each of these in more detail.
*ETL and Data Preparation Tool: Hive*

Hive was designed to make MapReduce batch processing accessible to a
wider group of users via a SQL-like language, HiveQL; consequently,
Hive is best suited for the same use cases as MapReduce. Hive-on-Spark
provides the same great Hive capabilities, using Apache Spark's
next-generation MR-style execution engine instead of MapReduce for
incrementally faster processing (Hive-on-Tez provides similar
capabilities using Tez instead of Spark). Compared to MapReduce, Spark
provides easier development with a rich set of APIs in Java, Scala,
and Python that make use of Spark's DAG engine, which delivers faster
processing by caching data in memory between iterations.

Hive’s ability to run scheduled batch jobs involving massive ETL sorts
and joins makes it one of the most widely used data preparation tools
for Hadoop, and most of the data served to BI users through Impala
today is prepared by ETL developers using Hive.
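
As a rough illustration, a scheduled batch step of that kind might look
like the HiveQL below, here submitted to HiveServer2 over JDBC from
Scala. This is a minimal sketch: the endpoint, credentials, and the
raw_orders, dim_customers, and daily_sales tables are all hypothetical,
and in practice such statements are usually run through beeline or a
workflow scheduler such as Oozie rather than hand-written JDBC.

import java.sql.DriverManager

object NightlyOrdersEtl {
  def main(args: Array[String]): Unit = {
    // Hypothetical HiveServer2 endpoint and credentials.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver2.example.com:10000/default", "etl_user", "")
    val stmt = conn.createStatement()
    try {
      // A typical HiveQL batch step: join raw facts against a dimension
      // table, aggregate, and overwrite one partition of a prepared table.
      stmt.execute(
        """INSERT OVERWRITE TABLE daily_sales PARTITION (ds = '2016-04-25')
          |SELECT c.region,
          |       o.product_id,
          |       SUM(o.amount) AS total_amount,
          |       COUNT(*)      AS num_orders
          |FROM raw_orders o
          |JOIN dim_customers c ON o.customer_id = c.customer_id
          |WHERE o.ds = '2016-04-25'
          |GROUP BY c.region, o.product_id""".stripMargin)
    } finally {
      stmt.close()
      conn.close()
    }
  }
}

The point is only the shape of the work: wide joins and aggregations
over full partitions, run on a schedule rather than interactively.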
*Data Engineering Tool: Spark and Spark SQL*

Spark is a powerful data engineering framework with a rich set of
developer APIs, ranging from core Spark primitives and SQL APIs to
machine learning algorithms with MLlib, streaming APIs, and Python/R
interfaces via PySpark and SparkR.

At its core, Spark is a next-generation MR-style system where, instead
of being confined to simple Map and Reduce operators, you can develop a
more robust set of Spark operators that use RDDs as the interchange
between operators rather than a forced shuffle stage. Spark SQL, like
the rest of Spark, translates SQL into underlying stages of Spark
operations on RDDs. This design easily allows the intermixing of SQL
with other Spark APIs, so data engineers can use SQL for common data
preparation tasks like aggregations, joins, and filters as part of a
broader Spark application that might use core Spark APIs and then feed
into a machine learning algorithm from MLlib.

For example, consider iteratively building a logistic regression model
over a dataset using Spark, and compare it with MapReduce. With Spark,
you can embed SQL within your Spark application written in Scala or
Java and use MLlib to build the regression model. As seen in the graph
below, Spark performs much better than MapReduce by caching data in
memory between iterations.
[image: Logistic Regression Performance (Data Fits in Memory)]
<http://vision.cloudera.com/wp-content/uploads/2016/04/Picture13.png>

Logistic Regression Performance (Data Fits in Memory)
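
To make that pattern concrete, here is a minimal sketch in Scala using
the Spark 1.6-era SQLContext and spark.ml APIs. The events data set,
its columns, and the churn label are hypothetical, and a real
application would add train/test splitting and model evaluation.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

object ChurnModel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ChurnModel"))
    val sqlContext = new SQLContext(sc)

    // SQL handles the common data preparation: per-user aggregation and
    // labeling. "events" is a hypothetical Parquet data set registered
    // as a temporary table.
    sqlContext.read.parquet("hdfs:///data/events").registerTempTable("events")
    val training = sqlContext.sql(
      """SELECT user_id,
        |       COUNT(*)             AS num_events,
        |       SUM(purchase_amount) AS total_spend,
        |       MAX(CASE WHEN churned THEN 1.0 ELSE 0.0 END) AS label
        |FROM events
        |GROUP BY user_id""".stripMargin)

    // The SQL result is an ordinary DataFrame, so it flows straight into
    // MLlib's pipeline API.
    val assembler = new VectorAssembler()
      .setInputCols(Array("num_events", "total_spend"))
      .setOutputCol("features")
    val prepared = assembler.transform(training).cache() // reused across iterations

    val model = new LogisticRegression().setMaxIter(10).fit(prepared)
    println(s"Model coefficients: ${model.coefficients}")
  }
}

Because the prepared DataFrame is cached, the repeated passes made
while fitting the model read from memory, which is where the gap with
MapReduce in the graph above comes from.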

In a recent announcement about The One Platform Initiative
<http://vision.cloudera.com/one-platform/>, we outlined our Spark
roadmap, doubling down on security, scale, management, and streaming.
With its developer friendliness and much better performance, Spark is
set to supplant MapReduce. Many of our customers, such as Allstate
<http://www.cloudera.com/resources/video/allstate-speaks-at-cloudera-sessions.html>
and Cox Automotive
<http://cloudera.com/content/www/en-us/customers/cox-automotive.html>,
have already leveraged Spark to boost their data processing pipelines.
*Analytic Database: Impala*

Impala is a modern MPP query engine that’s purpose-built for Hadoop to
provide BI and SQL analytics at interactive latencies. Analytic
databases like Impala are used by data analysts and BI users, typically
in conjunction with tools like Tableau, Zoomdata, or MicroStrategy. For
these users, there’s a *HUGE* advantage to clicking on a report or
visualization and getting a response in seconds, versus having to wait
minutes. Thus, interactivity is critical and must be maintained as
these tools scale to hundreds or even thousands of users.
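
For a sense of the workload, the kind of aggregate a dashboard fires on
each click might look like the query below. This is a hedged sketch
that submits it from Scala through the Hive JDBC driver against
Impala's HiveServer2-compatible port (21050 by default, shown here
without authentication); BI tools send essentially the same SQL through
their ODBC/JDBC connectors, and the sales_fact table and its ds
partition key are hypothetical.

import java.sql.DriverManager

object DashboardQuery {
  def main(args: Array[String]): Unit = {
    // Hypothetical impalad endpoint; 21050 is Impala's HiveServer2-compatible port.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://impalad.example.com:21050/;auth=noSasl")
    val stmt = conn.createStatement()
    try {
      // A typical interactive BI aggregate over a date-partitioned fact
      // table (ds is a 'yyyy-MM-dd' partition key): revenue and distinct
      // buyers by region for roughly the last month.
      val rs = stmt.executeQuery(
        """SELECT region,
          |       SUM(amount)                 AS revenue,
          |       COUNT(DISTINCT customer_id) AS buyers
          |FROM sales_fact
          |WHERE ds >= '2016-03-27'
          |GROUP BY region
          |ORDER BY revenue DESC""".stripMargin)
      while (rs.next()) {
        println(s"${rs.getString(1)}: ${rs.getDouble(2)} (${rs.getLong(3)} buyers)")
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}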

In a recent performance benchmark, we compared Hive, Impala, and Spark
SQL on analytic workloads. Based on the discussion above, it should not
come as a surprise that Impala, designed for analytic workloads,
performs best where interactive query latency across multiple users is
critical.

[image: Picture2]
<http://vision.cloudera.com/wp-content/uploads/2016/04/Picture22.png>

Not only is Impala the only one of these SQL-on-Hadoop engines that
delivers responses in seconds rather than minutes, but Impala’s
performance is also better when compared to traditional DBMSs and
analytic databases. Many of our customers, such as Quaero
<http://www.cloudera.com/customers/quaero.html> and Epsilon
<http://www.cloudera.com/customers/epsilon.html>, have benefited
firsthand from Impala’s performance lead when they compared it to
traditional systems such as Netezza. As we have discussed in the past,
performance is an ongoing priority for Impala
<https://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/>,
and the recent release of Impala 2.5 delivers 4x better performance
<http://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop>
compared to the previous version used in the benchmark above. With
multi-core joins and aggregates and a variety of other performance
enhancements coming up, we expect Impala’s performance lead to widen
further on analytic workloads.
*Conclusion*

As you can see, Hive, Impala, and Spark SQL all provide best-of-breed,
but different, functionality, and all serve an important purpose within
a complete Hadoop platform to address multiple workloads. No matter
what the task or role – be it data preparation, BI, or data science –
there is a right tool for you in the ecosystem. In fact, many of our
customers have different groups of users using all three technologies.
Today we find that most of the data served up for analytics by Impala
is prepared by Hive, and 80% of our Spark customers also use Impala to
serve data to their analysts, while their data engineers and data
scientists use Spark on this common data for statistical analysis.

To learn more about these technologies, check out the webinar
“SQL-on-Hadoop Technologies in Cloudera”
<http://www.cloudera.com/resources/recordedwebinar/hive-impala-and-spark-oh-my-sql-on-Hadoop-in-cloudera-5-5.html>.