You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@impala.apache.org by Devadutta Ghat <de...@cloudera.com> on 2016/04/26 23:09:47 UTC

Apache Impala (incubating) in CDH 5.7: 4x Faster for BI Workloads on Apache Hadoop

Hi All,

We have published our performance benchmarks for Impala 2.5 (released with
CDH 5.7) compared to Impala 2.3 here:
http://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/

Please let me know if you have any questions or comments. I have included
full text of the blog here for your convenience.

Thanks
Devadutta

Apache Impala (incubating) in CDH 5.7: 4x Faster for BI Workloads on Apache
Hadoop
April 26, 2016
<http://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/>By
Devadutta Ghat, Marcel Kornacker, Mostafa Mokhtar, and Henry Robinson
<http://blog.cloudera.com/?guest-author=Devadutta%20Ghat,%20Marcel%20Kornacker,%20Mostafa%20Mokhtar,%20and%20Henry%20Robinson>No
Comments
<http://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/#comments>
Categories: CDH <http://blog.cloudera.com/blog/category/cdh/> Impala
<http://blog.cloudera.com/blog/category/impala/> Performance
<http://blog.cloudera.com/blog/category/performance/>

*Impala 2.5, now shipping in CDH 5.7, brings significant performance
improvements and some highly requested features.*

Impala <http://impala.io/> has proven to be a high-performance analytics
query engine since the beginning. Even as an initial production release in
2013, it demonstrated performance 2x faster
<http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/>
than
a traditional DBMS, and each subsequent release has continued to demonstrate
<http://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/>
the
wide performance gap between Impala’s analytic-database architecture and
SQL-on-Apache Hadoop alternatives. Today, we are excited to continue that
track record via some important performance gains for Impala 2.5 (with more
to come on the roadmap), summarized below.

Overall, compared to Impala 2.3, in Impala 2.5:

   - TPC-DS queries run on average *4.3x faster*.
   - TPC-H queries run *2.2x faster on flat tables*, and *1.71x faster on
   nested tables*.

[image: impala25-f1]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f1.png>

*Figure 1: Performance comparison of Impala 2.5 vs Impala 2.3 on TPC-DS and
TPC-H. (See more details about workloads in Appendix.)*

Next, let’s review some of the new performance enhancements, as well as
some other significant features, in detail. (For a complete list, check out
the release notes here
<http://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html#new_features_250>
.)
Runtime Filters and Dynamic Partition Pruning

One of the important strategies in SQL query optimization is to find ways
to reduce the number of rows in leaf nodes in the execution-plan tree.
Runtime filters in Impala 2.5 do exactly that by computing filters during
query execution and propagating them from “upstream” to “downstream”
operators. Doing that reduces the amount of work performed not only by the
scanner but also by upstream operators. When applied on a partition column,
runtime filters automatically prune partitions (this special type of
runtime filtering is called *dynamic partition pruning*) before Impala even
starts scanning the tuples of a partitioned table, significantly reducing
the I/O cost of executing a query.

[image: impala25-f2]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f2.png>*Figure
2: With runtime filters, Impala 2.5 is on average 2.2x faster than 2.3 on
TPC-H benchmark queries (on a 1TB dataset), with TPC-H Q17 running 23x
faster.*

Learn more about runtime filters and dynamic partition pruning in the docs
<http://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html>
.
Faster Query Startup

In previous releases, when queries started execution, Impala would start
individual fragments one “level” of the plan tree at a time to ensure that
receivers of data were always ready when the senders started. This approach
led to a long start-up delay, particularly for complex queries with many
fragments (or in large clusters). In Impala 2.5, instead of starting
fragments in wave after wave, the query start-up logic allows fragments to
be started in any order, thereby increasing parallelism and reducing query
start-up latencies.

Figure 3 shows the resulting performance boost in query-startup times,
which often translate to a speedup in overall query execution.

[image: impala25-f3]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f3.png>*Figure
3: Query start-up time in Impala 2.5 is significantly faster than in Impala
2.3.*
Improved Cardinality Estimates and Join Order

Impala 2.5 brings a more robust scan cardinality estimation by mitigating
correlated predicates with an exponential back-off. Furthermore, join
cardinality estimation is also improved. These improvements, along with a
better join order selection (broadcast vs. shuffle), deliver up to 8x gains
for complex analytical queries like TPC-H Q8.
LLVM Codegen Coverage

Impala 2.5 adds LLVM codegen support for SORT and Top-N operations as well
as arithmetic operations onDECIMALs. As both SORT and Top-N are commonly
used functions by many third-party BI tools with which Impala integrates,
these improvements in codegen will be readily noticeable by end-users.

[image: impala25-f4]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f41.png>

*Figure 4: With codegen for Top-N, Impala 2.5 is 1.72x more efficient on
this benchmark query.*

Learn more about LLVM codegen support in Impala here
<https://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/>
.
Catalog Improvements: Incremental Metadata Update

Impala 2.5 brings several catalog improvements that allow incremental
update of table metadata instead of force-reloading all table metadata
during DDL/DML operations. By reloading metadata of only “dirty” partitions
and reusing descriptors of HDFS files to avoid loading file/block metadata
for files that haven’t been modified, Impala 2.5 significantly reduces the
latency of DDL/DML operations (by up to 4x).

[image: impala25-f5]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f5.png>*Figure
5: Impala 2.5 is on average 4x faster on catalog-update stress tests than
2.3.*
Improvements to Admission Control

Since version 1.4, Impala has provided admission control capabilities to
let users throttle workloads to avoid oversubscription and to maximize
throughput based on number of concurrent queries. Impala 2.5 supports
memory-based admission control that can be configured and monitored
directly from Cloudera Manager; read more about managing the Admission
Control feature in the docs
<http://www.cloudera.com/documentation/enterprise/latest/topics/admin_impala_admission_control.html>
.
Improved Scalability and Reliability

One of the top priorities for Cloudera Engineering over the past year has
been to provide even better scalability and reliability
<http://blog.cloudera.com/blog/2016/03/quality-assurance-at-cloudera-static-source-code-analysis/>.
Impala 2.5 is huge step forward in that direction. Impala 2.5 went through
rigorous stress, scale, fault-injection, and performance testing using a
new in-house query generator specifically built for Impala.
Distributed Aggregations

Aggregations in Impala happens in two phases: pre-aggregation and merge.
The pre-aggregation phase greatly reduces network traffic if there are many
input rows per grouping value, but sometimes they are not effective and can
spill to disk under memory pressure. Impala 2.5 improves aggregations by
deciding at run time whether it is more efficient to do an initial
aggregation phase and pass along a smaller set of intermediate data, or to
pass raw intermediate data back to next phase of query processing to be
aggregated there. Doing so, Impala 2.5 speeds up certain aggregations of up
to 25% and reduces memory consumption up to 50% or more. In addition,
streaming pre-aggregations don’t spill to disk.

[image: impala25-f6]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f62.png>*Figure
6: In Impala 2.5, some aggregations are up to 25% faster.*
DECIMAL Arithmetic Improvements

Starting in Impala 1.4, the DECIMAL data type lets you store
fixed-precision values, which are commonly used for working with currency
values where it is important to represent values exactly and avoid rounding
errors. Impala 2.5 speeds up aggregations involving DECIMAL fields by
automatically triggering native code generation.

As shown in Figure 7, on the benchmark query
<https://gist.github.com/anonymous/b3aac9555a30d17b7ccf6961419e8ef8> Impala
2.5 runs at least 3x faster on aggregations involvingDECIMAL compared to
Impala 2.3. Impala 2.5 bridges the performance gap between DECIMAL andFLOAT/
DOUBLE, letting you run high-precision operations much faster.

[image: impala25-f7]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f7.png>*Figure
7: Impala 2.5 runs at least 3x faster on DECIMAL-related aggregations
compared to 2.3.*

Learn more about this feature in the docs
<http://www.cloudera.com/documentation/enterprise/latest/topics/impala_decimal.html>
.
Metadata-only Queries

Impala 2.5 speeds up min(), max(), ndv(), and aggregate functions with
distinct keywords that involve only the partition-key columns of
partitioned tables by using metadata to avoid table accesses for
partition-key scans. These functions are commonly used by third-party BI
tools, so most end-users will experience a noticeable improvement in
performance.

[image: impala25-f8]
<http://blog.cloudera.com/wp-content/uploads/2016/04/impala25-f8.png>*Figure
8: Performance of metadata-only queries is significantly improved in Impala
2.5.*
Conclusion

Performance has always been a top priority for the Impala team because
interactive query latency is so important for BI and analytics users. Less
than a year ago, the team outlined its roadmap
<https://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/>
for
2015-2016 and these performance improvements reflect a huge step forward on
that plan. That step is just the first of several more this year, however:
with several other performance optimizations (such as multi-core joins and
aggregations) on the community’s roadmap, the goal of 20x-plus performance
gains for Impala is in view.

If you’re interested in contributing to Impala to help bring this vision to
fruition, those contributions are very welcome!

*Devadutta Ghat is a Senior Product Manager at Cloudera.*

*Marcel Kornacker is a Tech Lead at Cloudera and the founder of Impala.*

*Mostafa Mokhtar is a Software Engineer at Cloudera.*

*Henry Robinson is a Software Engineer at Cloudera.*
Appendix

TPC-DS and TPC-H benchmark details:

   - 24 unmodified TPC-DS queries {3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47,
   52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98} on 15TB dataset
   - 22 TPC-H queries on 15TB dataset