Posted to commits@impala.apache.org by sa...@apache.org on 2018/04/23 17:38:52 UTC
[04/20] impala git commit: IMPALA-6886: [DOCS] Removed
impala_cluster_sizing.xml
IMPALA-6886: [DOCS] Removed impala_cluster_sizing.xml
Change-Id: I03d605d33ed6ced809074b1fc96def30ad0887fd
Reviewed-on: http://gerrit.cloudera.org:8080/10109
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/dfc17b86
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/dfc17b86
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/dfc17b86
Branch: refs/heads/2.x
Commit: dfc17b86a2df2e6a4675bce2d64676a29ec59231
Parents: 0ec3cd7
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Wed Apr 18 16:49:51 2018 -0700
Committer: Impala Public Jenkins <im...@gerrit.cloudera.org>
Committed: Thu Apr 19 22:10:21 2018 +0000
----------------------------------------------------------------------
docs/impala.ditamap | 2 +-
docs/topics/impala_cluster_sizing.xml | 371 -----------------------------
2 files changed, 1 insertion(+), 372 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/dfc17b86/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 7ef7d47..ec6d313 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -47,7 +47,7 @@ under the License.
</topicref>
<topicref href="topics/impala_planning.xml">
<topicref href="topics/impala_prereqs.xml#prereqs"/>
- <topicref href="topics/impala_cluster_sizing.xml"/>
+ <!-- Removed per Alan Choi's request on 4/18/2018 <topicref href="topics/impala_cluster_sizing.xml"/> -->
<topicref href="topics/impala_schema_design.xml"/>
</topicref>
<topicref audience="standalone" href="topics/impala_install.xml#install">
http://git-wip-us.apache.org/repos/asf/impala/blob/dfc17b86/docs/topics/impala_cluster_sizing.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_cluster_sizing.xml b/docs/topics/impala_cluster_sizing.xml
deleted file mode 100644
index 7b395c5..0000000
--- a/docs/topics/impala_cluster_sizing.xml
+++ /dev/null
@@ -1,371 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept id="cluster_sizing">
-
- <title>Cluster Sizing Guidelines for Impala</title>
- <titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
- <prolog>
- <metadata>
- <data name="Category" value="Impala"/>
- <data name="Category" value="Clusters"/>
- <data name="Category" value="Planning"/>
- <data name="Category" value="Sizing"/>
- <data name="Category" value="Deploying"/>
- <!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
- <data name="Category" value="Sectionated Pages"/>
- <data name="Category" value="Memory"/>
- <data name="Category" value="Scalability"/>
- <data name="Category" value="Proof of Concept"/>
- <data name="Category" value="Requirements"/>
- <data name="Category" value="Guidelines"/>
- <data name="Category" value="Best Practices"/>
- <data name="Category" value="Administrators"/>
- </metadata>
- </prolog>
-
- <conbody>
-
- <p>
- <indexterm audience="hidden">cluster sizing</indexterm>
- This document provides a very rough guideline to estimate the size of a cluster needed for a specific
- customer application. You can use this information when planning how much and what type of hardware to
- acquire for a new cluster, or when adding Impala workloads to an existing cluster.
- </p>
-
- <note>
- Before making purchase or deployment decisions, consult organizations with relevant experience
- to verify the conclusions about hardware requirements based on your data volume and workload.
- </note>
-
-<!-- <p outputclass="toc inpage"/> -->
-
- <p>
- Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
- Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
- Because work can be distributed in different ways for different queries, if some hosts are overloaded
- compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
- and overall slowness.
- </p>
-
- <p>
- For analytic workloads with star/snowflake schemas, using consistent hardware for all nodes (64 GB RAM,
- 12 × 2 TB hard drives, 2 × E5-2630L CPUs with 12 cores total, 10 Gb Ethernet network), the following
- table estimates the number of DataNodes needed in the cluster based on data size and the number of
- concurrent queries, for workloads similar to TPC-DS benchmark queries:
- </p>
-
- <table>
- <title>Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</title>
- <tgroup cols="6">
- <colspec colnum="1" colname="col1"/>
- <colspec colnum="2" colname="col2"/>
- <colspec colnum="3" colname="col3"/>
- <colspec colnum="4" colname="col4"/>
- <colspec colnum="5" colname="col5"/>
- <colspec colnum="6" colname="col6"/>
- <thead>
- <row>
- <entry>
- Data Size
- </entry>
- <entry>
- 1 query
- </entry>
- <entry>
- 10 queries
- </entry>
- <entry>
- 100 queries
- </entry>
- <entry>
- 1000 queries
- </entry>
- <entry>
- 2000 queries
- </entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>
- <b>250 GB</b>
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 5
- </entry>
- <entry>
- 35
- </entry>
- <entry>
- 70
- </entry>
- </row>
- <row>
- <entry>
- <b>500 GB</b>
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 10
- </entry>
- <entry>
- 70
- </entry>
- <entry>
- 135
- </entry>
- </row>
- <row>
- <entry>
- <b>1 TB</b>
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 15
- </entry>
- <entry>
- 135
- </entry>
- <entry>
- 270
- </entry>
- </row>
- <row>
- <entry>
- <b>15 TB</b>
- </entry>
- <entry>
- 2
- </entry>
- <entry>
- 20
- </entry>
- <entry>
- 200
- </entry>
- <entry>
- N/A
- </entry>
- <entry>
- N/A
- </entry>
- </row>
- <row>
- <entry>
- <b>30 TB</b>
- </entry>
- <entry>
- 4
- </entry>
- <entry>
- 40
- </entry>
- <entry>
- 400
- </entry>
- <entry>
- N/A
- </entry>
- <entry>
- N/A
- </entry>
- </row>
- <row>
- <entry>
- <b>60 TB</b>
- </entry>
- <entry>
- 8
- </entry>
- <entry>
- 80
- </entry>
- <entry>
- 800
- </entry>
- <entry>
- N/A
- </entry>
- <entry>
- N/A
- </entry>
- </row>
- </tbody>
- </tgroup>
- </table>
-
- <section id="sizing_factors">
-
- <title>Factors Affecting Scalability</title>
-
- <p>
- A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
- node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
- cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
- memory.
- </p>
-
- <p>
- If the workload is already network-bound (on a 10 Gb network), increasing the cluster size won’t reduce
- the network load; in fact, a larger cluster could increase network traffic because some queries involve
- <q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
- throughput in a network-constrained environment.
- </p>
-
- <p>
- Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
- concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor
- network is saturated yet. This can happen because currently Impala uses only a single core per node to
- process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system
- cannot run more than 2 such queries at the same time.
- </p>
-
- <p>
- Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound
- workload; the CPU simply will not be saturated, and per-node throughput will be lower than 1.6 GB/sec.
- In that case, consider increasing the memory per node.
- </p>
-
- <p>
- As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the
- throughput estimate.
- </p>
- </section>
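The memory-bound limit described above reduces to simple arithmetic. Here is a minimal sketch in Python; the function name and constants are illustrative values taken from the estimates in this section, not part of any Impala API:

```python
# Sketch of the memory-bound concurrency limit described above.
# Assumption (from the text, not an Impala API): each concurrent
# join/aggregation query reserves a fixed amount of memory per node.

PER_NODE_THROUGHPUT_GB_S = 1.6  # rough CPU-bound per-node estimate from the text

def max_concurrent_queries(node_ram_gb: float, per_query_mem_gb: float) -> int:
    """Upper bound on concurrent queries when memory is the bottleneck."""
    return int(node_ram_gb // per_query_mem_gb)

# Example from the text: 128 GB of RAM per node, 50 GB per join query.
print(max_concurrent_queries(128, 50))  # 2
```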
-
- <section id="sizing_details">
-
- <title>A More Precise Approach</title>
-
- <p>
- A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
- size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
- data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:
- </p>
-
-<codeblock>Eq 1: N > QPM * D / 100 GB
-</codeblock>
-
- <p>
- Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
- required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400.
- We can estimate the number of nodes using the equation above.
- </p>
-
-<codeblock>N > QPM * D / 100 GB
-N > 400 * 50 GB / 100 GB
-N > 200
-</codeblock>
-
- <p>
- Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
- </p>
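The worked example can be checked with a short script. This is an illustrative sketch (the helper names are ours, not part of Impala); the 100 GB per node per minute constant follows from the roughly 1.6 GB/sec per-node throughput estimate (1.6 × 60 ≈ 100):

```python
# Sketch of Eq 1: N > QPM * D / 100 GB.
# Helper names are illustrative; the 100 GB/minute/node constant
# follows from the ~1.6 GB/sec per-node throughput estimate.

def qpm(concurrent_queries: int, response_time_s: float) -> float:
    """Queries per minute at a given concurrency and response time."""
    return concurrent_queries * 60 / response_time_s

def min_nodes(qpm_value: float, avg_scan_gb: float,
              per_node_gb_per_min: float = 100) -> float:
    """Lower bound on cluster size from Eq 1."""
    return qpm_value * avg_scan_gb / per_node_gb_per_min

q = qpm(100, 15)         # 100 concurrent queries, 15 s response -> 400.0
print(min_nodes(q, 50))  # 200.0
```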
-
- <p>
- Depending on the complexity of the query, the processing rate of the query might change. If the query has
- more joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs,
- the processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does
- scans and filtering on numbers, the processing rate can be higher.
- </p>
- </section>
-
- <section id="sizing_mem_estimate">
-
- <title>Estimating Memory Requirements</title>
- <!--
- <prolog>
- <metadata>
- <data name="Category" value="Memory"/>
- </metadata>
- </prolog>
- -->
-
- <p>
- Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
- joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
- STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
- below to calculate the minimum memory requirement.
- </p>
-
- <p>
- Suppose you are running the following join:
- </p>
-
-<codeblock>select a.*, b.col_1, b.col_2, … b.col_n
-from a, b
-where a.key = b.key
-and b.col_1 in (1,2,4...)
-and b.col_4 in (....);
-</codeblock>
-
- <p>
- And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
- </p>
-
- <p>
- The memory requirement for the query is that the size of the right-hand table (<codeph>B</codeph>), after
- decompression, filtering (<codeph>b.col_n in ...</codeph>), and projection (keeping only certain columns),
- must be less than the total memory of the entire cluster.
- </p>
-
-<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
- selectivity factor from the predicate *
- projection factor * compression ratio
-</codeblock>
-
- <p>
- In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
- predicate on <codeph>B</codeph> (<codeph>b.col_1 in ...and b.col_4 in ...</codeph>) will select only 10% of
- the rows from <codeph>B</codeph> and for projection, we are only projecting 5 columns out of 200 columns.
- Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor.
- </p>
-
-<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
- selectivity factor from the predicate *
- projection factor * compression ratio
- = 100TB * 10% * 5/200 * 3
- = 0.75TB
- = 750GB
-</codeblock>
-
- <p>
- So, if you have a 10-node cluster where each node has 128 GB of RAM and you give 80% of it to Impala, you
- have about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle
- join queries of this magnitude.
- </p>
- </section>
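The memory estimate above can also be written as a small calculation. This is a sketch under the section's assumptions; the function names are illustrative, not an Impala API:

```python
# Sketch of the cluster memory estimate from this section.
# All factors come from the worked example in the text.

def cluster_memory_requirement_tb(smaller_table_tb: float, selectivity: float,
                                  projection_factor: float,
                                  compression_ratio: float) -> float:
    """Estimated in-memory size of the right-hand join table, in TB."""
    return smaller_table_tb * selectivity * projection_factor * compression_ratio

def usable_memory_tb(nodes: int, ram_gb_per_node: float,
                     impala_fraction: float = 0.8) -> float:
    """Memory available to Impala across the cluster, in TB."""
    return nodes * ram_gb_per_node * impala_fraction / 1000

need = cluster_memory_requirement_tb(100, 0.10, 5 / 200, 3)  # ~0.75 TB
have = usable_memory_tb(10, 128)                             # ~1.02 TB
print(have >= need)  # True
```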
- </conbody>
-</concept>