Posted to commits@impala.apache.org by sa...@apache.org on 2018/04/23 17:38:52 UTC

[04/20] impala git commit: IMPALA-6886: [DOCS] Removed impala_cluster_sizing.xml

IMPALA-6886: [DOCS] Removed impala_cluster_sizing.xml

Change-Id: I03d605d33ed6ced809074b1fc96def30ad0887fd
Reviewed-on: http://gerrit.cloudera.org:8080/10109
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/dfc17b86
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/dfc17b86
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/dfc17b86

Branch: refs/heads/2.x
Commit: dfc17b86a2df2e6a4675bce2d64676a29ec59231
Parents: 0ec3cd7
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Wed Apr 18 16:49:51 2018 -0700
Committer: Impala Public Jenkins <im...@gerrit.cloudera.org>
Committed: Thu Apr 19 22:10:21 2018 +0000

----------------------------------------------------------------------
 docs/impala.ditamap                   |   2 +-
 docs/topics/impala_cluster_sizing.xml | 371 -----------------------------
 2 files changed, 1 insertion(+), 372 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/dfc17b86/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 7ef7d47..ec6d313 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -47,7 +47,7 @@ under the License.
   </topicref>
   <topicref href="topics/impala_planning.xml">
     <topicref href="topics/impala_prereqs.xml#prereqs"/>
-    <topicref href="topics/impala_cluster_sizing.xml"/>
+    <!-- Removed per Alan Choi's request on 4/18/2018 <topicref href="topics/impala_cluster_sizing.xml"/> -->
     <topicref href="topics/impala_schema_design.xml"/>
   </topicref>
   <topicref audience="standalone" href="topics/impala_install.xml#install">

http://git-wip-us.apache.org/repos/asf/impala/blob/dfc17b86/docs/topics/impala_cluster_sizing.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_cluster_sizing.xml b/docs/topics/impala_cluster_sizing.xml
deleted file mode 100644
index 7b395c5..0000000
--- a/docs/topics/impala_cluster_sizing.xml
+++ /dev/null
@@ -1,371 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept id="cluster_sizing">
-
-  <title>Cluster Sizing Guidelines for Impala</title>
-  <titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Clusters"/>
-      <data name="Category" value="Planning"/>
-      <data name="Category" value="Sizing"/>
-      <data name="Category" value="Deploying"/>
-      <!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
-      <data name="Category" value="Sectionated Pages"/>
-      <data name="Category" value="Memory"/>
-      <data name="Category" value="Scalability"/>
-      <data name="Category" value="Proof of Concept"/>
-      <data name="Category" value="Requirements"/>
-      <data name="Category" value="Guidelines"/>
-      <data name="Category" value="Best Practices"/>
-      <data name="Category" value="Administrators"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p>
-      <indexterm audience="hidden">cluster sizing</indexterm>
-      This document provides a very rough guideline for estimating the size of a cluster needed for a specific
-      customer application. You can use this information when planning how much and what type of hardware to
-      acquire for a new cluster, or when adding Impala workloads to an existing cluster.
-    </p>
-
-    <note>
-      Before making purchase or deployment decisions, consult organizations with relevant experience
-      to verify the conclusions about hardware requirements based on your data volume and workload.
-    </note>
-
-<!--    <p outputclass="toc inpage"/> -->
-
-    <p>
-      Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
-      Impala divides the work evenly among cluster nodes, regardless of their exact hardware configuration.
-      Because work can be distributed in different ways for different queries, if some hosts are overloaded
-      compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
-      and overall slowness.
-    </p>
-
-    <p>
-      For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM,
-      12 x 2 TB hard drives, 2 x E5-2630L CPUs with 12 cores total, 10 Gb network), the following table estimates the number of
-      DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads
-      similar to TPC-DS benchmark queries:
-    </p>
-
-    <table>
-      <title>Cluster size estimation based on the number of concurrent queries and data size with a 20-second average query response time</title>
-      <tgroup cols="6">
-        <colspec colnum="1" colname="col1"/>
-        <colspec colnum="2" colname="col2"/>
-        <colspec colnum="3" colname="col3"/>
-        <colspec colnum="4" colname="col4"/>
-        <colspec colnum="5" colname="col5"/>
-        <colspec colnum="6" colname="col6"/>
-        <thead>
-          <row>
-            <entry>
-              Data Size
-            </entry>
-            <entry>
-              1 query
-            </entry>
-            <entry>
-              10 queries
-            </entry>
-            <entry>
-              100 queries
-            </entry>
-            <entry>
-              1000 queries
-            </entry>
-            <entry>
-              2000 queries
-            </entry>
-          </row>
-        </thead>
-        <tbody>
-          <row>
-            <entry>
-              <b>250 GB</b>
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              5
-            </entry>
-            <entry>
-              35
-            </entry>
-            <entry>
-              70
-            </entry>
-          </row>
-          <row>
-            <entry>
-              <b>500 GB</b>
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              10
-            </entry>
-            <entry>
-              70
-            </entry>
-            <entry>
-              135
-            </entry>
-          </row>
-          <row>
-            <entry>
-              <b>1 TB</b>
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              15
-            </entry>
-            <entry>
-              135
-            </entry>
-            <entry>
-              270
-            </entry>
-          </row>
-          <row>
-            <entry>
-              <b>15 TB</b>
-            </entry>
-            <entry>
-              2
-            </entry>
-            <entry>
-              20
-            </entry>
-            <entry>
-              200
-            </entry>
-            <entry>
-              N/A
-            </entry>
-            <entry>
-              N/A
-            </entry>
-          </row>
-          <row>
-            <entry>
-              <b>30 TB</b>
-            </entry>
-            <entry>
-              4
-            </entry>
-            <entry>
-              40
-            </entry>
-            <entry>
-              400
-            </entry>
-            <entry>
-              N/A
-            </entry>
-            <entry>
-              N/A
-            </entry>
-          </row>
-          <row>
-            <entry>
-              <b>60 TB</b>
-            </entry>
-            <entry>
-              8
-            </entry>
-            <entry>
-              80
-            </entry>
-            <entry>
-              800
-            </entry>
-            <entry>
-              N/A
-            </entry>
-            <entry>
-              N/A
-            </entry>
-          </row>
-        </tbody>
-      </tgroup>
-    </table>
-
-    <section id="sizing_factors">
-
-      <title>Factors Affecting Scalability</title>
-
-      <p>
-        A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
-        node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
-        cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
-        memory.
-      </p>
-
-      <p>
-        If the workload is already network-bound (on a 10 Gb network), increasing the cluster size won’t reduce
-        the network load; in fact, a larger cluster could increase network traffic because some queries involve
-        <q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
-        throughput in a network-constrained environment.
-      </p>
-
-      <p>
-        Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
-        concurrent queries because all of the allocated memory has already been consumed, but neither CPU, disk, nor
-        network is saturated yet. This can happen because currently Impala uses only a single core per node to
-        process join and aggregation queries. For a node with 128 GB of RAM, if the join in a query consumes 50 GB,
-        the system cannot run more than 2 such queries on that node at the same time.
-      </p>
-
-      <p>
-        Therefore, at most 2 cores are used and the CPU is not saturated. Throughput can still scale almost
-        linearly even for a memory-bound workload, but per-node throughput will be lower than 1.6 GB/sec. Consider
-        increasing the memory per node.
-      </p>
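A minimal Python sketch of the concurrency arithmetic above, using the example figures of 128 GB of RAM per node and a join that consumes roughly 50 GB on each node (variable names here are illustrative only):

    # Rough per-node concurrency limit for a memory-bound workload.
    node_memory_gb = 128   # RAM per node, from the example above
    join_memory_gb = 50    # memory the join consumes on each node

    max_concurrent_queries = node_memory_gb // join_memory_gb
    print(max_concurrent_queries)  # 2: no more than 2 such queries can run at once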
-
-      <p>
-        As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node figure
-        as the throughput estimate.
-      </p>
-    </section>
-
-    <section id="sizing_details">
-
-      <title>A More Precise Approach</title>
-
-      <p>
-        A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
-        size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
-        data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed;
-        the 100 GB divisor corresponds to the roughly 1.6 GB/sec (about 100 GB per minute) per-node processing rate
-        discussed above:
-      </p>
-
-<codeblock>Eq 1: N &gt; QPM * D / 100 GB
-</codeblock>
-
-      <p>
-        Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
-        required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400. We
-        can estimate the number of nodes using the equation above.
-      </p>
-
-<codeblock>N &gt; QPM * D / 100 GB
-N &gt; 400 * 50 GB / 100 GB
-N &gt; 200
-</codeblock>
-
-      <p>
-        Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
-      </p>
-
-      <p>
-        Depending on the complexity of the query, the processing rate might vary. If the query has more
-        joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
-        processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans
-        and filtering on numeric columns, the processing rate can be higher.
-      </p>
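As a rough illustration of Eq 1, here is a minimal Python sketch that estimates the node count from QPM and the average data scanned per query, assuming the roughly 100 GB-per-minute per-node rate discussed above (function and variable names are illustrative only):

    def estimate_nodes(queries_per_minute, avg_scan_gb, per_node_gb_per_min=100):
        """Rough node-count estimate from Eq 1: N > QPM * D / 100 GB."""
        return queries_per_minute * avg_scan_gb / per_node_gb_per_min

    # Example from the text: 100 concurrent queries, 15-second response time,
    # 50 GB scanned per query on average.
    qpm = 100 * 60 / 15               # 400 queries per minute, as in the example
    print(estimate_nodes(qpm, 50))    # 200.0, so plan for N > 200 nodes

For heavier queries, lower per_node_gb_per_min below 100 GB per minute, in line with the processing-rate caveat above.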
-    </section>
-
-    <section id="sizing_mem_estimate">
-
-      <title>Estimating Memory Requirements</title>
-      <!--
-  <prolog>
-    <metadata>
-      <data name="Category" value="Memory"/>
-    </metadata>
-  </prolog>
-      -->
-
-      <p>
-        Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
-        joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
-        STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
-        below to calculate the minimum memory requirement.
-      </p>
-
-      <p>
-        Suppose you are running the following join:
-      </p>
-
-<codeblock>select a.*, b.col_1, b.col_2, … b.col_n
-from a, b
-where a.key = b.key
-and b.col_1 in (1,2,4...)
-and b.col_4 in (....);
-</codeblock>
-
-      <p>
-        And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
-      </p>
-
-      <p>
-        The memory requirement for the query is the size of the right-hand table (<codeph>B</codeph>) after
-        decompression, filtering (<codeph>b.col_n in ...</codeph>), and projection (keeping only certain columns).
-        This size must be less than the total memory of the entire cluster.
-      </p>
-
-<codeblock>Cluster Total Memory Requirement  = Size of the smaller table *
-  selectivity factor from the predicate *
-  projection factor * compression ratio
-</codeblock>
-
-      <p>
-        In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
-        predicate on <codeph>B</codeph> (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>) will select only 10% of
-        the rows from <codeph>B</codeph>, and for projection, we project only 5 of the 200 columns.
-        Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor.
-      </p>
-
-<codeblock>Cluster Total Memory Requirement  = Size of the smaller table *
-  selectivity factor from the predicate *
-  projection factor * compression ratio
-  = 100TB * 10% * 5/200 * 3
-  = 0.75TB
-  = 750GB
-</codeblock>
-
-      <p>
-        So, if you have a 10-node cluster where each node has 128 GB of RAM and 80% of it is given to Impala, then
-        you have 1 TB of memory usable by Impala, which is more than 750 GB. Therefore, your cluster can handle join
-        queries of this magnitude.
-      </p>
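A minimal Python sketch of the memory estimate above, pulling together the formula and the usable-memory check; all figures (selectivity, projection, compression, node count) are the ones assumed in the example:

    # Memory needed to hold the right-hand (smaller) join table after
    # decompression, filtering, and projection, per the formula above.
    table_size_tb = 100      # table B, Parquet on disk
    selectivity   = 0.10     # predicate keeps ~10% of the rows
    projection    = 5 / 200  # only 5 of 200 columns are used
    compression   = 3        # ~3x Snappy compression, so decompressed size is 3x larger

    required_tb = table_size_tb * selectivity * projection * compression
    print(required_tb)       # roughly 0.75 TB, i.e. 750 GB

    # Usable cluster memory from the example: 10 nodes, 128 GB each, 80% for Impala.
    usable_tb = 10 * 128 * 0.8 / 1024
    print(usable_tb >= required_tb)   # True: this cluster can handle the join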
-    </section>
-  </conbody>
-</concept>