You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by jo...@apache.org on 2018/04/17 20:25:52 UTC

[3/4] impala git commit: IMPALA-6723: [DOCS] Hints for CTAS

IMPALA-6723: [DOCS] Hints for CTAS

Change-Id: I91d9f4f039a603382ff4415d1dd22a351279cbfa

IMPALA-6723 Hints for CTAS

Change-Id: I201a4e1ddaf62164e1f6b636c4e1e60af60e1af7

IMPALA-6723: [DOCS] Hints for CTAS

Optimizer hints were move out of SELECT section.
Hints for CTAS were added to the same section as INSERT.

Change-Id: I91d9f4f039a603382ff4415d1dd22a351279cbfa
Reviewed-on: http://gerrit.cloudera.org:8080/9993
Reviewed-by: Alex Rodoni <ar...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/ea698cd4
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/ea698cd4
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/ea698cd4

Branch: refs/heads/master
Commit: ea698cd497f63908b231e4b108c20d259e5bd8fb
Parents: 0e98b9a
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Fri Apr 6 09:56:48 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Tue Apr 17 00:34:52 2018 +0000

----------------------------------------------------------------------
 docs/impala.ditamap           |   2 +-
 docs/shared/impala_common.xml |   1 +
 docs/topics/impala_hints.xml  | 226 +++++++++++++++++++++++++++++--------
 3 files changed, 181 insertions(+), 48 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/ea698cd4/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 08b69ca..4e9c71a 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -166,7 +166,6 @@ under the License.
         <topicref rev="IMPALA-5309" href="topics/impala_tablesample.xml"/>
         <topicref href="topics/impala_with.xml"/>
         <topicref href="topics/impala_distinct.xml"/>
-        <topicref href="topics/impala_hints.xml"/>
       </topicref>
       <topicref href="topics/impala_set.xml">
         <topicref href="topics/impala_query_options.xml">
@@ -230,6 +229,7 @@ under the License.
       <topicref href="topics/impala_update.xml"/>
       <topicref href="topics/impala_upsert.xml"/>
       <topicref href="topics/impala_use.xml"/>
+      <topicref href="topics/impala_hints.xml"/>
     </topicref>
     <topicref href="topics/impala_functions.xml">
       <topicref href="topics/impala_math_functions.xml"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/ea698cd4/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index 12d7d4e..df58a53 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -3063,6 +3063,7 @@ select max(height), avg(height) from census_data where age &gt; 20;
         Another way to define different names for the same tables or columns is to create views. See
         <xref href="../topics/impala_views.xml#views"/> for details.
       </p>
+      <!--Alex R: Insert hints below is being refactored in impala_hints.xml fore more general purpose. Keep this for now for impala_paquet.xml.-->
 
       <p id="insert_hints">
         When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in

http://git-wip-us.apache.org/repos/asf/impala/blob/ea698cd4/docs/topics/impala_hints.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml
index 6cafcfb..b936b6a 100644
--- a/docs/topics/impala_hints.xml
+++ b/docs/topics/impala_hints.xml
@@ -20,8 +20,8 @@ under the License.
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="hints">
 
-  <title>Query Hints in Impala SELECT Statements</title>
-  <titlealts audience="PDF"><navtitle>Hints</navtitle></titlealts>
+  <title>Optimizer Hints</title>
+  <titlealts audience="PDF"><navtitle>Optimizer Hints</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -37,15 +37,13 @@ under the License.
   <conbody>
 
     <p>
-      <indexterm audience="hidden">hints</indexterm>
-      The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as
-      a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient
-      performance.
-    </p>
+      <indexterm audience="hidden">hints</indexterm> The Impala SQL supports
+      query hints, for fine-tuning the inner workings of queries. Specify hints
+      as a temporary workaround for expensive queries, where missing statistics
+      or other factors cause inefficient performance. </p>
 
-    <p>
-      Hints are most often used for the most resource-intensive kinds of Impala queries:
-    </p>
+    <p> Hints are most often used for the resource-intensive Impala queries,
+      such as: </p>
 
     <ul>
       <li>
@@ -61,41 +59,18 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
 
-    <p>
-      You can also represent the hints as keywords surrounded by <codeph>[]</codeph>
-      square brackets; include the brackets in the text of the SQL statement.
-      <note conref="../shared/impala_common.xml#common/square_bracket_hint_caveat"/>
-    </p>
-
-<codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
-<varname>join_left_hand_table</varname>
-  JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
-<varname>join_right_hand_table</varname>
-<varname>remainder_of_query</varname>;
-
-INSERT <varname>insert_clauses</varname>
-  [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
-  [<ph rev="IMPALA-2522 2.8.0">/* +CLUSTERED */</ph>]
-  SELECT <varname>remainder_of_query</varname>;
-
-<ph rev="2.12.0 IMPALA-4168">
-UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
-  [<ph rev="IMPALA-2522 2.8.0">/* +CLUSTERED */</ph>]
-  <varname>upsert_clauses</varname>
-  SELECT <varname>remainder_of_query</varname>;</ph>
-</codeblock>
-
-    <p rev="2.0.0">
-      In <keyword keyref="impala20_full"/> and higher, you can also specify the hints inside comments that use
-      either the <codeph>/* */</codeph> or <codeph>--</codeph> notation. Specify a <codeph>+</codeph> symbol
-      immediately before the hint name. Recently added hints are only available using the <codeph>/* */</codeph>
-      and <codeph>--</codeph> notation.
-      For clarity, the <codeph>/* */</codeph> and <codeph>--</codeph> styles
-      are used in the syntax and examples throughout this section.
-      With the <codeph>/* */</codeph> or <codeph>--</codeph> notation for
-      hints, specify a <codeph>+</codeph> symbol immediately before the first hint name.
-      Multiple hints can be specified separated by commas, for example
-      <codeph>/* +clustered,shuffle */</codeph>
+    <p rev="2.0.0"> In <keyword keyref="impala20_full"/> and higher, you can
+      specify the hints inside comments that use either the <codeph>/*
+        */</codeph> or <codeph>--</codeph> notation. Specify a
+        <codeph>+</codeph> symbol immediately before the hint name. Recently
+      added hints are only available using the <codeph>/* */</codeph> and
+        <codeph>--</codeph> notation. For clarity, the <codeph>/* */</codeph>
+      and <codeph>--</codeph> styles are used in the syntax and examples
+      throughout this section. With the <codeph>/* */</codeph> or
+        <codeph>--</codeph> notation for hints, specify a <codeph>+</codeph>
+      symbol immediately before the first hint name. Multiple hints can be
+      specified separated by commas, for example <codeph>/* +clustered,shuffle
+        */</codeph>
     </p>
 
 <codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
@@ -167,6 +142,43 @@ UPSERT -- +CLUSTERED
 UPSERT /* +CLUSTERED */
   <varname>upsert_clauses</varname>
   SELECT <varname>remainder_of_query</varname>;</ph>
+
+CREATE /* +SHUFFLE|NOSHUFFLE */
+  <varname>table_clauses</varname>
+  AS SELECT <varname>remainder_of_query</varname>;
+
+CREATE -- +SHUFFLE|NOSHUFFLE
+  <varname>table_clauses</varname>
+  AS SELECT <varname>remainder_of_query</varname>;
+
+CREATE /* +CLUSTER|NOCLUSTER */
+  <varname>table_clauses</varname>
+  AS SELECT <varname>remainder_of_query</varname>;
+
+CREATE -- +CLUSTER|NOCLUSTER
+  <varname>table_clauses</varname>
+  AS SELECT <varname>remainder_of_query</varname>;
+</codeblock>
+    <p>The square bracket style hints are supported for backward compatibility,
+      but the syntax is deprecated and will be removed in a future release. For
+      that reason, any newly added hints are not available with the square
+      bracket syntax.</p>
+    <codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
+<varname>join_left_hand_table</varname>
+  JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
+<varname>join_right_hand_table</varname>
+<varname>remainder_of_query</varname>;
+
+INSERT <varname>insert_clauses</varname>
+  [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
+  [<ph rev="IMPALA-2522 2.8.0">/* +CLUSTERED */</ph>]
+  SELECT <varname>remainder_of_query</varname>;
+
+<ph rev="2.12.0 IMPALA-4168">
+UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
+  [<ph rev="IMPALA-2522 2.8.0">/* +CLUSTERED */</ph>]
+  <varname>upsert_clauses</varname>
+  SELECT <varname>remainder_of_query</varname>;</ph>
 </codeblock>
 
     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
@@ -226,10 +238,130 @@ UPSERT /* +CLUSTERED */
     </ul>
 
     <p>
-      <b>Hints for INSERT ... SELECT queries:</b>
+      <b>Hints for INSERT ... SELECT and CREATE TABLE AS SELECT (CTAS):</b>
+    </p>
+    <p id="insert_hints">
+      When inserting into partitioned tables, such as using the Parquet file
+      format, you can include a hint in the <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT(CTAS)</codeph>
+      statements to fine-tune the overall performance of the operation and its
+      resource usage.</p>
+    <p>
+      You would only use hints if an <codeph>INSERT</codeph> or
+        <codeph>CTAS</codeph> into a partitioned table was failing due to
+      capacity limits, or if such an operation was succeeding but with
+      less-than-optimal performance.
     </p>
 
-    <p conref="../shared/impala_common.xml#common/insert_hints"/>
+    <ul>
+      <li>
+        <codeph>/* +SHUFFLE */</codeph> and <codeph>/* +NOSHUFFLE */</codeph> Hints
+        <ul>
+          <li>
+            <codeph>/* +SHUFFLE */</codeph> adds an exchange node, before
+            writing the data, which re-partitions the result of the
+              <codeph>SELECT</codeph> based on the partitioning columns of the
+            target table. With this hint, only one node writes to a partition at
+            a time, minimizing the global number of simultaneous writes and the
+            number of memory buffers holding data for individual partitions.
+            This also reduces fragmentation, resulting in fewer files. Thus it
+            reduces overall resource usage of the <codeph>INSERT</codeph> or
+              <codeph>CTAS</codeph> operation and allows some operations to
+            succeed that otherwise would fail. It does involve some data
+            transfer between the nodes so that the data files for a particular
+            partition are all written on the same node.
+
+            <p>
+              Use <codeph>/* +SHUFFLE */</codeph> in cases where an <codeph>INSERT</codeph>
+              or <codeph>CTAS</codeph> statement fails or runs inefficiently due
+              to all nodes attempting to write data for all partitions.
+            </p>
+
+            <p> If the table is unpartitioned or every partitioning expression
+              is constant, then <codeph>/* +SHUFFLE */</codeph> will cause every
+              write to happen on the coordinator node.
+            </p>
+          </li>
+
+          <li>
+            <codeph>/* +NOSHUFFLE */</codeph> does not add exchange node before
+            inserting to partitioned tables and disables re-partitioning. So the
+            selected execution plan might be faster overall, but might also
+            produce a larger number of small data files or exceed capacity
+            limits, causing the <codeph>INSERT</codeph> or <codeph>CTAS</codeph>
+            operation to fail.
+
+            <p> Impala automatically uses the <codeph>/*
+                +SHUFFLE */</codeph> method if any partition key column in the
+              source table, mentioned in the <codeph>SELECT</codeph> clause,
+              does not have column statistics. In this case, use the <codeph>/*
+                +NOSHUFFLE */</codeph> hint if you want to override this default
+              behavior.
+            </p>
+          </li>
+
+          <li>
+            If column statistics are available for all partition key columns
+            in the source table mentioned in the <codeph>INSERT ...
+              SELECT</codeph> or <codeph>CTAS</codeph> query, Impala chooses
+            whether to use the <codeph>/* +SHUFFLE */</codeph> or <codeph>/*
+              +NOSHUFFLE */</codeph> technique based on the estimated number of
+            distinct values in those columns and the number of nodes involved in
+            the operation. In this case, you might need the <codeph>/* +SHUFFLE
+              */</codeph> or the <codeph>/* +NOSHUFFLE */</codeph> hint to
+            override the execution plan selected by Impala.
+          </li>
+        </ul>
+      </li>
+
+      <li>
+        <codeph>/* +CLUSTERED */</codeph> and <codeph>/* +NOCLUSTERED
+          */</codeph> Hints
+        <ul>
+          <li>
+            <codeph>/* +CLUSTERED */</codeph> sorts data by the partition
+            columns before inserting to ensure that only one partition is
+            written at a time per node. Use this hint to reduce the number of
+            files kept open and the number of buffers kept in memory
+            simultaneously. This technique is primarily useful for inserts into
+            Parquet tables, where the large block size requires substantial
+            memory to buffer data for multiple output files at once. This hint
+            is available in <keyword keyref="impala28_full"/> or higher.
+
+            <p>
+              Starting in <keyword keyref="impala30_full"/>, <codeph>/*
+                +CLUSTERED */</codeph> is the default behavior for HDFS tables.
+            </p>
+          </li>
+
+          <li>
+            <codeph>/* +NOCLUSTERED */</codeph> does not sort by primary key
+            before insert. This hint is available in <keyword
+              keyref="impala28_full"/> or higher.
+
+            <p>
+              Use this hint when inserting to Kudu tables.
+            </p>
+
+            <p>
+              In the versions lower than <keyword keyref="impala30_full"/>,
+                <codeph>/* +NOCLUSTERED */</codeph> is the default in HDFS
+              tables.
+            </p>
+          </li>
+        </ul>
+      </li>
+    </ul>
+
+    <p>
+      Starting from <keyword keyref="impala29_full"/>, <codeph>INSERT</codeph>
+      or <codeph>UPSERT</codeph> operations into Kudu tables automatically have
+      an exchange and sort node added to the plan that partitions and sorts the
+      rows according to the partitioning/primary key scheme of the target table
+      (unless the number of rows to be inserted is small enough to trigger
+      single node execution). Use the<codeph> /* +NOCLUSTERED */</codeph> and
+        <codeph>/* +NOSHUFFLE */</codeph> hints together to disable partitioning
+      and sorting before the rows are sent to Kudu.
+    </p>
 
     <p rev="IMPALA-2924">
       <b>Hints for scheduling of HDFS blocks:</b>