Posted to commits@impala.apache.org by tm...@apache.org on 2018/10/05 21:39:18 UTC

[6/8] impala git commit: IMPALA-7651: [DOCS] Kudu support to scheduler-related query hints and options

IMPALA-7651: [DOCS] Kudu support to scheduler-related query hints and options

The SCHEDULE_RANDOM_REPLICA query option and the RANDOM_REPLICA hint
support Kudu as well as HDFS.

Change-Id: I481d2a002edc1a18491bf9fc249e868005b42fa5
Reviewed-on: http://gerrit.cloudera.org:8080/11584
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Thomas Marshall <th...@cmu.edu>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/23428dc1
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/23428dc1
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/23428dc1

Branch: refs/heads/master
Commit: 23428dc147fa7ff0b65c4942229e6147bda243d5
Parents: 1914a8b
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Oct 4 15:19:32 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Fri Oct 5 19:36:21 2018 +0000

----------------------------------------------------------------------
 docs/topics/impala_hints.xml                   | 47 +++++++-------
 docs/topics/impala_schedule_random_replica.xml | 72 ++++++++-------------
 2 files changed, 51 insertions(+), 68 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/23428dc1/docs/topics/impala_hints.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml
index 6f853c1..1f2f08f 100644
--- a/docs/topics/impala_hints.xml
+++ b/docs/topics/impala_hints.xml
@@ -359,31 +359,32 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
     <p conref="../shared/impala_common.xml#common/kudu_hints"/>
 
     <p rev="IMPALA-2924">
-      <b>Hints for scheduling of HDFS blocks:</b>
+      <b>Hints for scheduling of scan ranges (HDFS data blocks or Kudu
+        tablets)</b>
     </p>
 
-    <p rev="IMPALA-2924">
-      The hints <codeph>/* +SCHEDULE_CACHE_LOCAL */</codeph>, <codeph>/* +SCHEDULE_DISK_LOCAL
-      */</codeph>, and <codeph>/* +SCHEDULE_REMOTE */</codeph> have the same effect as
-      specifying the <codeph>REPLICA_PREFERENCE</codeph> query option with the respective option
-      settings of <codeph>CACHE_LOCAL</codeph>, <codeph>DISK_LOCAL</codeph>, or
-      <codeph>REMOTE</codeph>. The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the same as
-      enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option.
-    </p>
-
-    <p rev="IMPALA-2924">
-      You can use these hints in combination by separating them with commas, for example,
-      <codeph>/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>. See
-      <xref keyref="replica_preference"/> and <xref keyref="schedule_random_replica"/> for
-      information about how these settings influence the way Impala processes HDFS data blocks.
-    </p>
-
-    <p rev="IMPALA-2924">
-      Specifying the replica preference as a query hint always overrides the query option
-      setting. Specifying either the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option or
-      the corresponding <codeph>RANDOM_REPLICA</codeph> query hint enables the random
-      tie-breaking behavior when processing data blocks during the query.
-    </p>
+    <p rev="IMPALA-2924"> The hints <codeph>/* +SCHEDULE_CACHE_LOCAL
+      */</codeph>, <codeph>/* +SCHEDULE_DISK_LOCAL */</codeph>, and <codeph>/*
+        +SCHEDULE_REMOTE */</codeph> have the same effect as specifying the
+        <codeph>REPLICA_PREFERENCE</codeph> query option with the respective
+      option settings of <codeph>CACHE_LOCAL</codeph>,
+        <codeph>DISK_LOCAL</codeph>, or <codeph>REMOTE</codeph>. </p>
+    <p rev="IMPALA-2924"> Specifying the replica preference as a query hint
+      always overrides the query option setting. </p>
+    <p rev="IMPALA-2924">The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the
+      same as enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query
+      option. </p>
+
+    <p rev="IMPALA-2924"> You can use these hints in combination by separating
+      them with commas, for example, <codeph>/*
+        +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>. See <xref
+        keyref="replica_preference"/> and <xref keyref="schedule_random_replica"
+      /> for information about how these settings influence the way Impala
+      processes HDFS data blocks or Kudu tablets. </p>
+    <p rev="IMPALA-2924">Specifying either the
+        <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option or the
+      corresponding <codeph>RANDOM_REPLICA</codeph> query hint enables the
+      random tie-breaking behavior when processing data blocks during the query. </p>
 
     <p>
       <b>Suggestions versus directives:</b>

http://git-wip-us.apache.org/repos/asf/impala/blob/23428dc1/docs/topics/impala_schedule_random_replica.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_schedule_random_replica.xml b/docs/topics/impala_schedule_random_replica.xml
index ae5978f..f8c50fe 100644
--- a/docs/topics/impala_schedule_random_replica.xml
+++ b/docs/topics/impala_schedule_random_replica.xml
@@ -21,7 +21,13 @@ under the License.
 <concept id="schedule_random_replica" rev="2.5.0">
 
   <title>SCHEDULE_RANDOM_REPLICA Query Option (<keyword keyref="impala25"/> or higher only)</title>
-  <titlealts audience="PDF"><navtitle>SCHEDULE_RANDOM_REPLICA</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>SCHEDULE_RANDOM_REPLICA</navtitle>
+
+  </titlealts>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -34,14 +40,23 @@ under the License.
 
   <conbody>
 
-    <p rev="2.5.0">
-      <indexterm audience="hidden">SCHEDULE_RANDOM_REPLICA query option</indexterm>
+    <p>
+      The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option fine-tunes the scheduling
+      algorithm for deciding which host processes each HDFS data block or Kudu tablet to reduce
+      the chance of CPU hotspots.
+    </p>
+
+    <p>
+      By default, Impala estimates how much work each host has done for the query, and selects
+      the host that has the lowest workload. This algorithm is intended to reduce CPU hotspots
+      arising when the same host is selected to process multiple data blocks / tablets. Use the
+      <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option if hotspots still arise for some
+      combinations of queries and data layout.
     </p>
 
     <p>
-      The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option fine-tunes the algorithm for deciding which host
-      processes each HDFS data block. It only applies to tables and partitions that are not enabled
-      for the HDFS caching feature.
+      The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option only applies to tables and
+      partitions that are not enabled for HDFS caching.
     </p>
 
     <p conref="../shared/impala_common.xml#common/type_boolean"/>
@@ -50,49 +65,16 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/added_in_250"/>
 
-    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
-
-    <p>
-      In the presence of HDFS cached replicas, Impala randomizes
-      which host processes each cached data block.
-      To ensure that HDFS data blocks are cached on more
-      than one host, use the <codeph>WITH REPLICATION</codeph> clause along with
-      the <codeph>CACHED IN</codeph> clause in a
-      <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement.
-      Specify a replication value greater than or equal to the HDFS block replication factor.
-    </p>
-
-    <p>
-      The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option applies to tables and partitions
-      that <i>do not</i> use HDFS caching.
-      By default, Impala estimates how much work each host has done for
-      the query, and selects the host that has the lowest workload.
-      This algorithm is intended to reduce CPU hotspots arising when the
-      same host is selected to process multiple data blocks, but hotspots
-      might still arise for some combinations of queries and data layout.
-      When the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> option is enabled,
-      Impala further randomizes the scheduling algorithm for non-HDFS cached blocks,
-      which can further reduce the chance of CPU hotspots.
-    </p>
-
-    <p rev="IMPALA-2979">
-      This query option works in conjunction with the work scheduling improvements
-      in <keyword keyref="impala25_full"/> and higher. The scheduling improvements
-      distribute the processing for cached HDFS data blocks to minimize hotspots:
-      if a data block is cached on more than one host, Impala chooses which host
-      to process each block based on which host has read the fewest bytes during
-      the current query. Enable <codeph>SCHEDULE_RANDOM_REPLICA</codeph> setting if CPU hotspots
-      still persist because of cases where hosts are <q>tied</q> in terms of
-      the amount of work done; by default, Impala picks the first eligible host
-      in this case.
-    </p>
-
     <p conref="../shared/impala_common.xml#common/related_info"/>
+
     <p>
       <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>,
-      <xref href="impala_scalability.xml#scalability_hotspots"/>
-      , <xref href="impala_replica_preference.xml#replica_preference"/>
+      <xref
+        href="impala_scalability.xml#scalability_hotspots"/> ,
+      <xref
+        href="impala_replica_preference.xml#replica_preference"/>
     </p>
 
   </conbody>
+
 </concept>
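
Illustration (not part of the commit): the documentation text in this diff describes a scheduler that picks, for each HDFS data block or Kudu tablet, the host that has done the least work so far, with the first eligible host winning a tie by default and random tie-breaking when SCHEDULE_RANDOM_REPLICA or the RANDOM_REPLICA hint is enabled. The toy model below is a minimal sketch of that description only. The names pick_host and schedule, and the use of a byte count as the workload estimate, are assumptions made for illustration and do not reflect Impala's actual scheduler code.

```python
import random
from typing import Dict, List, Optional, Tuple


def pick_host(
    replicas: List[str],
    assigned: Dict[str, int],
    random_replica: bool = False,
    rng: Optional[random.Random] = None,
) -> str:
    """Choose the replica host with the least work assigned so far.

    Hypothetical helper: 'assigned' maps host name -> work units (bytes here).
    """
    lowest = min(assigned.get(host, 0) for host in replicas)
    tied = [host for host in replicas if assigned.get(host, 0) == lowest]
    if random_replica and len(tied) > 1:
        # Random tie-breaking, as described for SCHEDULE_RANDOM_REPLICA.
        return (rng or random).choice(tied)
    # Default behavior described in the docs: first eligible host wins a tie.
    return tied[0]


def schedule(
    scan_ranges: List[Tuple[List[str], int]],
    random_replica: bool = False,
    seed: Optional[int] = None,
) -> Tuple[List[str], Dict[str, int]]:
    """Assign each scan range (replica hosts, size) to one host, in order."""
    rng = random.Random(seed)
    assigned: Dict[str, int] = {}
    plan: List[str] = []
    for replicas, size in scan_ranges:
        host = pick_host(replicas, assigned, random_replica, rng)
        assigned[host] = assigned.get(host, 0) + size
        plan.append(host)
    return plan, assigned
```

With two equally sized scan ranges replicated on hosts "a" and "b", the default mode in this sketch always assigns the first range to "a", while the random mode may start with either host. In both modes the second range goes to the other host because its workload is lower.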