Posted to commits@impala.apache.org by tm...@apache.org on 2018/10/05 21:39:18 UTC
[6/8] impala git commit: IMPALA-7651: [DOCS] Kudu support to scheduler-related query hints and options
IMPALA-7651: [DOCS] Kudu support to scheduler-related query hints and options
The SCHEDULE_RANDOM_REPLICA query option and the RANDOM_REPLICA hint
support Kudu as well as HDFS.
Change-Id: I481d2a002edc1a18491bf9fc249e868005b42fa5
Reviewed-on: http://gerrit.cloudera.org:8080/11584
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Thomas Marshall <th...@cmu.edu>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/23428dc1
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/23428dc1
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/23428dc1
Branch: refs/heads/master
Commit: 23428dc147fa7ff0b65c4942229e6147bda243d5
Parents: 1914a8b
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Oct 4 15:19:32 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Fri Oct 5 19:36:21 2018 +0000
----------------------------------------------------------------------
docs/topics/impala_hints.xml | 47 +++++++-------
docs/topics/impala_schedule_random_replica.xml | 72 ++++++++-------------
2 files changed, 51 insertions(+), 68 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/23428dc1/docs/topics/impala_hints.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml
index 6f853c1..1f2f08f 100644
--- a/docs/topics/impala_hints.xml
+++ b/docs/topics/impala_hints.xml
@@ -359,31 +359,32 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
<p conref="../shared/impala_common.xml#common/kudu_hints"/>
<p rev="IMPALA-2924">
- <b>Hints for scheduling of HDFS blocks:</b>
+ <b>Hints for scheduling of scan ranges (HDFS data blocks or Kudu
+ tablets)</b>
</p>
- <p rev="IMPALA-2924">
- The hints <codeph>/* +SCHEDULE_CACHE_LOCAL */</codeph>, <codeph>/* +SCHEDULE_DISK_LOCAL
- */</codeph>, and <codeph>/* +SCHEDULE_REMOTE */</codeph> have the same effect as
- specifying the <codeph>REPLICA_PREFERENCE</codeph> query option with the respective option
- settings of <codeph>CACHE_LOCAL</codeph>, <codeph>DISK_LOCAL</codeph>, or
- <codeph>REMOTE</codeph>. The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the same as
- enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option.
- </p>
-
- <p rev="IMPALA-2924">
- You can use these hints in combination by separating them with commas, for example,
- <codeph>/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>. See
- <xref keyref="replica_preference"/> and <xref keyref="schedule_random_replica"/> for
- information about how these settings influence the way Impala processes HDFS data blocks.
- </p>
-
- <p rev="IMPALA-2924">
- Specifying the replica preference as a query hint always overrides the query option
- setting. Specifying either the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option or
- the corresponding <codeph>RANDOM_REPLICA</codeph> query hint enables the random
- tie-breaking behavior when processing data blocks during the query.
- </p>
+ <p rev="IMPALA-2924"> The hints <codeph>/* +SCHEDULE_CACHE_LOCAL
+ */</codeph>, <codeph>/* +SCHEDULE_DISK_LOCAL */</codeph>, and <codeph>/*
+ +SCHEDULE_REMOTE */</codeph> have the same effect as specifying the
+ <codeph>REPLICA_PREFERENCE</codeph> query option with the respective
+ option settings of <codeph>CACHE_LOCAL</codeph>,
+ <codeph>DISK_LOCAL</codeph>, or <codeph>REMOTE</codeph>. </p>
+ <p rev="IMPALA-2924"> Specifying the replica preference as a query hint
+ always overrides the query option setting. </p>
+ <p rev="IMPALA-2924">The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the
+ same as enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query
+ option. </p>
+
+ <p rev="IMPALA-2924"> You can use these hints in combination by separating
+ them with commas, for example, <codeph>/*
+ +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>. See <xref
+ keyref="replica_preference"/> and <xref keyref="schedule_random_replica"
+ /> for information about how these settings influence the way Impala
+ processes HDFS data blocks or Kudu tablets. </p>
+ <p rev="IMPALA-2924">Specifying either the
+ <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option or the
+ corresponding <codeph>RANDOM_REPLICA</codeph> query hint enables the
+ random tie-breaking behavior when processing data blocks during the query. </p>
<p>
<b>Suggestions versus directives:</b>
http://git-wip-us.apache.org/repos/asf/impala/blob/23428dc1/docs/topics/impala_schedule_random_replica.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_schedule_random_replica.xml b/docs/topics/impala_schedule_random_replica.xml
index ae5978f..f8c50fe 100644
--- a/docs/topics/impala_schedule_random_replica.xml
+++ b/docs/topics/impala_schedule_random_replica.xml
@@ -21,7 +21,13 @@ under the License.
<concept id="schedule_random_replica" rev="2.5.0">
<title>SCHEDULE_RANDOM_REPLICA Query Option (<keyword keyref="impala25"/> or higher only)</title>
- <titlealts audience="PDF"><navtitle>SCHEDULE_RANDOM_REPLICA</navtitle></titlealts>
+
+ <titlealts audience="PDF">
+
+ <navtitle>SCHEDULE_RANDOM_REPLICA</navtitle>
+
+ </titlealts>
+
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -34,14 +40,23 @@ under the License.
<conbody>
- <p rev="2.5.0">
- <indexterm audience="hidden">SCHEDULE_RANDOM_REPLICA query option</indexterm>
+ <p>
+ The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option fine-tunes the scheduling
+ algorithm for deciding which host processes each HDFS data block or Kudu tablet to reduce
+ the chance of CPU hotspots.
+ </p>
+
+ <p>
+ By default, Impala estimates how much work each host has done for the query, and selects
+ the host that has the lowest workload. This algorithm is intended to reduce CPU hotspots
+ arising when the same host is selected to process multiple data blocks / tablets. Use the
+ <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option if hotspots still arise for some
+ combinations of queries and data layout.
</p>
<p>
- The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option fine-tunes the algorithm for deciding which host
- processes each HDFS data block. It only applies to tables and partitions that are not enabled
- for the HDFS caching feature.
+ The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option only applies to tables and
+ partitions that are not enabled for the HDFS caching.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
@@ -50,49 +65,16 @@ under the License.
<p conref="../shared/impala_common.xml#common/added_in_250"/>
- <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
-
- <p>
- In the presence of HDFS cached replicas, Impala randomizes
- which host processes each cached data block.
- To ensure that HDFS data blocks are cached on more
- than one host, use the <codeph>WITH REPLICATION</codeph> clause along with
- the <codeph>CACHED IN</codeph> clause in a
- <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement.
- Specify a replication value greater than or equal to the HDFS block replication factor.
- </p>
-
- <p>
- The <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option applies to tables and partitions
- that <i>do not</i> use HDFS caching.
- By default, Impala estimates how much work each host has done for
- the query, and selects the host that has the lowest workload.
- This algorithm is intended to reduce CPU hotspots arising when the
- same host is selected to process multiple data blocks, but hotspots
- might still arise for some combinations of queries and data layout.
- When the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> option is enabled,
- Impala further randomizes the scheduling algorithm for non-HDFS cached blocks,
- which can further reduce the chance of CPU hotspots.
- </p>
-
- <p rev="IMPALA-2979">
- This query option works in conjunction with the work scheduling improvements
- in <keyword keyref="impala25_full"/> and higher. The scheduling improvements
- distribute the processing for cached HDFS data blocks to minimize hotspots:
- if a data block is cached on more than one host, Impala chooses which host
- to process each block based on which host has read the fewest bytes during
- the current query. Enable <codeph>SCHEDULE_RANDOM_REPLICA</codeph> setting if CPU hotspots
- still persist because of cases where hosts are <q>tied</q> in terms of
- the amount of work done; by default, Impala picks the first eligible host
- in this case.
- </p>
-
<p conref="../shared/impala_common.xml#common/related_info"/>
+
<p>
<xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>,
- <xref href="impala_scalability.xml#scalability_hotspots"/>
- , <xref href="impala_replica_preference.xml#replica_preference"/>
+ <xref
+ href="impala_scalability.xml#scalability_hotspots"/> ,
+ <xref
+ href="impala_replica_preference.xml#replica_preference"/>
</p>
</conbody>
+
</concept>