Posted to commits@impala.apache.org by ar...@apache.org on 2019/02/09 01:49:04 UTC

[impala] branch master updated (3ce34a8 -> 022ba2b)

This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git.


    from 3ce34a8  IMPALA-8071: Initial unified backend test framework
     new 13be8cd  IMPALA-8170: [DOCS] Added a section on load balancing proxy with TLS
     new bf96eb3  IMPALA-8154: Disable Kerberos auth_to_local setting
     new 29cdaf0  IMPALA-8098: [DOCS] Port change for the SHUTDOWN command
     new 5f59eef  [DOCS] Re-formatted impala_ldap.xml
     new 022ba2b  IMPALA-8105: [DOCS] Document cache_remote_file_handles flag

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 be/src/rpc/authentication.cc                |   5 +
 docs/impala_keydefs.ditamap                 |   2 +
 docs/topics/impala_incompatible_changes.xml |   9 +
 docs/topics/impala_ldap.xml                 | 443 +++++++++-----
 docs/topics/impala_proxy.xml                | 444 ++++++++------
 docs/topics/impala_scalability.xml          | 880 ++++++++++++++++------------
 docs/topics/impala_shutdown.xml             |  17 +-
 7 files changed, 1090 insertions(+), 710 deletions(-)


[impala] 05/05: IMPALA-8105: [DOCS] Document cache_remote_file_handles flag

Posted by ar...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 022ba2bbdb6feec0ba944847e92d5a051cedc9a1
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Mon Feb 4 17:41:23 2019 -0800

    IMPALA-8105: [DOCS] Document cache_remote_file_handles flag
    
    Change-Id: Id649e733324f55a80a0199302dfa3b627ad183cf
    Reviewed-on: http://gerrit.cloudera.org:8080/12362
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Joe McDonnell <jo...@cloudera.com>
---
 docs/topics/impala_scalability.xml | 880 +++++++++++++++++++++----------------
 1 file changed, 498 insertions(+), 382 deletions(-)

diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml
index 94c2fdb..71b8425 100644
--- a/docs/topics/impala_scalability.xml
+++ b/docs/topics/impala_scalability.xml
@@ -1,4 +1,5 @@
-<?xml version="1.0" encoding="UTF-8"?><!--
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
@@ -20,7 +21,13 @@ under the License.
 <concept id="scalability">
 
   <title>Scalability Considerations for Impala</title>
-  <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>Scalability Considerations</navtitle>
+
+  </titlealts>
+
   <prolog>
     <metadata>
       <data name="Category" value="Performance"/>
@@ -30,7 +37,7 @@ under the License.
       <data name="Category" value="Developers"/>
       <data name="Category" value="Memory"/>
       <data name="Category" value="Scalability"/>
-      <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. -->
+<!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. -->
       <data name="Category" value="Proof of Concept"/>
     </metadata>
   </prolog>
@@ -38,10 +45,11 @@ under the License.
   <conbody>
 
     <p>
-      This section explains how the size of your cluster and the volume of data influences SQL performance and
-      schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory
-      limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of
-      scalability issues, such as a single slow node that causes performance problems for queries.
+      This section explains how the size of your cluster and the volume of data influences SQL
+      performance and schema design for Impala tables. Typically, adding more cluster capacity
+      reduces problems due to memory limits or disk throughput. On the other hand, larger
+      clusters are more likely to have other kinds of scalability issues, such as a single slow
+      node that causes performance problems for queries.
     </p>
 
     <p outputclass="toc inpage"/>
@@ -53,14 +61,15 @@ under the License.
   <concept audience="hidden" id="scalability_memory">
 
     <title>Overview and Guidelines for Impala Memory Usage</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Memory"/>
-      <data name="Category" value="Concepts"/>
-      <data name="Category" value="Best Practices"/>
-      <data name="Category" value="Guidelines"/>
-    </metadata>
-  </prolog>
+
+    <prolog>
+      <metadata>
+        <data name="Category" value="Memory"/>
+        <data name="Category" value="Concepts"/>
+        <data name="Category" value="Best Practices"/>
+        <data name="Category" value="Guidelines"/>
+      </metadata>
+    </prolog>
 
     <conbody>
 
@@ -158,7 +167,9 @@ Memory Usage: Additional Notes
 *  Review previous common issues on out-of-memory
 *  Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently
 </codeblock>
+
     </conbody>
+
   </concept>
 
   <concept id="scalability_catalog">
@@ -173,24 +184,25 @@ Memory Usage: Additional Notes
       </p>
 
       <p>
-        Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables
-        containing relatively few, large data files. Schemas containing thousands of tables, or tables containing
-        thousands of partitions, can encounter performance issues during startup or during DDL operations such as
-        <codeph>ALTER TABLE</codeph> statements.
+        Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized
+        for tables containing relatively few, large data files. Schemas containing thousands of
+        tables, or tables containing thousands of partitions, can encounter performance issues
+        during startup or during DDL operations such as <codeph>ALTER TABLE</codeph> statements.
       </p>
 
       <note type="important" rev="TSB-168">
-      <p>
-        Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in
-        <keyword keyref="impala25_full"/> and higher, the following procedure to increase the <cmdname>catalogd</cmdname>
-        memory limit might be required following an upgrade to <keyword keyref="impala25_full"/> even if not
-        needed previously.
-      </p>
+        <p>
+          Because of a change in the default heap size for the <cmdname>catalogd</cmdname>
+          daemon in <keyword keyref="impala25_full"/> and higher, the following procedure to
+          increase the <cmdname>catalogd</cmdname> memory limit might be required following an
+          upgrade to <keyword keyref="impala25_full"/> even if not needed previously.
+        </p>
       </note>
 
       <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/>
 
     </conbody>
+
   </concept>
 
   <concept rev="2.1.0" id="statestore_scalability">
@@ -200,30 +212,33 @@ Memory Usage: Additional Notes
     <conbody>
 
       <p>
-        Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message to its subscribers. This message contained all
-        updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the
-        statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a
+        Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message
+        to its subscribers. This message contained all updates for any topics that a subscriber
+        had subscribed to. It also served to let subscribers know that the statestore had not
+        failed, and conversely the statestore used the success of sending a heartbeat to a
         subscriber to decide whether or not the subscriber had failed.
       </p>
 
       <p>
-        Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large
-        numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata
-        updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time
-        out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of
-        statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or
-        restarted.
+        Combining topic updates and failure detection in a single message led to bottlenecks in
+        clusters with large numbers of tables, partitions, and HDFS data blocks. When the
+        statestore was overloaded with metadata updates to transmit, heartbeat messages were
+        sent less frequently, sometimes causing subscribers to time out their connection with
+        the statestore. Increasing the subscriber timeout and decreasing the frequency of
+        statestore heartbeats worked around the problem, but reduced responsiveness when the
+        statestore failed or restarted.
       </p>
 
       <p>
-        As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and heartbeats in separate messages. This allows the
-        statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to
-        send topic updates according to a fixed schedule, reducing statestore network overhead.
+        As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and
+        heartbeats in separate messages. This allows the statestore to send and receive a steady
+        stream of lightweight heartbeats, and removes the requirement to send topic updates
+        according to a fixed schedule, reducing statestore network overhead.
       </p>
 
       <p>
-        The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname>
-        daemon:
+        The statestore now has the following relevant configuration flags for the
+        <cmdname>statestored</cmdname> daemon:
       </p>
 
       <dl>
@@ -234,8 +249,8 @@ Memory Usage: Additional Notes
           </dt>
 
           <dd>
-            The number of threads inside the statestore dedicated to sending topic updates. You should not
-            typically need to change this value.
+            The number of threads inside the statestore dedicated to sending topic updates. You
+            should not typically need to change this value.
             <p>
               <b>Default:</b> 10
             </p>
@@ -250,9 +265,10 @@ Memory Usage: Additional Notes
           </dt>
 
           <dd>
-            The frequency, in milliseconds, with which the statestore tries to send topic updates to each
-            subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends
-            topic updates as fast as it can. You should not typically need to change this value.
+            The frequency, in milliseconds, with which the statestore tries to send topic
+            updates to each subscriber. This is a best-effort value; if the statestore is unable
+            to meet this frequency, it sends topic updates as fast as it can. You should not
+            typically need to change this value.
             <p>
               <b>Default:</b> 2000
             </p>
@@ -267,8 +283,8 @@ Memory Usage: Additional Notes
           </dt>
 
           <dd>
-            The number of threads inside the statestore dedicated to sending heartbeats. You should not typically
-            need to change this value.
+            The number of threads inside the statestore dedicated to sending heartbeats. You
+            should not typically need to change this value.
             <p>
               <b>Default:</b> 10
             </p>
@@ -283,9 +299,10 @@ Memory Usage: Additional Notes
           </dt>
 
           <dd>
-            The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber.
-            This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that,
-            you might need to increase this value to make the interval longer between heartbeat messages.
+            The frequency, in milliseconds, with which the statestore tries to send heartbeats
+            to each subscriber. This value should be good for large catalogs and clusters up to
+            approximately 150 nodes. Beyond that, you might need to increase this value to make
+            the interval longer between heartbeat messages.
             <p>
               <b>Default:</b> 1000 (one heartbeat message every second)
             </p>
@@ -295,46 +312,54 @@ Memory Usage: Additional Notes
       </dl>
 
       <p>
-        If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently
-        displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be
-        taking too long to send the entire catalog topic to the cluster. In this case, consider adding
-        <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting
-        stops the statestore from loading the entire catalog into memory at cluster startup. Instead, metadata for
-        each table is loaded when the table is accessed for the first time.
+        If it takes a very long time for a cluster to start up, and
+        <cmdname>impala-shell</cmdname> consistently displays <codeph>This Impala daemon is not
+        ready to accept user requests</codeph>, the statestore might be taking too long to send
+        the entire catalog topic to the cluster. In this case, consider adding
+        <codeph>--load_catalog_in_background=false</codeph> to your catalog service
+        configuration. This setting stops the statestore from loading the entire catalog into
+        memory at cluster startup. Instead, metadata for each table is loaded when the table is
+        accessed for the first time.
       </p>
+
     </conbody>
+
   </concept>
 
   <concept id="scalability_buffer_pool" rev="2.10.0 IMPALA-3200">
+
     <title>Effect of Buffer Pool on Memory Usage (<keyword keyref="impala210"/> and higher)</title>
+
     <conbody>
+
       <p>
-        The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes the
-        way Impala allocates memory during a query. Most of the memory needed is reserved at the
-        beginning of the query, avoiding cases where a query might run for a long time before failing
-        with an out-of-memory error. The actual memory estimates and memory buffers are typically
-        smaller than before, so that more queries can run concurrently or process larger volumes
-        of data than previously.
+        The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes
+        the way Impala allocates memory during a query. Most of the memory needed is reserved at
+        the beginning of the query, avoiding cases where a query might run for a long time
+        before failing with an out-of-memory error. The actual memory estimates and memory
+        buffers are typically smaller than before, so that more queries can run concurrently or
+        process larger volumes of data than previously.
       </p>
+
       <p>
         The buffer pool feature includes some query options that you can fine-tune:
-        <xref keyref="buffer_pool_limit"/>,
-        <xref keyref="default_spillable_buffer_size"/>,
-        <xref keyref="max_row_size"/>, and
-        <xref keyref="min_spillable_buffer_size"/>.
-      </p>
-      <p>
-        Most of the effects of the buffer pool are transparent to you as an Impala user.
-        Memory use during spilling is now steadier and more predictable, instead of
-        increasing rapidly as more data is spilled to disk. The main change from a user
-        perspective is the need to increase the <codeph>MAX_ROW_SIZE</codeph> query option
-        setting when querying tables with columns containing long strings, many columns,
-        or other combinations of factors that produce very large rows. If Impala encounters
-        rows that are too large to process with the default query option settings, the query
-        fails with an error message suggesting to increase the <codeph>MAX_ROW_SIZE</codeph>
-        setting.
+        <xref keyref="buffer_pool_limit"/>, <xref keyref="default_spillable_buffer_size"/>,
+        <xref keyref="max_row_size"/>, and <xref keyref="min_spillable_buffer_size"/>.
       </p>
+
+      <p>
+        Most of the effects of the buffer pool are transparent to you as an Impala user. Memory
+        use during spilling is now steadier and more predictable, instead of increasing rapidly
+        as more data is spilled to disk. The main change from a user perspective is the need to
+        increase the <codeph>MAX_ROW_SIZE</codeph> query option setting when querying tables
+        with columns containing long strings, many columns, or other combinations of factors
+        that produce very large rows. If Impala encounters rows that are too large to process
+        with the default query option settings, the query fails with an error message suggesting
+        to increase the <codeph>MAX_ROW_SIZE</codeph> setting.
+      </p>
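
      A minimal impala-shell sketch of the MAX_ROW_SIZE adjustment described above; the table
      name and the 4 MB value are assumptions chosen only for illustration:

          -- Hypothetical table with very long string columns; raise the per-row limit first.
          SET MAX_ROW_SIZE=4mb;
          SELECT * FROM wide_strings_table ORDER BY id LIMIT 10;
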
+
     </conbody>
+
   </concept>
 
   <concept audience="hidden" id="scalability_cluster_size">
@@ -343,9 +368,10 @@ Memory Usage: Additional Notes
 
     <conbody>
 
-      <p>
-      </p>
+      <p></p>
+
     </conbody>
+
   </concept>
 
   <concept audience="hidden" id="concurrent_connections">
@@ -355,7 +381,9 @@ Memory Usage: Additional Notes
     <conbody>
 
       <p></p>
+
     </conbody>
+
   </concept>
 
   <concept rev="2.0.0" id="spill_to_disk">
@@ -365,22 +393,26 @@ Memory Usage: Additional Notes
     <conbody>
 
       <p>
-        Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk)
-        when Impala is close to exceeding its memory limit on a particular host.
+        Certain memory-intensive operations write temporary data to disk (known as
+        <term>spilling</term> to disk) when Impala is close to exceeding its memory limit on a
+        particular host.
       </p>
 
       <p>
-        The result is a query that completes successfully, rather than failing with an out-of-memory error. The
-        tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back
-        in. The slowdown could be potentially be significant. Thus, while this feature improves reliability,
-        you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
+        The result is a query that completes successfully, rather than failing with an
+        out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to
+        write the temporary data and read it back in. The slowdown could potentially be
+        significant. Thus, while this feature improves reliability, you should optimize your
+        queries, system parameters, and hardware configuration to make this spilling a rare
+        occurrence.
       </p>
 
       <note rev="2.10.0 IMPALA-3200">
         <p>
-          In <keyword keyref="impala210"/> and higher, also see <xref keyref="scalability_buffer_pool"/> for
-          changes to Impala memory allocation that might change the details of which queries spill to disk,
-          and how much memory and disk space is involved in the spilling operation.
+          In <keyword keyref="impala210"/> and higher, also see
+          <xref keyref="scalability_buffer_pool"/> for changes to Impala memory allocation that
+          might change the details of which queries spill to disk, and how much memory and disk
+          space is involved in the spilling operation.
         </p>
       </note>
 
@@ -389,38 +421,42 @@ Memory Usage: Additional Notes
       </p>
 
       <p>
-        Several SQL clauses and constructs require memory allocations that could activat the spilling mechanism:
+        Several SQL clauses and constructs require memory allocations that could activate the
+        spilling mechanism:
       </p>
+
       <ul>
         <li>
           <p>
-            when a query uses a <codeph>GROUP BY</codeph> clause for columns
-            with millions or billions of distinct values, Impala keeps a
-            similar number of temporary results in memory, to accumulate the
-            aggregate results for each value in the group.
+            When a query uses a <codeph>GROUP BY</codeph> clause for columns with millions or
+            billions of distinct values, Impala keeps a similar number of temporary results in
+            memory, to accumulate the aggregate results for each value in the group.
           </p>
         </li>
+
         <li>
           <p>
-            When large tables are joined together, Impala keeps the values of
-            the join columns from one table in memory, to compare them to
-            incoming values from the other table.
+            When large tables are joined together, Impala keeps the values of the join columns
+            from one table in memory, to compare them to incoming values from the other table.
           </p>
         </li>
+
         <li>
           <p>
-            When a large result set is sorted by the <codeph>ORDER BY</codeph>
-            clause, each node sorts its portion of the result set in memory.
+            When a large result set is sorted by the <codeph>ORDER BY</codeph> clause, each node
+            sorts its portion of the result set in memory.
           </p>
         </li>
+
         <li>
           <p>
-            The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators
-            build in-memory data structures to represent all values found so
-            far, to eliminate duplicates as the query progresses.
+            The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators build in-memory
+            data structures to represent all values found so far, to eliminate duplicates as the
+            query progresses.
           </p>
         </li>
-        <!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
+
+<!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
         <li>
           <p rev="IMPALA-3471">
             In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with
@@ -445,30 +481,32 @@ Memory Usage: Additional Notes
       </p>
 
       <p rev="2.10.0 IMPALA-3200">
-        In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as <codeph>GROUP BY</codeph>,
-        <codeph>DISTINCT</codeph>, and joins, transition between using additional memory or activating the
-        spill-to-disk feature is changed. The memory required to spill to disk is reserved up front, and you can
-        examine it in the <codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is
+        In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as
+        <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, transition between
+        using additional memory or activating the spill-to-disk feature is changed. The memory
+        required to spill to disk is reserved up front, and you can examine it in the
+        <codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is
         set to 2 or higher.
       </p>
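
      A short sketch of inspecting that up-front reservation (the table and column names are
      assumed):

          -- EXPLAIN_LEVEL=2 includes per-operator memory reservation estimates in the plan.
          SET EXPLAIN_LEVEL=2;
          EXPLAIN SELECT c1, COUNT(*) FROM big_table GROUP BY c1;
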
 
-     <p>
-       The infrastructure of the spilling feature affects the way the affected SQL operators, such as
-       <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory.
-       On each host that participates in the query, each such operator in a query requires memory
-       to store rows of data and other data structures. Impala reserves a certain amount of memory
-       up front for each operator that supports spill-to-disk that is sufficient to execute the
-       operator. If an operator accumulates more data than can fit in the reserved memory, it
-       can either reserve more memory to continue processing data in memory or start spilling
-       data to temporary scratch files on disk. Thus, operators with spill-to-disk support
-       can adapt to different memory constraints by using however much memory is available
-       to speed up execution, yet tolerate low memory conditions by spilling data to disk.
-     </p>
-     
-     <p>
-       The amount data depends on the portion of the data being handled by that host, and thus
-       the operator may end up consuming different amounts of memory on different hosts.
-     </p>
+      <p>
+        The infrastructure of the spilling feature affects the way the affected SQL operators,
+        such as <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. On
+        each host that participates in the query, each such operator in a query requires memory
+        to store rows of data and other data structures. Impala reserves a certain amount of
+        memory up front for each operator that supports spill-to-disk that is sufficient to
+        execute the operator. If an operator accumulates more data than can fit in the reserved
+        memory, it can either reserve more memory to continue processing data in memory or start
+        spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk
+        support can adapt to different memory constraints by using however much memory is
+        available to speed up execution, yet tolerate low memory conditions by spilling data to
+        disk.
+      </p>
+
+      <p>
+        The amount of data depends on the portion of the data being handled by that host, and
+        thus the operator may end up consuming different amounts of memory on different hosts.
+      </p>
 
 <!--
       <p>
@@ -505,12 +543,13 @@ Memory Usage: Additional Notes
 -->
 
       <p>
-        <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4.
-        This feature was extended to cover join queries, aggregation functions, and analytic
-        functions in Impala 2.0. The size of the memory work area required by
-        each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
-        <ph rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take advantage of the
-        Impala buffer pool feature and be more predictable and stable in <keyword keyref="impala210_full"/>.</ph>
+        <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in
+        Impala 1.4. This feature was extended to cover join queries, aggregation functions, and
+        analytic functions in Impala 2.0. The size of the memory work area required by each
+        operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
+        <ph rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take advantage of
+        the Impala buffer pool feature and be more predictable and stable in
+        <keyword keyref="impala210_full"/>.</ph>
       </p>
 
       <p>
@@ -518,28 +557,30 @@ Memory Usage: Additional Notes
       </p>
 
       <p>
-        Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid
-        this situation by using the following steps:
+        Because the extra I/O can impose significant performance overhead on these types of
+        queries, try to avoid this situation by using the following steps:
       </p>
 
       <ol>
         <li>
-          Detect how often queries spill to disk, and how much temporary data is written. Refer to the following
-          sources:
+          Detect how often queries spill to disk, and how much temporary data is written. Refer
+          to the following sources:
           <ul>
             <li>
-              The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname>
-              interpreter. This data shows the memory usage for each host and in total across the cluster. The
-              <codeph>WriteIoBytes</codeph> counter reports how much data was written to disk for each operator
-              during the query. (In <keyword keyref="impala29_full"/>, the counter was named
-              <codeph>ScratchBytesWritten</codeph>; in <keyword keyref="impala28_full"/> and earlier, it was named
-              <codeph>BytesWritten</codeph>.)
+              The output of the <codeph>PROFILE</codeph> command in the
+              <cmdname>impala-shell</cmdname> interpreter. This data shows the memory usage for
+              each host and in total across the cluster. The <codeph>WriteIoBytes</codeph>
+              counter reports how much data was written to disk for each operator during the
+              query. (In <keyword keyref="impala29_full"/>, the counter was named
+              <codeph>ScratchBytesWritten</codeph>; in <keyword keyref="impala28_full"/> and
+              earlier, it was named <codeph>BytesWritten</codeph>.)
             </li>
 
             <li>
-              The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to
-              examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the
-              memory usage for a single host within the cluster, the host whose web interface you are connected to.
+              The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface.
+              Select the query to examine and click the corresponding
+              <uicontrol>Profile</uicontrol> link. This data breaks down the memory usage for a
+              single host within the cluster, the host whose web interface you are connected to.
             </li>
           </ul>
         </li>
@@ -548,15 +589,16 @@ Memory Usage: Additional Notes
           Use one or more techniques to reduce the possibility of the queries spilling to disk:
           <ul>
             <li>
-              Increase the Impala memory limit if practical, for example, if you can increase the available memory
-              by more than the amount of temporary data written to disk on a particular node. Remember that in
-              Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you
-              fine-tune the memory usage for queries from JDBC and ODBC applications.
+              Increase the Impala memory limit if practical, for example, if you can increase
+              the available memory by more than the amount of temporary data written to disk on
+              a particular node. Remember that in Impala 2.0 and later, you can issue
+              <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you fine-tune the
+              memory usage for queries from JDBC and ODBC applications.
             </li>
 
             <li>
-              Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and
-              reduce the amount of memory required on each node.
+              Increase the number of nodes in the cluster, to increase the aggregate memory
+              available to Impala and reduce the amount of memory required on each node.
             </li>
 
             <li>
@@ -564,54 +606,57 @@ Memory Usage: Additional Notes
             </li>
 
             <li>
-              On a cluster with resources shared between Impala and other Hadoop components, use resource
-              management features to allocate more memory for Impala. See
+              On a cluster with resources shared between Impala and other Hadoop components, use
+              resource management features to allocate more memory for Impala. See
               <xref href="impala_resource_management.xml#resource_management"/> for details.
             </li>
 
             <li>
-              If the memory pressure is due to running many concurrent queries rather than a few memory-intensive
-              ones, consider using the Impala admission control feature to lower the limit on the number of
-              concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in
-              memory usage and improve overall response times. See
-              <xref href="impala_admission.xml#admission_control"/> for details.
+              If the memory pressure is due to running many concurrent queries rather than a few
+              memory-intensive ones, consider using the Impala admission control feature to
+              lower the limit on the number of concurrent queries. By spacing out the most
+              resource-intensive queries, you can avoid spikes in memory usage and improve
+              overall response times. See <xref href="impala_admission.xml#admission_control"/>
+              for details.
             </li>
 
             <li>
-              Tune the queries with the highest memory requirements, using one or more of the following techniques:
+              Tune the queries with the highest memory requirements, using one or more of the
+              following techniques:
               <ul>
                 <li>
-                  Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and
-                  aggregation queries.
+                  Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in
+                  large-scale joins and aggregation queries.
                 </li>
 
                 <li>
-                  Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values
-                  instead.
+                  Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer
+                  numeric values instead.
                 </li>
 
                 <li>
-                  Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the
-                  most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for
-                  details.
+                  Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy
+                  being used for the most resource-intensive queries. See
+                  <xref href="impala_explain_plan.xml#perf_explain"/> for details.
                 </li>
 
                 <li>
-                  If Impala still chooses a suboptimal execution strategy even with statistics available, or if it
-                  is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints
-                  to the most resource-intensive queries to select the right execution strategy. See
+                  If Impala still chooses a suboptimal execution strategy even with statistics
+                  available, or if it is impractical to keep the statistics up to date for huge
+                  or rapidly changing tables, add hints to the most resource-intensive queries
+                  to select the right execution strategy. See
                   <xref href="impala_hints.xml#hints"/> for details.
                 </li>
               </ul>
             </li>
 
             <li>
-              If your queries experience substantial performance overhead due to spilling, enable the
-              <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage
-              is likely to be exorbitant from spilling to disk. See
-              <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune
-              problematic queries using the preceding steps, fewer and fewer will be cancelled by this option
-              setting.
+              If your queries experience substantial performance overhead due to spilling,
+              enable the <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option
+              prevents queries whose memory usage is likely to be exorbitant from spilling to
+              disk. See <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/>
+              for details. As you tune problematic queries using the preceding steps, fewer and
+              fewer will be cancelled by this option setting.
             </li>
           </ul>
         </li>
@@ -622,22 +667,24 @@ Memory Usage: Additional Notes
       </p>
 
       <p>
-        To artificially provoke spilling, to test this feature and understand the performance implications, use a
-        test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no
-        arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query
-        option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent
-        runaway disk usage from queries that are known in advance to be suboptimal. Within
-        <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria
-        explained earlier. A self-join of a large table is a good candidate:
+        To artificially provoke spilling, to test this feature and understand the performance
+        implications, use a test environment with a memory limit of at least 2 GB. Issue the
+        <codeph>SET</codeph> command with no arguments to check the current setting for the
+        <codeph>MEM_LIMIT</codeph> query option. Set the query option
+        <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk
+        feature to prevent runaway disk usage from queries that are known in advance to be
+        suboptimal. Within <cmdname>impala-shell</cmdname>, run a query that you expect to be
+        memory-intensive, based on the criteria explained earlier. A self-join of a large table
+        is a good candidate:
       </p>
 
 <codeblock>select count(*) from big_table a join big_table b using (column_with_many_values);
 </codeblock>
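
      Taken together, the steps above might look like the following in impala-shell, reusing
      the assumed big_table example; the exact commands are a sketch, not part of the procedure
      text itself:

          SET;                              -- check the current MEM_LIMIT and other query options
          SET DISABLE_UNSAFE_SPILLS=true;   -- limit the spill-to-disk feature as described above
          SELECT COUNT(*) FROM big_table a JOIN big_table b USING (column_with_many_values);
          PROFILE;                          -- per-host memory breakdown, discussed next
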
 
       <p>
-        Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node
-        during the query.
-        <!--
+        Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory
+        usage on each node during the query.
+<!--
         The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph>
         portion. For example, this profile shows that the query did not quite exceed the memory limit.
         -->
@@ -668,8 +715,8 @@ Memory Usage: Additional Notes
 -->
 
       <p>
-        Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage
-        reported in the profile output. Now try the memory-intensive query again.
+        Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak
+        memory usage reported in the profile output. Now try the memory-intensive query again.
       </p>
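
      For instance, if the profile reported a peak of roughly 1.5 GB per host (an invented
      figure), the limit could be set just below it before rerunning the same query:

          SET MEM_LIMIT=1gb;   -- placeholder value, chosen to be below the observed peak
          SELECT COUNT(*) FROM big_table a JOIN big_table b USING (column_with_many_values);
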
 
       <p>
@@ -682,46 +729,50 @@ these tables, hint the plan or disable this behavior via query options to enable
 </codeblock>
 
       <p>
-        If so, the query could have consumed substantial temporary disk space, slowing down so much that it would
-        not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the
-        <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query
-        again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory
-        limit again if necessary to be lower than the peak memory usage.
+        If so, the query could have consumed substantial temporary disk space, slowing down so
+        much that it would not complete in any reasonable time. Rather than rely on the
+        spill-to-disk feature in this case, issue the <codeph>COMPUTE STATS</codeph> statement
+        for the table or tables in your sample query. Then run the query again, check the peak
+        memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory limit
+        again if necessary to be lower than the peak memory usage.
       </p>
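
      A minimal sketch, keeping the assumed table name from the earlier self-join example:

          COMPUTE STATS big_table;   -- gather table and column statistics
          -- then rerun the join and compare peak memory usage in the PROFILE output
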
 
       <p>
-        At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that
-        the memory usage is not exorbitant. You have set an artificial constraint through the
-        <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But
-        the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some
-        extra disk I/O to read and write temporary work data.
+        At this point, you have a query that is memory-intensive, but Impala can optimize it
+        efficiently so that the memory usage is not exorbitant. You have set an artificial
+        constraint through the <codeph>MEM_LIMIT</codeph> option so that the query would
+        normally fail with an out-of-memory error. But the automatic spill-to-disk feature means
+        that the query should actually succeed, at the expense of some extra disk I/O to read
+        and write temporary work data.
       </p>
 
       <p>
-        Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This
-        time, look for lines of this form:
+        Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph>
+        output again. This time, look for lines of this form:
       </p>
 
 <codeblock>- SpilledPartitions: <varname>N</varname>
 </codeblock>
 
       <p>
-        If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have
-        failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine
-        the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero
-        <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for
-        example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This
-        gives you an idea of the performance penalty of the spill operation for a particular query with a
-        particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the
-        query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the
+        If you see any such lines with <varname>N</varname> greater than 0, that indicates the
+        query would have failed in Impala releases prior to 2.0, but now it succeeded because of
+        the spill-to-disk feature. Examine the total time taken by the
+        <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero
+        <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that
+        did not spill, for example in the <codeph>PROFILE</codeph> output when the same query is
+        run with a higher memory limit. This gives you an idea of the performance penalty of the
+        spill operation for a particular query with a particular memory limit. If you make the
+        memory limit just a little lower than the peak memory usage, the query only needs to
+        write a small amount of temporary data to disk. The lower you set the memory limit, the
         more temporary data is written and the slower the query becomes.
       </p>
 
       <p>
         Now repeat this procedure for actual queries used in your environment. Use the
-        <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than
-        necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE
-        STATS</codeph> where necessary.
+        <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more
+        memory than necessary due to lack of statistics on the relevant tables and columns, and
+        issue <codeph>COMPUTE STATS</codeph> where necessary.
       </p>
 
       <p>
@@ -729,242 +780,307 @@ these tables, hint the plan or disable this behavior via query options to enable
       </p>
 
       <p>
-        You might wonder, why not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time. Whether and
-        how frequently to use this option depends on your system environment and workload.
+        You might wonder why you would not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned
+        on all the time. Whether and how frequently to use this option depends on your system
+        environment and workload.
       </p>
 
       <p>
-        <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance
-        characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries
-        that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while
-        developing new SQL code, even though it is turned off for existing applications.
+        <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc
+        queries whose performance characteristics and memory usage are not known in advance. It
+        prevents <q>worst-case scenario</q> queries that use large amounts of memory
+        unnecessarily. Thus, you might turn this option on within a session while developing new
+        SQL code, even though it is turned off for existing applications.
       </p>
 
       <p>
-        Organizations where table and column statistics are generally up-to-date might leave this option turned on
-        all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline
-        results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail
-        fast</q> in this case and immediately gather statistics or tune the problematic queries.
+        Organizations where table and column statistics are generally up-to-date might leave
+        this option turned on all the time, again to avoid worst-case scenarios for untested
+        queries or if a problem in the ETL pipeline results in a table with no statistics.
+        Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail fast</q> in this case
+        and immediately gather statistics or tune the problematic queries.
       </p>
 
       <p>
-        Some organizations might leave this option turned off. For example, you might have tables large enough that
-        the <codeph>COMPUTE STATS</codeph> takes substantial time to run, making it impractical to re-run after
-        loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that
-        they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that
-        case, you know that any queries that spill will not go overboard with their memory consumption.
+        Some organizations might leave this option turned off. For example, you might have
+        tables large enough that the <codeph>COMPUTE STATS</codeph> takes substantial time to
+        run, making it impractical to re-run after loading new data. If you have examined the
+        <codeph>EXPLAIN</codeph> plans of your queries and know that they are operating
+        efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that
+        case, you know that any queries that spill will not go overboard with their memory
+        consumption.
       </p>
 
     </conbody>
+
   </concept>
 
-<concept id="complex_query">
-<title>Limits on Query Size and Complexity</title>
-<conbody>
-<p>
-There are hardcoded limits on the maximum size and complexity of queries.
-Currently, the maximum number of expressions in a query is 2000.
-You might exceed the limits with large or deeply nested queries
-produced by business intelligence tools or other query generators.
-</p>
-<p>
-If you have the ability to customize such queries or the query generation
-logic that produces them, replace sequences of repetitive expressions
-with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph>
-that can represent multiple values or ranges.
-For example, instead of a large number of <codeph>OR</codeph> clauses:
-</p>
+  <concept id="complex_query">
+
+    <title>Limits on Query Size and Complexity</title>
+
+    <conbody>
+
+      <p>
+        There are hardcoded limits on the maximum size and complexity of queries. Currently, the
+        maximum number of expressions in a query is 2000. You might exceed the limits with large
+        or deeply nested queries produced by business intelligence tools or other query
+        generators.
+      </p>
+
+      <p>
+        If you have the ability to customize such queries or the query generation logic that
+        produces them, replace sequences of repetitive expressions with single operators such as
+        <codeph>IN</codeph> or <codeph>BETWEEN</codeph> that can represent multiple values or
+        ranges. For example, instead of a large number of <codeph>OR</codeph> clauses:
+      </p>
+
 <codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
 </codeblock>
-<p>
-use a single <codeph>IN</codeph> clause:
-</p>
+
+      <p>
+        use a single <codeph>IN</codeph> clause:
+      </p>
+
 <codeblock>WHERE val IN (1,2,6,100,...)</codeblock>
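
      Where the repeated values happen to form a contiguous range, a single BETWEEN predicate
      (mentioned above) works the same way; an illustrative fragment:

          WHERE val BETWEEN 1 AND 100
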
-</conbody>
-</concept>
 
-<concept id="scalability_io">
-<title>Scalability Considerations for Impala I/O</title>
-<conbody>
-<p>
-Impala parallelizes its I/O operations aggressively,
-therefore the more disks you can attach to each host, the better.
-Impala retrieves data from disk so quickly using
-bulk read operations on large blocks, that most queries
-are CPU-bound rather than I/O-bound.
-</p>
-<p>
-Because the kind of sequential scanning typically done by
-Impala queries does not benefit much from the random-access
-capabilities of SSDs, spinning disks typically provide
-the most cost-effective kind of storage for Impala data,
-with little or no performance penalty as compared to SSDs.
-</p>
-<p>
-Resource management features such as YARN, Llama, and admission control
-typically constrain the amount of memory, CPU, or overall number of
-queries in a high-concurrency environment.
-Currently, there is no throttling mechanism for Impala I/O.
-</p>
-</conbody>
-</concept>
+    </conbody>
+
+  </concept>
+
+  <concept id="scalability_io">
+
+    <title>Scalability Considerations for Impala I/O</title>
+
+    <conbody>
+
+      <p>
+        Impala parallelizes its I/O operations aggressively, so the more disks you can attach
+        to each host, the better. Impala retrieves data from disk so quickly, using bulk read
+        operations on large blocks, that most queries are CPU-bound rather than I/O-bound.
+      </p>
+
+      <p>
+        Because the kind of sequential scanning typically done by Impala queries does not
+        benefit much from the random-access capabilities of SSDs, spinning disks typically
+        provide the most cost-effective kind of storage for Impala data, with little or no
+        performance penalty as compared to SSDs.
+      </p>
+
+      <p>
+        Resource management features such as YARN, Llama, and admission control typically
+        constrain the amount of memory, CPU, or overall number of queries in a high-concurrency
+        environment. Currently, there is no throttling mechanism for Impala I/O.
+      </p>
+
+    </conbody>
+
+  </concept>
 
   <concept id="big_tables">
+
     <title>Scalability Considerations for Table Layout</title>
+
     <conbody>
+
       <p>
-        Due to the overhead of retrieving and updating table metadata
-        in the metastore database, try to limit the number of columns
-        in a table to a maximum of approximately 2000.
-        Although Impala can handle wider tables than this, the metastore overhead
-        can become significant, leading to query performance that is slower
-        than expected based on the actual data volume.
+        Due to the overhead of retrieving and updating table metadata in the metastore database,
+        try to limit the number of columns in a table to a maximum of approximately 2000.
+        Although Impala can handle wider tables than this, the metastore overhead can become
+        significant, leading to query performance that is slower than expected based on the
+        actual data volume.
       </p>
+
       <p>
-        To minimize overhead related to the metastore database and Impala query planning,
-        try to limit the number of partitions for any partitioned table to a few tens of thousands.
+        To minimize overhead related to the metastore database and Impala query planning, try to
+        limit the number of partitions for any partitioned table to a few tens of thousands.
       </p>
+
       <p rev="IMPALA-5309">
-        If the volume of data within a table makes it impractical to run exploratory
-        queries, consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing
-        to only a percentage of data within the table. This technique reduces the overhead
-        for query startup, I/O to read the data, and the amount of network, CPU, and memory
-        needed to process intermediate results during the query. See <xref keyref="tablesample"/>
-        for details.
+        If the volume of data within a table makes it impractical to run exploratory queries,
+        consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing to only
+        a percentage of data within the table. This technique reduces the overhead for query
+        startup, I/O to read the data, and the amount of network, CPU, and memory needed to
+        process intermediate results during the query. See <xref keyref="tablesample"/> for
+        details.
       </p>
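
      A brief sketch of the clause (the table name and sampling percentage are assumptions):

          -- Scan roughly 10 percent of the data files of an assumed large table.
          SELECT COUNT(*) FROM big_table TABLESAMPLE SYSTEM(10);
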
+
     </conbody>
+
   </concept>
 
-<concept rev="" id="kerberos_overhead_cluster_size">
-<title>Kerberos-Related Network Overhead for Large Clusters</title>
-<conbody>
-<p>
-When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of
-simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process
-all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process
-the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same
-time as these Kerberos-related requests.
-</p>
-<p>
-While these authentication requests are being processed, any submitted Impala queries will fail.
-During this period, the KDC and DNS may be slow to respond to requests from components other than Impala,
-so other secure services might be affected temporarily.
-</p>
-  <p>
-    In <keyword keyref="impala212_full"/> or earlier, to reduce the
-    frequency of the <codeph>kinit</codeph> renewal that initiates a new set
-    of authentication requests, increase the <codeph>kerberos_reinit_interval</codeph>
-    configuration setting for the <codeph>impalad</codeph> daemons. Currently,
-    the default is 60 minutes. Consider using a higher value such as 360 (6 hours).
-      </p>
-  <p>
-    The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed
-    in <keyword keyref="impala30_full"/>, and the above step is no longer needed.
-  </p>
-
-</conbody>
-</concept>
+  <concept rev="" id="kerberos_overhead_cluster_size">
+
+    <title>Kerberos-Related Network Overhead for Large Clusters</title>
+
+    <conbody>
+
+      <p>
+        When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a
+        number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might
+        be able to process all the requests within roughly 5 seconds. For a cluster with 1000
+        hosts, the time to process the requests would be roughly 500 seconds. Impala also makes
+        a number of DNS requests at the same time as these Kerberos-related requests.
+      </p>
+
+      <p>
+        While these authentication requests are being processed, any submitted Impala queries
+        will fail. During this period, the KDC and DNS may be slow to respond to requests from
+        components other than Impala, so other secure services might be affected temporarily.
+      </p>
+
+      <p>
+        In <keyword keyref="impala212_full"/> or earlier, to reduce the frequency of the
+        <codeph>kinit</codeph> renewal that initiates a new set of authentication requests,
+        increase the <codeph>kerberos_reinit_interval</codeph> configuration setting for the
+        <codeph>impalad</codeph> daemons. Currently, the default is 60 minutes. Consider using a
+        higher value such as 360 (6 hours).
+      </p>
+
+      <p>
+        The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed in
+        <keyword keyref="impala30_full"/>, and the above step is no longer needed.
+      </p>
+
+    </conbody>
+
+  </concept>
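
For releases where the flag still exists (Impala 2.12 or earlier), a minimal sketch of raising the renewal interval; how the flag reaches the daemon depends on your deployment tooling:

    # Illustration only: renew the Kerberos ticket every 6 hours instead of hourly.
    # (Remaining impalad startup flags are omitted here.)
    impalad --kerberos_reinit_interval=360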
 
   <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696">
+
     <title>Avoiding CPU Hotspots for HDFS Cached Data</title>
+
     <conbody>
+
       <p>
-        You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>,
-        with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
+        You can use the HDFS caching feature, described in
+        <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to reduce I/O and
+        memory-to-memory copying for frequently accessed tables or partitions.
       </p>
+
       <p>
         In the early days of this feature, you might have found that enabling HDFS caching
         resulted in little or no performance improvement, because it could result in
-        <q>hotspots</q>: instead of the I/O to read the table data being parallelized across
-        the cluster, the I/O was reduced but the CPU load to process the data blocks
-        might be concentrated on a single host.
+        <q>hotspots</q>: instead of the I/O to read the table data being parallelized across the
+        cluster, the I/O was reduced but the CPU load to process the data blocks might be
+        concentrated on a single host.
       </p>
+
       <p>
         To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the
-        <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching.
-        This clause allows more than one host to cache the relevant data blocks, so the CPU load
-        can be shared, reducing the load on any one host.
-        See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/>
-        for details.
+        <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that
+        use HDFS caching. This clause allows more than one host to cache the relevant data
+        blocks, so the CPU load can be shared, reducing the load on any one host. See
+        <xref href="impala_create_table.xml#create_table"/> and
+        <xref href="impala_alter_table.xml#alter_table"/> for details.
       </p>
+
       <p>
         Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to
-        the way that Impala schedules the work of processing data blocks on different hosts.
-        In <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work for
-        HDFS cached data is divided better among all the hosts that have cached replicas
-        for a particular data block. When more than one host has a cached replica for a data block,
-        Impala assigns the work of processing that block to whichever host has done the least work
-        (in terms of number of bytes read) for the current query. If hotspots persist even with this
-        load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph>
-        to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached
-        data block if the scheduling algorithm encounters a tie when deciding which host has done the
-        least work.
+        the way that Impala schedules the work of processing data blocks on different hosts. In
+        <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work
+        for HDFS cached data is divided better among all the hosts that have cached replicas for
+        a particular data block. When more than one host has a cached replica for a data block,
+        Impala assigns the work of processing that block to whichever host has done the least
+        work (in terms of number of bytes read) for the current query. If hotspots persist even
+        with this load-based scheduling algorithm, you can enable the query option
+        <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> to further distribute the CPU load. This
+        setting causes Impala to randomly pick a host to process a cached data block if the
+        scheduling algorithm encounters a tie when deciding which host has done the least work.
       </p>
+
     </conbody>
+
   </concept>
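
To make the two remedies above concrete, a hedged sketch with placeholder table, pool, and replication values:

    # Illustration only: cache the table on several hosts so the CPU load is shared.
    impala-shell -q "ALTER TABLE census SET CACHED IN 'pool_name' WITH REPLICATION = 3"
    # If hotspots persist, enable the query option in the same session before the query:
    #   SET SCHEDULE_RANDOM_REPLICA=TRUE;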
 
   <concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623">
+
     <title>Scalability Considerations for NameNode Traffic with File Handle Caching</title>
+
     <conbody>
+
       <p>
         One scalability aspect that affects heavily loaded clusters is the load on the HDFS
-        NameNode, from looking up the details as each HDFS file is opened. Impala queries
-        often access many different HDFS files, for example if a query does a full table scan
-        on a table with thousands of partitions, each partition containing multiple data files.
-        Accessing each column of a Parquet file also involves a separate <q>open</q> call,
-        further increasing the load on the NameNode. High NameNode overhead can add startup time
-        (that is, increase latency) to Impala queries, and reduce overall throughput for non-Impala
-        workloads that also require accessing HDFS files.
-      </p>
-      <p> In <keyword keyref="impala210_full"/> and higher, you can reduce
-        NameNode overhead by enabling a caching feature for HDFS file handles.
-        Data files that are accessed by different queries, or even multiple
-        times within the same query, can be accessed without a new <q>open</q>
-        call and without fetching the file details again from the NameNode. </p>
-      <p>
-        Because this feature only involves HDFS data files, it does not apply to non-HDFS tables,
-        such as Kudu or HBase tables, or tables that store their data on cloud services such as
-        S3 or ADLS. Any read operations that perform remote reads also skip the cached file handles.
-      </p>
-      <p> The feature is enabled by default with 20,000 file handles to be
-        cached. To change the value, set the configuration option
-          <codeph>max_cached_file_handles</codeph> to a non-zero value for each
-          <cmdname>impalad</cmdname> daemon. From the initial default value of
-        20000, adjust upward if NameNode request load is still significant, or
-        downward if it is more important to reduce the extra memory usage on
-        each host. Each cache entry consumes 6 KB, meaning that caching 20,000
-        file handles requires up to 120 MB on each Impala executor. The exact
-        memory usage varies depending on how many file handles have actually
-        been cached; memory is freed as file handles are evicted from the cache. </p>
-      <p>
-        If a manual HDFS operation moves a file to the HDFS Trashcan while the file handle is cached,
-        Impala still accesses the contents of that file. This is a change from prior behavior. Previously,
-        accessing a file that was in the trashcan would cause an error. This behavior only applies to
-        non-Impala methods of removing HDFS files, not the Impala mechanisms such as <codeph>TRUNCATE TABLE</codeph>
-        or <codeph>DROP TABLE</codeph>.
-      </p>
-      <p>
-        If files are removed, replaced, or appended by HDFS operations outside of Impala, the way to bring the
-        file information up to date is to run the <codeph>REFRESH</codeph> statement on the table.
-      </p>
-      <p>
-        File handle cache entries are evicted as the cache fills up, or based on a timeout period
-        when they have not been accessed for some time.
-      </p>
-      <p>
-        To evaluate the effectiveness of file handle caching for a particular workload, issue the
-        <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine query
-        profiles in the Impala web UI. Look for the ratio of <codeph>CachedFileHandlesHitCount</codeph>
-        (ideally, should be high) to <codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low).
-        Before starting any evaluation, run some representative queries to <q>warm up</q> the cache,
-        because the first time each data file is accessed is always recorded as a cache miss.
+        NameNode from looking up the details as each HDFS file is opened. Impala queries often
+        access many different HDFS files. For example, a query that does a full table scan on a
+        partitioned table may need to read thousands of partitions, each partition containing
+        multiple data files. Accessing each column of a Parquet file also involves a separate
+        <q>open</q> call, further increasing the load on the NameNode. High NameNode overhead
+        can add startup time (that is, increase latency) to Impala queries, and reduce overall
+        throughput for non-Impala workloads that also require accessing HDFS files.
+      </p>
+
+      <p>
+        In <keyword keyref="impala210_full"/> and higher, you can reduce NameNode overhead by
+        enabling a caching feature for HDFS file handles. Data files that are accessed by
+        different queries, or even multiple times within the same query, can be accessed without
+        a new <q>open</q> call and without fetching the file details again from the NameNode.
+      </p>
+
+      <p>
+        In Impala 3.2 and higher, file handle caching also applies to remote HDFS file handles.
+        This is controlled by the <codeph>cache_remote_file_handles</codeph> flag for an
+        <codeph>impalad</codeph>. It is recommended that you keep the default value of
+        <codeph>true</codeph> because this caching prevents your NameNode from being overloaded
+        when your cluster performs many remote HDFS reads.
+      </p>
+
+      <p>
+        Because this feature only involves HDFS data files, it does not apply to non-HDFS
+        tables, such as Kudu or HBase tables, or tables that store their data on cloud services
+        such as S3 or ADLS.
+      </p>
+
+      <p>
+        The feature is enabled by default with 20,000 file handles to be cached. To change the
+        value, set the configuration option <codeph>max_cached_file_handles</codeph> to a
+        non-zero value for each <cmdname>impalad</cmdname> daemon. From the initial default
+        value of 20000, adjust upward if NameNode request load is still significant, or downward
+        if it is more important to reduce the extra memory usage on each host. Each cache entry
+        consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each
+        Impala executor. The exact memory usage varies depending on how many file handles have
+        actually been cached; memory is freed as file handles are evicted from the cache.
+      </p>
+
+      <p>
+        If a manual HDFS operation moves a file to the HDFS Trashcan while the file handle is
+        cached, Impala still accesses the contents of that file. This is a change from prior
+        behavior. Previously, accessing a file that was in the trashcan would cause an error.
+        This behavior only applies to non-Impala methods of removing HDFS files, not the Impala
+        mechanisms such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>.
+      </p>
+
+      <p>
+        If files are removed, replaced, or appended by HDFS operations outside of Impala, the
+        way to bring the file information up to date is to run the <codeph>REFRESH</codeph>
+        statement on the table.
+      </p>
+
+      <p>
+        File handle cache entries are evicted as the cache fills up, or based on a timeout
+        period when they have not been accessed for some time.
+      </p>
+
+      <p>
+        To evaluate the effectiveness of file handle caching for a particular workload, issue
+        the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine
+        query profiles in the Impala Web UI. Look for the ratio of
+        <codeph>CachedFileHandlesHitCount</codeph> (ideally, should be high) to
+        <codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low). Before starting
+        any evaluation, run several representative queries to <q>warm up</q> the cache because
+        the first time each data file is accessed is always recorded as a cache miss.
+      </p>
+
+      <p>
         To see metrics about file handle caching for each <cmdname>impalad</cmdname> instance,
-        examine the <uicontrol>/metrics</uicontrol> page in the Impala web UI, in particular the fields
-        <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>,
+        examine the <uicontrol>/metrics</uicontrol> page in the Impala Web UI, in particular the
+        fields <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>,
         <uicontrol>impala-server.io.mgr.cached-file-handles-hit-count</uicontrol>, and
         <uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol>.
       </p>
+
     </conbody>
+
   </concept>
 
 </concept>
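
As a rough sketch of the startup flags discussed in this section (the cache size shown is illustrative, and how daemon arguments are passed depends on your deployment tooling):

    # Illustration only: enlarge the per-daemon file handle cache and keep
    # remote HDFS file handle caching at its default of enabled.
    impalad --max_cached_file_handles=40000 --cache_remote_file_handles=true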


[impala] 02/05: IMPALA-8154: Disable Kerberos auth_to_local setting

Posted by ar...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit bf96eb30a2b96a945fc7c10716252ea37dc665f5
Author: Michael Ho <kw...@cloudera.com>
AuthorDate: Thu Feb 7 11:48:27 2019 -0800

    IMPALA-8154: Disable Kerberos auth_to_local setting
    
    Before KRPC, the local name mapping was done from the principal name entirely.
    With KRPC, Impala started to use the system auth_to_local rules as the Kudu
    security code has "--use_system_auth_to_local=true" by default. This can cause
    a regression if local auth is configured in the krb5.conf (e.g. with SSSD with AD)
    as we started enforcing authorization based on Kerberos principal after this
    commit (https://github.com/apache/impala/commit/5c541b960491ba91533712144599fb3b6d99521d)
    
    This change fixes the problem by explicitly setting FLAGS_use_system_auth_to_local
    to false during initialization.
    
    Testing done: Enabled auth_to_local in a Kerberized cluster to map "impala/<hostname>"
    to foobar and verified queries still worked as expected.
    
    Change-Id: I0b0ad79b56cd5cdd3108c6f973e71a9416efbac8
    Reviewed-on: http://gerrit.cloudera.org:8080/12405
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/rpc/authentication.cc | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/be/src/rpc/authentication.cc b/be/src/rpc/authentication.cc
index 0b5b5d9..072ccdf 100644
--- a/be/src/rpc/authentication.cc
+++ b/be/src/rpc/authentication.cc
@@ -72,6 +72,9 @@ DECLARE_string(krb5_ccname);
 DECLARE_string(krb5_conf);
 DECLARE_string(krb5_debug_file);
 
+// Defined in kudu/security/init.cc
+DECLARE_bool(use_system_auth_to_local);
+
 DEFINE_string(sasl_path, "", "Colon separated list of paths to look for SASL "
     "security library plugins.");
 DEFINE_bool(enable_ldap_auth, false,
@@ -784,6 +787,8 @@ Status SaslAuthProvider::Start() {
   if (needs_kinit_) {
     DCHECK(is_internal_);
     DCHECK(!principal_.empty());
+    // IMPALA-8154: Disable any Kerberos auth_to_local mappings.
+    FLAGS_use_system_auth_to_local = false;
     // Starts a thread that periodically does a 'kinit'. The thread lives as long as the
     // process does.
     KUDU_RETURN_IF_ERROR(kudu::security::InitKerberosForServer(principal_, keytab_file_,


[impala] 03/05: IMPALA-8098: [DOCS] Port change for the SHUTDOWN command

Posted by ar...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 29cdaf022ca7310b955ecb30eeebf126ac389b14
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Tue Feb 5 18:04:20 2019 -0800

    IMPALA-8098: [DOCS] Port change for the SHUTDOWN command
    
    Change-Id: I77fcc13a67017a062f0be7cb15f22373bf66cb4c
    Reviewed-on: http://gerrit.cloudera.org:8080/12379
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Andrew Sherman <as...@cloudera.com>
    Reviewed-by: Tim Armstrong <ta...@cloudera.com>
---
 docs/impala_keydefs.ditamap                 |  2 ++
 docs/topics/impala_incompatible_changes.xml |  9 +++++++++
 docs/topics/impala_shutdown.xml             | 17 ++++++++++++++---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap
index 878e99d..39e84ef 100644
--- a/docs/impala_keydefs.ditamap
+++ b/docs/impala_keydefs.ditamap
@@ -10521,6 +10521,7 @@ under the License.
   <keydef href="https://issues.apache.org/jira/browse/IMPALA-9999" scope="external" format="html" keys="IMPALA-9999"/>
 
 <!-- Short form of mapping from Impala release to vendor-specific releases, for use in headings. -->
+  <keydef keys="impala32"><topicmeta><keywords><keyword>Impala 3.2</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala31"><topicmeta><keywords><keyword>Impala 3.1</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala30"><topicmeta><keywords><keyword>Impala 3.0</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala212"><topicmeta><keywords><keyword>Impala 2.12</keyword></keywords></topicmeta></keydef>
@@ -10583,6 +10584,7 @@ under the License.
   <keydef keys="impala132"><topicmeta><keywords><keyword>Impala 1.3.2</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala130"><topicmeta><keywords><keyword>Impala 1.3.0</keyword></keywords></topicmeta></keydef>
 
+  <keydef keys="impala32_full"><topicmeta><keywords><keyword>Impala 3.2</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala31_full"><topicmeta><keywords><keyword>Impala 3.1</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala30_full"><topicmeta><keywords><keyword>Impala 3.0</keyword></keywords></topicmeta></keydef>
   <keydef keys="impala212_full"><topicmeta><keywords><keyword>Impala 2.12</keyword></keywords></topicmeta></keydef>
diff --git a/docs/topics/impala_incompatible_changes.xml b/docs/topics/impala_incompatible_changes.xml
index ab88251..c4497a4 100644
--- a/docs/topics/impala_incompatible_changes.xml
+++ b/docs/topics/impala_incompatible_changes.xml
@@ -52,6 +52,15 @@ under the License.
 
     <p outputclass="toc inpage"/>
   </conbody>
+  <concept rev="3.2.0" id="concept_vnw_r4v_rgb">
+    <title>Incompatible Changes Introduced in Impala 3.2.x</title>
+    <conbody>
+      <p> For the full list of issues closed in this release, including any that
+        introduce behavior changes or incompatibilities, see the <xref
+          keyref="changelog_32">changelog for <keyword keyref="impala32"
+          /></xref>. </p>
+    </conbody>
+  </concept>
   <concept rev="3.1.0" id="new_features_31">
     <title>Incompatible Changes Introduced in Impala 3.1.x</title>
     <conbody>
diff --git a/docs/topics/impala_shutdown.xml b/docs/topics/impala_shutdown.xml
index 1677fac..cc636cd 100644
--- a/docs/topics/impala_shutdown.xml
+++ b/docs/topics/impala_shutdown.xml
@@ -109,12 +109,23 @@ under the License.
 
         </stentry>
 
-        <stentry><codeph>0</codeph> which is treated the same port as current
-            <codeph>impalad</codeph>
+        <stentry>
+
+          <ul>
+            <li>
+              In Impala 3.1, the same backend port as the current <codeph>impalad</codeph>.
+            </li>
+
+            <li>
+              In Impala 3.2 and higher, the KRPC port for the current <codeph>impalad</codeph>.
+            </li>
+          </ul>
 
         </stentry>
 
-        <stentry>n/a</stentry>
+        <stentry>Specifies the port by which the <codeph>impalad</codeph> can be
+          contacted. Use the KRPC port for your
+          <codeph>impalad</codeph>.</stentry>
 
       </strow>
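
A hedged sketch of what the documented port default means in practice; the host names are placeholders, and 27000 stands in for the target daemon's KRPC port:

    # Illustration only: issue the shutdown command from impala-shell.
    impala-shell -i coordinator-host
    [coordinator-host:21000] > :shutdown('target-host:27000');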
 


[impala] 04/05: [DOCS] Re-formatted impala_ldap.xml

Posted by ar...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 5f59eefeb71688913e24263a1724c95039e12c54
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Fri Feb 8 16:57:22 2019 -0800

    [DOCS] Re-formatted impala_ldap.xml
    
    Change-Id: I4b6e912ae08fa7d0b3a110aff3dd2d991b10f3d8
    Reviewed-on: http://gerrit.cloudera.org:8080/12418
    Reviewed-by: Alex Rodoni <ar...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 docs/topics/impala_ldap.xml | 443 ++++++++++++++++++++++++++++----------------
 1 file changed, 281 insertions(+), 162 deletions(-)

diff --git a/docs/topics/impala_ldap.xml b/docs/topics/impala_ldap.xml
index b01eb64..85d6b34 100644
--- a/docs/topics/impala_ldap.xml
+++ b/docs/topics/impala_ldap.xml
@@ -21,6 +21,7 @@ under the License.
 <concept id="ldap">
 
   <title>Enabling LDAP Authentication for Impala</title>
+
   <prolog>
     <metadata>
       <data name="Category" value="Security"/>
@@ -35,115 +36,124 @@ under the License.
 
   <conbody>
 
-<!-- Similar discussion under 'Authentication' parent topic. Maybe do some conref'ing or linking upward. -->
-
-    <p> Authentication is the process of allowing only specified named users to
-      access the server (in this case, the Impala server). This feature is
-      crucial for any production deployment, to prevent misuse, tampering, or
-      excessive load on the server. Impala uses LDAP for authentication,
-      verifying the credentials of each user who connects through
-        <cmdname>impala-shell</cmdname>, Hue, a Business Intelligence tool, JDBC
-      or ODBC application, and so on. </p>
+    <p>
+      Authentication is the process of allowing only specified named users to access the server
+      (in this case, the Impala server). This feature is crucial for any production deployment,
+      to prevent misuse, tampering, or excessive load on the server. Impala uses LDAP for
+      authentication, verifying the credentials of each user who connects through
+      <cmdname>impala-shell</cmdname>, Hue, a Business Intelligence tool, JDBC or ODBC
+      application, and so on.
+    </p>
 
     <note conref="../shared/impala_common.xml#common/authentication_vs_authorization"/>
 
+    <p outputclass="toc inpage"/>
+
     <p>
       An alternative form of authentication you can use is Kerberos, described in
       <xref href="impala_kerberos.xml#kerberos"/>.
     </p>
 
-    <p outputclass="toc inpage"/>
-
   </conbody>
 
   <concept id="ldap_prereqs">
 
     <title>Requirements for Using Impala with LDAP</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Requirements"/>
-      <data name="Category" value="Planning"/>
-    </metadata>
-  </prolog>
+
+    <prolog>
+      <metadata>
+        <data name="Category" value="Requirements"/>
+        <data name="Category" value="Planning"/>
+      </metadata>
+    </prolog>
 
     <conbody>
 
       <p rev="1.4.0">
-        Authentication against LDAP servers is available in Impala 1.2.2 and higher. Impala 1.4.0 adds support for
-        secure LDAP authentication through SSL and TLS.
+        Authentication against LDAP servers is available in Impala 1.2.2 and higher. Impala
+        1.4.0 adds support for secure LDAP authentication through SSL and TLS.
       </p>
 
       <p>
-        The Impala LDAP support lets you use Impala with systems such as Active Directory that use LDAP behind the
-        scenes.
+        The Impala LDAP support lets you use Impala with systems such as Active Directory that
+        use LDAP behind the scenes.
       </p>
+
     </conbody>
+
   </concept>
 
   <concept id="ldap_client_server">
 
-    <title>Client-Server Considerations for LDAP</title>
+    <title>Considerations for Connections Between Impala Components</title>
 
     <conbody>
 
       <p>
-        Only client-&gt;Impala connections can be authenticated by LDAP.
+        Only client-to-Impala connections can be authenticated by LDAP.
+      </p>
+
+      <p>
+        You must use the Kerberos authentication mechanism for connections between internal
+        Impala components, such as between the <cmdname>impalad</cmdname>,
+        <cmdname>statestored</cmdname>, and <cmdname>catalogd</cmdname> daemons. See
+        <xref
+          href="impala_kerberos.xml#kerberos"/> on how to set up Kerberos for
+        Impala.
       </p>
 
-      <p> You must use the Kerberos authentication mechanism for connections
-        between internal Impala components, such as between the
-          <cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and
-          <cmdname>catalogd</cmdname> daemons. See <xref
-          href="impala_kerberos.xml#kerberos" /> on how to set up Kerberos for
-        Impala. </p>
     </conbody>
+
   </concept>
 
   <concept id="ldap_config">
 
-    <title>Server-Side LDAP Setup</title>
+    <title>Enabling LDAP in the Command Line Interface</title>
 
     <conbody>
 
       <p>
-        These requirements apply on the server side when configuring and starting Impala:
+        To enable LDAP authentication, start the <cmdname>impalad</cmdname> daemon with the
+        following startup options:
       </p>
 
-      <p>
-        To enable LDAP authentication, set the following startup options for <cmdname>impalad</cmdname>:
-      </p>
+      <dl>
+        <dlentry>
+
+          <dt>
+            <codeph>--enable_ldap_auth</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Enables LDAP-based authentication between the client and Impala.
+            </p>
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            <codeph>--ldap_uri</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Sets the URI of the LDAP server to use. Typically, the URI is prefixed with
+              <codeph>ldap://</codeph>. You can specify secure SSL-based LDAP transport by using
+              the prefix <codeph>ldaps://</codeph>. The URI can optionally specify the port, for
+              example: <codeph>ldap://ldap_server.example.com:389</codeph> or
+              <codeph>ldaps://ldap_server.example.com:636</codeph>. (389 and 636 are the default
+              ports for non-SSL and SSL LDAP connections, respectively.)
+            </p>
+          </dd>
+
+        </dlentry>
+      </dl>
 
-      <ul>
-        <li>
-          <codeph>--enable_ldap_auth</codeph> enables LDAP-based authentication between the client and Impala.
-        </li>
-
-        <li rev="1.4.0">
-          <codeph>--ldap_uri</codeph> sets the URI of the LDAP server to use. Typically, the URI is prefixed with
-          <codeph>ldap://</codeph>. In Impala 1.4.0 and higher, you can specify secure SSL-based LDAP transport by
-          using the prefix <codeph>ldaps://</codeph>. The URI can optionally specify the port, for example:
-          <codeph>ldap://ldap_server.example.com:389</codeph> or
-          <codeph>ldaps://ldap_server.example.com:636</codeph>. (389 and 636 are the default ports for non-SSL and
-          SSL LDAP connections, respectively.)
-        </li>
-
-<!-- Some amount of this bullet could be conref'ed. Similar but not identical bullet occurs later under TLS. -->
-
-        <li rev="1.4.0">
-          For <codeph>ldaps://</codeph> connections secured by SSL,
-          <codeph>--ldap_ca_certificate="<varname>/path/to/certificate/pem</varname>"</codeph> specifies the
-          location of the certificate in standard <codeph>.PEM</codeph> format. Store this certificate on the local
-          filesystem, in a location that only the <codeph>impala</codeph> user and other trusted users can read.
-        </li>
-
-<!-- Per Henry: not for public consumption.
-<li>
-  If you need to provide a custom SASL configuration,
-  set <codeph>- -ldap_manual_config</codeph> to bypass all the automatic configuration.
-</li>
--->
-      </ul>
     </conbody>
+
   </concept>
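
A minimal sketch of the two startup options above; the LDAP server URI reuses the placeholder value from the text, and how flags reach the daemon depends on your deployment:

    # Illustration only: enable LDAP authentication against a placeholder server over SSL.
    impalad --enable_ldap_auth --ldap_uri=ldaps://ldap_server.example.com:636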
 
   <concept id="ldap_bind_strings">
@@ -153,42 +163,78 @@ under the License.
     <conbody>
 
       <p>
-        When Impala connects to LDAP it issues a bind call to the LDAP server to authenticate as the connected
-        user. Impala clients, including the Impala shell, provide the short name of the user to Impala. This is
-        necessary so that Impala can use Sentry for role-based access, which uses short names.
+        When Impala connects to LDAP it issues a bind call to the LDAP server to authenticate as
+        the connected user. Impala clients, including the Impala shell, provide the short name
+        of the user to Impala. This is necessary so that Impala can use Sentry for role-based
+        access, which uses short names.
       </p>
 
       <p>
-        However, LDAP servers often require more complex, structured usernames for authentication. Impala supports
-        three ways of transforming the short name (for example, <codeph>'henry'</codeph>) to a more complicated
-        string. If necessary, specify one of the following configuration options
-        when starting the <cmdname>impalad</cmdname> daemon on each DataNode:
+        However, LDAP servers often require more complex, structured usernames for
+        authentication. Impala supports three ways of transforming the short name (for example,
+        <codeph>'henry'</codeph>) to a more complicated string. If necessary, specify one of the
+        following configuration options when starting the <cmdname>impalad</cmdname> daemon.
       </p>
 
-      <ul>
-        <li>
-          <codeph>--ldap_domain</codeph>: Replaces the username with a string
-          <codeph><varname>username</varname>@<varname>ldap_domain</varname></codeph>.
-        </li>
-
-        <li>
-          <codeph>--ldap_baseDN</codeph>: Replaces the username with a <q>distinguished name</q> (DN) of the form:
-          <codeph>uid=<varname>userid</varname>,ldap_baseDN</codeph>. (This is equivalent to a Hive option).
-        </li>
-
-        <li>
-          <codeph>--ldap_bind_pattern</codeph>: This is the most general option, and replaces the username with the
-          string <varname>ldap_bind_pattern</varname> where all instances of the string <codeph>#UID</codeph> are
-          replaced with <varname>userid</varname>. For example, an <codeph>ldap_bind_pattern</codeph> of
-          <codeph>"user=#UID,OU=foo,CN=bar"</codeph> with a username of <codeph>henry</codeph> will construct a
-          bind name of <codeph>"user=henry,OU=foo,CN=bar"</codeph>.
-        </li>
-      </ul>
+      <dl>
+        <dlentry>
+
+          <dt>
+            <codeph>--ldap_domain</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Replaces the username with a string
+              <codeph><varname>username</varname>@<varname>ldap_domain</varname></codeph>.
+            </p>
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            <codeph>--ldap_baseDN</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Replaces the username with a <q>distinguished name</q> (DN) of the form:
+              <codeph>uid=<varname>userid</varname>,ldap_baseDN</codeph>. (This is equivalent to
+              a Hive option).
+            </p>
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            <codeph>--ldap_bind_pattern</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              This is the most general option, and replaces the username with the string
+              <varname>ldap_bind_pattern</varname> where all instances of the string
+              <codeph>#UID</codeph> are replaced with <varname>userid</varname>. For example, an
+              <codeph>ldap_bind_pattern</codeph> of <codeph>"user=#UID,OU=foo,CN=bar"</codeph>
+              with a username of <codeph>henry</codeph> will construct a bind name of
+              <codeph>"user=henry,OU=foo,CN=bar"</codeph>.
+            </p>
+          </dd>
+
+        </dlentry>
+      </dl>
 
       <p>
-        These options are mutually exclusive; Impala does not start if more than one of these options is specified.
+        The above options are mutually exclusive, and Impala does not start if more than one of
+        these options is specified.
       </p>
+
     </conbody>
+
   </concept>
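
A hedged sketch of picking one of the mutually exclusive name-mapping options; the bind pattern is the same illustrative value used in the text:

    # Illustration only: map short user names with a bind pattern (choose exactly one of
    # --ldap_domain, --ldap_baseDN, or --ldap_bind_pattern).
    impalad --enable_ldap_auth --ldap_uri=ldap://ldap_server.example.com:389 \
        --ldap_bind_pattern="user=#UID,OU=foo,CN=bar"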
 
   <concept id="ldap_security">
@@ -198,9 +244,9 @@ under the License.
     <conbody>
 
       <p>
-        To avoid sending credentials over the wire in cleartext, you must configure a secure connection between
-        both the client and Impala, and between Impala and the LDAP server. The secure connection could use SSL or
-        TLS.
+        To avoid sending credentials over the wire in cleartext, you must configure a secure
+        connection between both the client and Impala, and between Impala and the LDAP server.
+        The secure connection could use SSL or TLS.
       </p>
 
       <p>
@@ -208,8 +254,9 @@ under the License.
       </p>
 
       <p>
-        For SSL-enabled LDAP connections, specify a prefix of <codeph>ldaps://</codeph> instead of
-        <codeph>ldap://</codeph>. Also, the default port for SSL-enabled LDAP connections is 636 instead of 389.
+        For SSL-enabled LDAP connections, specify a prefix of <codeph>ldaps://</codeph> instead
+        of <codeph>ldap://</codeph>. Also, the default port for SSL-enabled LDAP connections is
+        636 instead of 389.
       </p>
 
       <p rev="1.4.0">
@@ -217,114 +264,182 @@ under the License.
       </p>
 
       <p>
-        <xref href="http://en.wikipedia.org/wiki/Transport_Layer_Security" scope="external" format="html">TLS</xref>,
-        the successor to the SSL protocol, is supported by most modern LDAP servers. Unlike SSL connections, TLS
-        connections can be made on the same server port as non-TLS connections. To secure all connections using
-        TLS, specify the following flags as startup options to the <cmdname>impalad</cmdname> daemon:
+        <xref href="http://en.wikipedia.org/wiki/Transport_Layer_Security"
+          scope="external" format="html">TLS</xref>,
+        the successor to the SSL protocol, is supported by most modern LDAP servers. Unlike SSL
+        connections, TLS connections can be made on the same server port as non-TLS connections.
+        To secure all connections using TLS, specify the following flags as startup options to
+        the <cmdname>impalad</cmdname> daemon.
       </p>
 
-      <ul>
-        <li>
-          <codeph>--ldap_tls</codeph> tells Impala to start a TLS connection to the LDAP server, and to fail
-          authentication if it cannot be done.
-        </li>
-
-        <li rev="1.4.0">
-          <codeph>--ldap_ca_certificate="<varname>/path/to/certificate/pem</varname>"</codeph> specifies the
-          location of the certificate in standard <codeph>.PEM</codeph> format. Store this certificate on the local
-          filesystem, in a location that only the <codeph>impala</codeph> user and other trusted users can read.
-        </li>
-      </ul>
+      <dl>
+        <dlentry>
+
+          <dt>
+            <codeph>--ldap_tls</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Tells Impala to start a TLS connection to the LDAP server, and to fail
+              authentication if it cannot be done.
+            </p>
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            <codeph>--ldap_ca_certificate</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Specifies the location of the certificate in standard <codeph>.PEM</codeph>
+              format. Store this certificate on the local filesystem, in a location that only
+              the <codeph>impala</codeph> user and other trusted users can read.
+            </p>
+          </dd>
+
+        </dlentry>
+      </dl>
+
     </conbody>
+
   </concept>
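
A sketch of securing the Impala-to-LDAP leg with TLS; the certificate path is the placeholder from the text:

    # Illustration only: require TLS to the LDAP server and trust a locally stored CA certificate.
    impalad --enable_ldap_auth --ldap_uri=ldap://ldap_server.example.com:389 \
        --ldap_tls --ldap_ca_certificate=/path/to/certificate/pem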
 
   <concept id="ldap_impala_shell">
 
-    <title>LDAP Authentication for impala-shell Interpreter</title>
+    <title>LDAP Authentication for impala-shell</title>
 
     <conbody>
 
       <p>
         To connect to Impala using LDAP authentication, you specify command-line options to the
-        <cmdname>impala-shell</cmdname> command interpreter and enter the password when prompted:
+        <cmdname>impala-shell</cmdname> command interpreter and enter the password when
+        prompted:
       </p>
 
-      <ul>
-        <li>
-          <codeph>-l</codeph> enables LDAP authentication.
-        </li>
+      <dl>
+        <dlentry>
 
-        <li>
-          <codeph>-u</codeph> sets the user. Per Active Directory, the user is the short username, not the full
-          LDAP distinguished name. If your LDAP settings include a search base, use the
-          <codeph>--ldap_bind_pattern</codeph> on the <cmdname>impalad</cmdname> daemon to translate the short user
-          name from <cmdname>impala-shell</cmdname> automatically to the fully qualified name.
-<!--
-include that as part of the
-username, for example <codeph>username@example.com</codeph>.
--->
-        </li>
+          <dt>
+            <codeph>-l</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Enables LDAP authentication.
+            </p>
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            <codeph>-u</codeph>
+          </dt>
+
+          <dd>
+            <p>
+              Sets the user. Per Active Directory, the user is the short username, not the full
+              LDAP distinguished name. If your LDAP settings include a search base, use the
+              <codeph>--ldap_bind_pattern</codeph> on the <cmdname>impalad</cmdname> daemon to
+              translate the short user name from <cmdname>impala-shell</cmdname> automatically
+              to the fully qualified name.
+            </p>
+          </dd>
 
-        <li>
-          <cmdname>impala-shell</cmdname> automatically prompts for the password.
-        </li>
-      </ul>
+        </dlentry>
+      </dl>
 
       <p>
-        For the full list of available <cmdname>impala-shell</cmdname> options, see
-        <xref href="impala_shell_options.xml#shell_options"/>.
+        <cmdname>impala-shell</cmdname> automatically prompts for the password.
       </p>
 
       <p>
-        <b>LDAP authentication for JDBC applications:</b> See <xref href="impala_jdbc.xml#impala_jdbc"/> for the
-        format to use with the JDBC connection string for servers using LDAP authentication.
+        See <xref href="impala_jdbc.xml#impala_jdbc"/> for the format to use with the JDBC
+        connection string for servers using LDAP authentication.
       </p>
+
     </conbody>
+
   </concept>
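
A short sketch of connecting with the options above; the user name and coordinator host are placeholders:

    # Illustration only: impala-shell prompts for the LDAP password after this command.
    impala-shell -l -u henry -i coordinator-host.example.com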
+
   <concept id="ldap_impala_hue">
+
     <title>Enabling LDAP for Impala in Hue</title>
+
     <prolog>
       <metadata>
         <data name="Category" value="Hue"/>
       </metadata>
     </prolog>
+
     <conbody>
+
       <section id="ldap_impala_hue_cmdline">
-        <title>Enabling LDAP for Impala in Hue Using the Command Line</title>
-        <p>LDAP authentication for the Impala app in Hue can be enabled by
-          setting the following properties under the <codeph>[impala]</codeph>
-          section in <codeph>hue.ini</codeph>. <table id="ldap_impala_hue_configs">
-            <tgroup cols="2">
-              <colspec colname="1" colwidth="1*" />
-              <colspec colname="2" colwidth="2*" />
-              <tbody>
-                <row>
-                  <entry><codeph>auth_username</codeph></entry>
-                  <entry>LDAP username of Hue user to be authenticated.</entry>
-                </row>
-                <row>
-                  <entry><codeph>auth_password</codeph></entry>
-                  <entry>
-                    <p>LDAP password of Hue user to be authenticated.</p>
-                  </entry>
-                </row>
-              </tbody>
-            </tgroup>
-          </table>These login details are only used by Impala to authenticate to
-          LDAP. The Impala service trusts Hue to have already validated the user
-          being impersonated, rather than simply passing on the credentials.</p>
+
+        <title>Enabling LDAP for Impala in Hue Using the Command Line Interface</title>
+
+        <p>
+          LDAP authentication for the Impala app in Hue can be enabled by setting the following
+          properties under the <codeph>[impala]</codeph> section in <codeph>hue.ini</codeph>.
+          <dl>
+            <dlentry>
+
+              <dt>
+                <codeph>auth_username</codeph>
+              </dt>
+
+              <dd>
+                <p>
+                  LDAP username of Hue user to be authenticated.
+                </p>
+              </dd>
+
+            </dlentry>
+
+            <dlentry>
+
+              <dt>
+                <codeph>auth_password</codeph>
+              </dt>
+
+              <dd>
+                <p>
+                  LDAP password of Hue user to be authenticated.
+                </p>
+              </dd>
+
+            </dlentry>
+          </dl>
+          These login details are only used by Impala to authenticate to LDAP. The Impala
+          service trusts Hue to have already validated the user being impersonated, rather than
+          simply passing on the credentials.
+        </p>
+
       </section>
+
     </conbody>
+
   </concept>
 
   <concept id="ldap_delegation">
+
     <title>Enabling Impala Delegation for LDAP Users</title>
+
     <conbody>
+
       <p>
-        See <xref href="impala_delegation.xml#delegation"/> for details about the delegation feature
-        that lets certain users submit queries using the credentials of other users.
+        See <xref href="impala_delegation.xml#delegation"/> for details about the delegation
+        feature that lets certain users submit queries using the credentials of other users.
       </p>
+
     </conbody>
+
   </concept>
 
   <concept id="ldap_restrictions">
@@ -334,8 +449,12 @@ username, for example <codeph>username@example.com</codeph>.
     <conbody>
 
       <p>
-        The LDAP support is preliminary. It currently has only been tested against Active Directory.
+        The LDAP support is preliminary. It currently has only been tested against Active
+        Directory.
       </p>
+
     </conbody>
+
   </concept>
+
 </concept>


[impala] 01/05: IMPALA-8170: [DOCS] Added a section on load balancing proxy with TLS

Posted by ar...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 13be8cd8b510cffd4533918c4526733e89a6f26b
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Wed Feb 6 15:28:31 2019 -0800

    IMPALA-8170: [DOCS] Added a section on load balancing proxy with TLS
    
    Change-Id: I92185b456623e08841bc27b5dbbe09ace99294aa
    Reviewed-on: http://gerrit.cloudera.org:8080/12388
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Michael Ho <kw...@cloudera.com>
---
 docs/topics/impala_proxy.xml | 444 +++++++++++++++++++++++++++----------------
 1 file changed, 281 insertions(+), 163 deletions(-)

diff --git a/docs/topics/impala_proxy.xml b/docs/topics/impala_proxy.xml
index ae0887a..453c485 100644
--- a/docs/topics/impala_proxy.xml
+++ b/docs/topics/impala_proxy.xml
@@ -21,7 +21,13 @@ under the License.
 <concept id="proxy">
 
   <title>Using Impala through a Proxy for High Availability</title>
-  <titlealts audience="PDF"><navtitle>Load-Balancing Proxy for HA</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>Load-Balancing Proxy for HA</navtitle>
+
+  </titlealts>
+
   <prolog>
     <metadata>
       <data name="Category" value="High Availability"/>
@@ -37,13 +43,14 @@ under the License.
   <conbody>
 
     <p>
-      For most clusters that have multiple users and production availability requirements, you might set up a proxy
-      server to relay requests to and from Impala.
+      For most clusters that have multiple users and production availability requirements, you
+      might set up a proxy server to relay requests to and from Impala.
     </p>
 
     <p>
-      Currently, the Impala statestore mechanism does not include such proxying and load-balancing features. Set up
-      a software package of your choice to perform these functions.
+      Currently, the Impala statestore mechanism does not include such proxying and
+      load-balancing features. Set up a software package of your choice to perform these
+      functions.
     </p>
 
     <note>
@@ -57,11 +64,12 @@ under the License.
   <concept id="proxy_overview">
 
     <title>Overview of Proxy Usage and Load Balancing for Impala</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Concepts"/>
-    </metadata>
-  </prolog>
+
+    <prolog>
+      <metadata>
+        <data name="Category" value="Concepts"/>
+      </metadata>
+    </prolog>
 
     <conbody>
 
@@ -71,84 +79,85 @@ under the License.
 
       <ul>
         <li>
-          Applications connect to a single well-known host and port, rather than keeping track of the hosts where
-          the <cmdname>impalad</cmdname> daemon is running.
+          Applications connect to a single well-known host and port, rather than keeping track
+          of the hosts where the <cmdname>impalad</cmdname> daemon is running.
         </li>
 
         <li>
-          If any host running the <cmdname>impalad</cmdname> daemon becomes unavailable, application connection
-          requests still succeed because you always connect to the proxy server rather than a specific host running
-          the <cmdname>impalad</cmdname> daemon.
+          If any host running the <cmdname>impalad</cmdname> daemon becomes unavailable,
+          application connection requests still succeed because you always connect to the proxy
+          server rather than a specific host running the <cmdname>impalad</cmdname> daemon.
         </li>
 
         <li>
-          The coordinator node for each Impala query potentially requires
-          more memory and CPU cycles than the other nodes that process the
-          query. The proxy server can issue queries so that each connection uses
-          a different coordinator node. This load-balancing technique lets the
-          Impala nodes share this additional work, rather than concentrating it
-          on a single machine.
+          The coordinator node for each Impala query potentially requires more memory and CPU
+          cycles than the other nodes that process the query. The proxy server can issue queries
+          so that each connection uses a different coordinator node. This load-balancing
+          technique lets the Impala nodes share this additional work, rather than concentrating
+          it on a single machine.
         </li>
       </ul>
 
       <p>
-        The following setup steps are a general outline that apply to any load-balancing proxy software:
+        The following setup steps are a general outline that apply to any load-balancing proxy
+        software:
       </p>
 
       <ol>
         <li>
-          Select and download the load-balancing proxy software or other
-          load-balancing hardware appliance. It should only need to be installed
-          and configured on a single host, typically on an edge node. Pick a
-          host other than the DataNodes where <cmdname>impalad</cmdname> is
-          running, because the intention is to protect against the possibility
-          of one or more of these DataNodes becoming unavailable.
+          Select and download the load-balancing proxy software or other load-balancing hardware
+          appliance. It should only need to be installed and configured on a single host,
+          typically on an edge node. Pick a host other than the DataNodes where
+          <cmdname>impalad</cmdname> is running, because the intention is to protect against the
+          possibility of one or more of these DataNodes becoming unavailable.
         </li>
 
         <li>
-          Configure the load balancer (typically by editing a configuration file).
-          In particular:
+          Configure the load balancer (typically by editing a configuration file). In
+          particular:
           <ul>
             <li>
-              Set up a port that the load balancer will listen on to relay
-              Impala requests back and forth. </li>
+              Set up a port that the load balancer will listen on to relay Impala requests back
+              and forth.
+            </li>
+
             <li>
-              See <xref href="#proxy_balancing" format="dita"/> for load
-              balancing algorithm options.
+              See <xref href="#proxy_balancing" format="dita"/> for load balancing algorithm
+              options.
             </li>
+
             <li>
-              For Kerberized clusters, follow the instructions in <xref
+              For Kerberized clusters, follow the instructions in
+              <xref
                 href="impala_proxy.xml#proxy_kerberos"/>.
             </li>
           </ul>
         </li>
 
         <li>
-          If you are using Hue or JDBC-based applications, you typically set
-          up load balancing for both ports 21000 and 21050, because these client
-          applications connect through port 21050 while the
-            <cmdname>impala-shell</cmdname> command connects through port
-          21000. See <xref href="impala_ports.xml#ports"/> for when to use port
-          21000, 21050, or another value depending on what type of connections
-          you are load balancing.
+          If you are using Hue or JDBC-based applications, you typically set up load balancing
+          for both ports 21000 and 21050, because these client applications connect through port
+          21050 while the <cmdname>impala-shell</cmdname> command connects through port 21000.
+          See <xref href="impala_ports.xml#ports"/> for when to use port 21000, 21050, or
+          another value depending on what type of connections you are load balancing.
         </li>
 
         <li>
-          Run the load-balancing proxy server, pointing it at the configuration file that you set up.
+          Run the load-balancing proxy server, pointing it at the configuration file that you
+          set up.
         </li>
 
         <li>
-          For any scripts, jobs, or configuration settings for applications
-          that formerly connected to a specific DataNode to run Impala SQL
-          statements, change the connection information (such as the
-            <codeph>-i</codeph> option in <cmdname>impala-shell</cmdname>) to
-          point to the load balancer instead.
+          For any scripts, jobs, or configuration settings for applications that formerly
+          connected to a specific DataNode to run Impala SQL statements, change the connection
+          information (such as the <codeph>-i</codeph> option in
+          <cmdname>impala-shell</cmdname>) to point to the load balancer instead.
         </li>
       </ol>
 
       <note>
-        The following sections use the HAProxy software as a representative example of a load balancer
-        that you can use with Impala.
+        The following sections use the HAProxy software as a representative example of a load
+        balancer that you can use with Impala.
       </note>
 
     </conbody>
@@ -156,106 +165,126 @@ under the License.
   </concept>
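
As a hedged illustration of the last step above (the proxy host name is a placeholder; 21000 is the impala-shell port mentioned in the text):

    # Illustration only: clients connect to the load balancer instead of an individual impalad.
    impala-shell -i proxy-host.example.com:21000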
 
   <concept id="proxy_balancing" rev="">
+
     <title>Choosing the Load-Balancing Algorithm</title>
+
     <conbody>
+
       <p>
-        Load-balancing software offers a number of algorithms to distribute requests.
-        Each algorithm has its own characteristics that make it suitable in some situations
-        but not others.
+        Load-balancing software offers a number of algorithms to distribute requests. Each
+        algorithm has its own characteristics that make it suitable in some situations but not
+        others.
       </p>
 
       <dl>
         <dlentry>
-          <dt>Leastconn</dt>
+
+          <dt>
+            Leastconn
+          </dt>
+
           <dd>
-            Connects sessions to the coordinator with the fewest connections,
-            to balance the load evenly. Typically used for workloads consisting
-            of many independent, short-running queries. In configurations with
-            only a few client machines, this setting can avoid having all
-            requests go to only a small set of coordinators.
+            Connects sessions to the coordinator with the fewest connections, to balance the
+            load evenly. Typically used for workloads consisting of many independent,
+            short-running queries. In configurations with only a few client machines, this
+            setting can avoid having all requests go to only a small set of coordinators.
           </dd>
+
           <dd>
             Recommended for Impala with F5.
           </dd>
+
         </dlentry>
+
         <dlentry>
-          <dt>Source IP Persistence</dt>
+
+          <dt>
+            Source IP Persistence
+          </dt>
+
           <dd>
             <p>
-              Sessions from the same IP address always go to the same
-              coordinator. A good choice for Impala workloads containing a mix
-              of queries and DDL statements, such as <codeph>CREATE TABLE</codeph>
-              and <codeph>ALTER TABLE</codeph>. Because the metadata changes from
-              a DDL statement take time to propagate across the cluster, prefer
-              to use the Source IP Persistence in this case. If you are unable
-              to choose Source IP Persistence, run the DDL and subsequent queries
-              that depend on the results of the DDL through the same session,
-              for example by running <codeph>impala-shell -f <varname>script_file</varname></codeph>
-              to submit several statements through a single session.
+              Sessions from the same IP address always go to the same coordinator. A good choice
+              for Impala workloads containing a mix of queries and DDL statements, such as
+              <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>. Because the
+              metadata changes from a DDL statement take time to propagate across the cluster,
+              prefer to use the Source IP Persistence in this case. If you are unable to choose
+              Source IP Persistence, run the DDL and subsequent queries that depend on the
+              results of the DDL through the same session, for example by running
+              <codeph>impala-shell -f <varname>script_file</varname></codeph> to submit several
+              statements through a single session.
             </p>
           </dd>
+
           <dd>
             <p>
               Required for setting up high availability with Hue.
             </p>
           </dd>
+
         </dlentry>
+
         <dlentry>
-          <dt>Round-robin</dt>
+
+          <dt>
+            Round-robin
+          </dt>
+
           <dd>
             <p>
-              Distributes connections to all coordinator nodes.
-              Typically not recommended for Impala.
+              Distributes connections to all coordinator nodes. Typically not recommended for
+              Impala.
             </p>
           </dd>
+
         </dlentry>
       </dl>
 
       <p>
-        You might need to perform benchmarks and load testing to determine
-        which setting is optimal for your use case. Always set up using two
-        load-balancing algorithms: Source IP Persistence for Hue and Leastconn
-        for others.
+        You might need to perform benchmarks and load testing to determine which setting is
+        optimal for your use case. Always set up two load-balancing algorithms: Source IP
+        Persistence for Hue and Leastconn for other clients.
       </p>
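+
+      <p>
+        For example, with HAProxy you select the algorithm for each listener through the
+        <codeph>balance</codeph> directive. The following sketch is illustrative only; the
+        listener ports, host names, and symbolic server names are placeholders to replace with
+        your own values:
+      </p>
+
+<codeblock># Illustrative only: Leastconn for most clients, Source IP Persistence
+# for the listener that Hue connects to.
+listen impalashell :25003
+    mode tcp
+    balance leastconn
+    server impala1 impala-host-1.example.com:21000 check
+    server impala2 impala-host-2.example.com:21000 check
+
+listen impalahue :21051
+    mode tcp
+    balance source
+    server impala1 impala-host-1.example.com:21050 check
+    server impala2 impala-host-2.example.com:21050 check</codeblock>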
 
     </conbody>
+
   </concept>
 
   <concept id="proxy_kerberos">
 
     <title>Special Proxy Considerations for Clusters Using Kerberos</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Security"/>
-      <data name="Category" value="Kerberos"/>
-      <data name="Category" value="Authentication"/>
-      <data name="Category" value="Proxy"/>
-    </metadata>
-  </prolog>
+
+    <prolog>
+      <metadata>
+        <data name="Category" value="Security"/>
+        <data name="Category" value="Kerberos"/>
+        <data name="Category" value="Authentication"/>
+        <data name="Category" value="Proxy"/>
+      </metadata>
+    </prolog>
 
     <conbody>
 
       <p>
-        In a cluster using Kerberos, applications check host credentials to
-        verify that the host they are connecting to is the same one that is
-        actually processing the request, to prevent man-in-the-middle attacks.
+        In a cluster using Kerberos, applications check host credentials to verify that the host
+        they are connecting to is the same one that is actually processing the request, to
+        prevent man-in-the-middle attacks.
       </p>
+
       <p>
-        In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower
-        versions, once you enable a proxy server in a Kerberized cluster, users
-        will not be able to connect to individual impala daemons directly from
-        impala-shell.
+        In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower versions, once you
+        enable a proxy server in a Kerberized cluster, users will not be able to connect to
+        individual impala daemons directly from impala-shell.
       </p>
 
       <p>
-        In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher,
-        if you enable a proxy server in a Kerberized cluster, users have an
-        option to connect to Impala daemons directly from
-          <cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> /
-          <codeph>--kerberos_host_fqdn</codeph> option when you start
-          <cmdname>impala-shell</cmdname>. This option can be used for testing or
-        troubleshooting purposes, but not recommended for live production
-        environments as it defeats the purpose of a load balancer/proxy.
+        In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher, if you enable a
+        proxy server in a Kerberized cluster, users have the option to connect to Impala daemons
+        directly from <cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> /
+        <codeph>--kerberos_host_fqdn</codeph> option when you start
+        <cmdname>impala-shell</cmdname>. This option can be used for testing or troubleshooting
+        purposes, but it is not recommended for live production environments because it defeats
+        the purpose of a load balancer/proxy.
       </p>
 
       <p>
@@ -266,77 +295,76 @@ impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
       </p>
 
       <p>
-        Alternatively, with the fully qualified
-        configurations:
+        Alternatively, use the fully qualified forms of the options:
 <codeblock>impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com</codeblock>
       </p>
+
       <p>
-        See <xref href="impala_shell_options.xml#shell_options"/> for
-        information about the option.
+        See <xref href="impala_shell_options.xml#shell_options"/> for information about the
+        option.
       </p>
 
       <p>
-        To clarify that the load-balancing proxy server is legitimate, perform
-        these extra Kerberos setup steps:
+        To clarify that the load-balancing proxy server is legitimate, perform these extra
+        Kerberos setup steps:
       </p>
 
       <ol>
         <li>
           This section assumes you are starting with a Kerberos-enabled cluster. See
-          <xref href="impala_kerberos.xml#kerberos"/> for instructions for setting up Impala with Kerberos. See
-          <xref keyref="cdh_sg_kerberos_prin_keytab_deploy"/> for general steps to set up Kerberos.
+          <xref href="impala_kerberos.xml#kerberos"/> for instructions for setting up Impala
+          with Kerberos. See <xref keyref="cdh_sg_kerberos_prin_keytab_deploy"/> for general
+          steps to set up Kerberos.
         </li>
 
         <li>
-          Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should
-          already have an entry <codeph>impala/<varname>proxy_host</varname>@<varname>realm</varname></codeph> in
-          its keytab. If not, go back over the initial Kerberos configuration steps for the keytab on each host
-          running the <cmdname>impalad</cmdname> daemon.
+          Choose the host you will use for the proxy server. Based on the Kerberos setup
+          procedure, it should already have an entry
+          <codeph>impala/<varname>proxy_host</varname>@<varname>realm</varname></codeph> in its
+          keytab. If not, go back over the initial Kerberos configuration steps for the keytab
+          on each host running the <cmdname>impalad</cmdname> daemon.
         </li>
 
         <li>
-          Copy the keytab file from the proxy host to all other hosts in the cluster that run the
-          <cmdname>impalad</cmdname> daemon. (For optimal performance, <cmdname>impalad</cmdname> should be running
-          on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.
+          Copy the keytab file from the proxy host to all other hosts in the cluster that run
+          the <cmdname>impalad</cmdname> daemon. (For optimal performance,
+          <cmdname>impalad</cmdname> should be running on all DataNodes in the cluster.) Put the
+          keytab file in a secure location on each of these other hosts.
         </li>
 
         <li>
-          Add an entry <codeph>impala/<varname>actual_hostname</varname>@<varname>realm</varname></codeph> to the keytab on each
-          host running the <cmdname>impalad</cmdname> daemon.
+          Add an entry
+          <codeph>impala/<varname>actual_hostname</varname>@<varname>realm</varname></codeph> to
+          the keytab on each host running the <cmdname>impalad</cmdname> daemon.
         </li>
 
         <li>
-
-         For each impalad node, merge the existing keytab with the proxy’s keytab using
+          For each impalad node, merge the existing keytab with the proxy’s keytab using
           <cmdname>ktutil</cmdname>, producing a new keytab file. For example:
-          <codeblock>$ ktutil
+<codeblock>$ ktutil
   ktutil: read_kt proxy.keytab
   ktutil: read_kt impala.keytab
   ktutil: write_kt proxy_impala.keytab
   ktutil: quit</codeblock>
-
         </li>
 
         <li>
-
           To verify that the keytabs are merged, run the command:
 <codeblock>
 klist -k <varname>keytabfile</varname>
 </codeblock>
-          which lists the credentials for both <codeph>principal</codeph> and <codeph>be_principal</codeph> on
-          all nodes.
+          which lists the credentials for both <codeph>principal</codeph> and
+          <codeph>be_principal</codeph> on all nodes.
         </li>
 
-
         <li>
-
-          Make sure that the <codeph>impala</codeph> user has permission to read this merged keytab file.
-
+          Make sure that the <codeph>impala</codeph> user has permission to read this merged
+          keytab file.
         </li>
 
         <li>
-          Change the following configuration settings for each host in the cluster that participates
-          in the load balancing:
+          Change the following configuration settings for each host in the cluster that
+          participates in the load balancing:
           <ul>
             <li>
               In the <cmdname>impalad</cmdname> option definition, add:
@@ -346,51 +374,139 @@ klist -k <varname>keytabfile</varname>
   --keytab_file=<i>path_to_merged_keytab</i>
 </codeblock>
               <note>
-                Every host has different <codeph>--be_principal</codeph> because the actual hostname
-                is different on each host.
-
-                Specify the fully qualified domain name (FQDN) for the proxy host, not the IP
-                address. Use the exact FQDN as returned by a reverse DNS lookup for the associated
-                IP address.
-
+                Every host has a different <codeph>--be_principal</codeph> because the actual
+                hostname is different on each host. Specify the fully qualified domain name
+                (FQDN) for the proxy host, not the IP address. Use the exact FQDN as returned by
+                a reverse DNS lookup for the associated IP address. An illustrative example of
+                these settings appears after this list of steps.
               </note>
             </li>
 
             <li>
-              Modify the startup options. See <xref href="impala_config_options.xml#config_options"/> for the procedure to modify the startup
-              options.
+              Modify the startup options. See
+              <xref href="impala_config_options.xml#config_options"/> for the procedure to
+              modify the startup options.
             </li>
           </ul>
         </li>
 
         <li>
-          Restart Impala to make the changes take effect. Restart the <cmdname>impalad</cmdname> daemons on all
-          hosts in the cluster, as well as the <cmdname>statestored</cmdname> and <cmdname>catalogd</cmdname>
-          daemons.
+          Restart Impala to make the changes take effect. Restart the <cmdname>impalad</cmdname>
+          daemons on all hosts in the cluster, as well as the <cmdname>statestored</cmdname> and
+          <cmdname>catalogd</cmdname> daemons.
         </li>
-
       </ol>
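+
+      <p>
+        As an illustration only, with a proxy host <codeph>loadbalancer-1.mydomain.com</codeph>,
+        an Impala daemon host <codeph>impalad-1.mydomain.com</codeph>, and the Kerberos realm
+        <codeph>MYDOMAIN.COM</codeph>, the merged settings on that daemon host might look like
+        the following. Substitute your own host names, realm, and keytab location:
+      </p>
+
+<codeblock>--principal=impala/loadbalancer-1.mydomain.com@MYDOMAIN.COM
+--be_principal=impala/impalad-1.mydomain.com@MYDOMAIN.COM
+--keytab_file=/etc/impala/conf/proxy_impala.keytab</codeblock>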
 
     </conbody>
 
   </concept>
 
+  <concept id="proxy_tls">
+
+    <title>Special Proxy Considerations for TLS/SSL Enabled Clusters</title>
+
+    <prolog>
+      <metadata>
+        <data name="Category" value="Security"/>
+        <data name="Category" value="TLS"/>
+        <data name="Category" value="Authentication"/>
+        <data name="Category" value="Proxy"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        When TLS/SSL is enabled for Impala, the client application, whether impala-shell, Hue,
+        or something else, expects the certificate common name (CN) to match the hostname that
+        it is connected to. With no load balancing proxy server, the hostname and certificate CN
+        are both that of the <codeph>impalad</codeph> instance. However, with a proxy server,
+        the certificate presented by the <codeph>impalad</codeph> instance does not match the
+        load balancing proxy server hostname. If you try to load-balance a TLS/SSL-enabled
+        Impala installation without additional configuration, you see a certificate mismatch
+        error when a client attempts to connect to the load balancing proxy host.
+      </p>
+
+      <p>
+        You can configure a proxy server in several ways to load balance TLS/SSL-enabled Impala:
+      </p>
+
+      <dl>
+        <dlentry>
+
+          <dt>
+            Client/Server SSL
+          </dt>
+
+          <dd>
+            In this configuration, the proxy server presents an SSL certificate to the client,
+            decrypts the client request, then re-encrypts the request before sending it to the
+            backend <codeph>impalad</codeph>. The client and server certificates can be managed
+            separately. The request or resulting payload is encrypted in transit at all times.
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            TLS/SSL Passthrough
+          </dt>
+
+          <dd>
+            In this configuration, traffic passes through to the backend
+            <codeph>impalad</codeph> instance with no interaction from the load balancing proxy
+            server. Traffic is still encrypted end-to-end.
+          </dd>
+
+          <dd>
+            The same server certificate, utilizing either wildcard or Subject Alternate Name
+            (SAN), must be installed on each <codeph>impalad</codeph> instance.
+          </dd>
+
+        </dlentry>
+
+        <dlentry>
+
+          <dt>
+            TLS/SSL Offload
+          </dt>
+
+          <dd>
+            In this configuration, all traffic is decrypted on the load balancing proxy server,
+            and traffic between the backend <codeph>impalad</codeph> instances is unencrypted.
+            This configuration presumes that cluster hosts reside on a trusted network and that
+            only external client-facing communication needs to be encrypted in transit.
+          </dd>
+
+        </dlentry>
+      </dl>
+
+      <p>
+        Refer to your load balancer documentation for the steps to set up Impala and the load
+        balancer using one of the options above.
+      </p>
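+
+      <p>
+        As a rough sketch only, the three approaches might look like the following in HAProxy.
+        These are alternatives, not a single configuration; the listener names, ports, host
+        names, and certificate paths are placeholders, and your load balancer version and
+        security policies may require different settings:
+      </p>
+
+<codeblock># Client/Server SSL: terminate TLS at the proxy, then re-encrypt to each impalad.
+listen impala_reencrypt
+    bind :21000 ssl crt /etc/haproxy/certs/proxy.pem
+    mode tcp
+    balance leastconn
+    server impala1 impala-host-1.example.com:21000 ssl verify required ca-file /path/to/ca.pem check
+
+# TLS/SSL Passthrough: forward the encrypted stream untouched; each impalad
+# presents its own wildcard or SAN certificate.
+listen impala_passthrough :21000
+    mode tcp
+    balance leastconn
+    server impala1 impala-host-1.example.com:21000 check
+
+# TLS/SSL Offload: terminate TLS at the proxy; traffic to the backends is unencrypted.
+listen impala_offload
+    bind :21000 ssl crt /etc/haproxy/certs/proxy.pem
+    mode tcp
+    balance leastconn
+    server impala1 impala-host-1.example.com:21000 check</codeblock>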
+
+    </conbody>
+
+  </concept>
+
   <concept id="tut_proxy">
 
     <title>Example of Configuring HAProxy Load Balancer for Impala</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Configuring"/>
-    </metadata>
-  </prolog>
+
+    <prolog>
+      <metadata>
+        <data name="Category" value="Configuring"/>
+      </metadata>
+    </prolog>
 
     <conbody>
 
       <p>
         If you are not already using a load-balancing proxy, you can experiment with
+        <xref href="http://haproxy.1wt.eu/" scope="external" format="html">HAProxy</xref>, a
+        free, open source load balancer. This example shows how you might install and configure
+        that load balancer on a Red Hat Enterprise Linux system.
+        <xref href="http://haproxy.1wt.eu/" scope="external" format="html">HAProxy</xref> a
+        free, open source load balancer. This example shows how you might install and configure
+        that load balancer on a Red Hat Enterprise Linux system.
       </p>
 
       <ul>
@@ -402,24 +518,26 @@ klist -k <varname>keytabfile</varname>
 
         <li>
           <p>
-            Set up the configuration file: <filepath>/etc/haproxy/haproxy.cfg</filepath>. See the following section
-            for a sample configuration file.
+            Set up the configuration file: <filepath>/etc/haproxy/haproxy.cfg</filepath>. See
+            the following section for a sample configuration file.
           </p>
         </li>
 
         <li>
           <p>
-            Run the load balancer (on a single host, preferably one not running <cmdname>impalad</cmdname>):
+            Run the load balancer (on a single host, preferably one not running
+            <cmdname>impalad</cmdname>):
           </p>
 <codeblock>/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg</codeblock>
         </li>
 
         <li>
           <p>
-            In <cmdname>impala-shell</cmdname>, JDBC applications, or ODBC applications, connect to the listener
-            port of the proxy host, rather than port 21000 or 21050 on a host actually running <cmdname>impalad</cmdname>.
-            The sample configuration file sets haproxy to listen on port 25003, therefore you would send all
-            requests to <codeph><varname>haproxy_host</varname>:25003</codeph>.
+            In <cmdname>impala-shell</cmdname>, JDBC applications, or ODBC applications, connect
+            to the listener port of the proxy host, rather than port 21000 or 21050 on a host
+            actually running <cmdname>impalad</cmdname>. The sample configuration file sets
+            haproxy to listen on port 25003, so you would send all requests to
+            <codeph><varname>haproxy_host</varname>:25003</codeph>, as in the example after this
+            list.
           </p>
         </li>
       </ul>
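+
+      <p>
+        For example, assuming the proxy host is <codeph><varname>haproxy_host</varname></codeph>
+        and the listener port is 25003, an <cmdname>impala-shell</cmdname> connection through
+        the proxy might look like:
+      </p>
+
+<codeblock>impala-shell -i <varname>haproxy_host</varname>:25003</codeblock>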
@@ -514,12 +632,12 @@ listen impalajdbc :21051
     server <varname>symbolic_name_7</varname> impala-host-3.example.com:21050 check
     server <varname>symbolic_name_8</varname> impala-host-4.example.com:21050 check
 </codeblock>
+
       <note type="important">
-        Hue requires the <codeph>check</codeph> option at end of each line in
-        the above file to ensure HAProxy can detect any unreachable Impalad
-        server, and failover can be successful. Without the TCP check, you may hit
-        an error when the <cmdname>impalad</cmdname> daemon to which Hue tries to
-        connect is down.
+        Hue requires the <codeph>check</codeph> option at the end of each line in the above file
+        so that HAProxy can detect any unreachable <cmdname>impalad</cmdname> server and fail
+        over successfully. Without the TCP check, you may hit an error when the
+        <cmdname>impalad</cmdname> daemon to which Hue tries to connect is down.
       </note>
 
       <note conref="../shared/impala_common.xml#common/proxy_jdbc_caveat"/>