Posted to commits@impala.apache.org by to...@apache.org on 2019/05/21 04:37:27 UTC

[impala] branch master updated (5f1b00c -> 7af981f)

This is an automated email from the ASF dual-hosted git repository.

todd pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git.


    from 5f1b00c  Fix docs for catalogd automatic invalidate flags
     new aaf1cf0  [DOCS] Added back the note about Deflate not supported for text files.
     new 2d605cc  IMPALA-8490: [DOCS] Describe the S3 file handle caching feature
     new e21764b  IMPALA-8116: [DOCS] A new doc for Impala Scaling Limits
     new 7af981f  fe: clean up POM and improve m2e integration

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.


Summary of changes:
 docs/impala.ditamap                                |   1 +
 docs/topics/impala_file_formats.xml                |   4 +-
 docs/topics/impala_scalability.xml                 | 177 ++++++----
 docs/topics/impala_scaling_limits.xml              | 364 +++++++++++++++++++++
 fe/pom.xml                                         |  84 ++---
 .../java/org/apache/impala/util/NativeLibUtil.java |  46 ++-
 impala-parent/pom.xml                              |   5 -
 7 files changed, 555 insertions(+), 126 deletions(-)
 create mode 100644 docs/topics/impala_scaling_limits.xml


[impala] 04/04: fe: clean up POM and improve m2e integration


commit 7af981f7afdd25c6e0ed7e86c21bbc7710098a9e
Author: Todd Lipcon <to...@apache.org>
AuthorDate: Wed May 15 13:40:13 2019 -0700

    fe: clean up POM and improve m2e integration
    
    This cleans up a few items in the FE pom:
    
    - removes old system properties like 'beeswax_port', 'impalad', and
      'use_external_impalad' which have been dead code since 2013.
    
    - adds appropriate configuration for m2e to avoid some errors upon
      importing the project (configures various Maven plugins to ignore or
      execute via M2E)
    
    - sets up m2e to generate classes into target/eclipse-classes so that
      auto-rebuilds don't break running daemons (carrying over the same idea
      from the old eclipse:eclipse support)
    
    Additionally, this stops passing the be build directory to
    java.library.path via a system property, since that system property
    doesn't get forwarded into the Eclipse test run configurations. Instead,
    I added some code to NativeLibUtil to locate the be build directory and
    search the appropriate path explicitly.
    
    Change-Id: Ifdad9610d858d488eb95cb947ed123fe1ebfe62a
    Reviewed-on: http://gerrit.cloudera.org:8080/13365
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Todd Lipcon <to...@apache.org>
---
 fe/pom.xml                                         | 84 +++++++++++-----------
 .../java/org/apache/impala/util/NativeLibUtil.java | 46 +++++++++---
 impala-parent/pom.xml                              |  5 --
 3 files changed, 78 insertions(+), 57 deletions(-)

diff --git a/fe/pom.xml b/fe/pom.xml
index d6c9908..fdf3272 100644
--- a/fe/pom.xml
+++ b/fe/pom.xml
@@ -431,7 +431,12 @@ under the License.
     </plugins>
   </reporting>
 
+  <properties>
+    <buildOutputDirectory>${project.build.directory}/classes</buildOutputDirectory>
+  </properties>
+
   <build>
+    <outputDirectory>${buildOutputDirectory}</outputDirectory>
     <plugins>
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
@@ -542,21 +547,7 @@ under the License.
           <trimStackTrace>false</trimStackTrace>
           <reportsDirectory>${surefire.reports.dir}</reportsDirectory>
           <redirectTestOutputToFile>true</redirectTestOutputToFile>
-          <argLine>-Djava.library.path=${java.library.path}:${backend.library.path} ${surefireJacocoArg}</argLine>
-          <systemProperties>
-            <property>
-              <name>testExecutionMode</name>
-              <value>${testExecutionMode}</value>
-            </property>
-            <property>
-              <name>beeswax_port</name>
-              <value>${beeswax_port}</value>
-              <name>impalad</name>
-              <value>${impalad}</value>
-              <name>use_external_impalad</name>
-              <value>${use_external_impalad}</value>
-            </property>
-          </systemProperties>
+          <argLine>${surefireJacocoArg}</argLine>
         </configuration>
       </plugin>
 
@@ -588,31 +579,6 @@ under the License.
       </plugin>
 
       <plugin>
-        <groupId>org.codehaus.mojo</groupId>
-        <artifactId>exec-maven-plugin</artifactId>
-        <version>1.4.0</version>
-        <executions>
-          <execution>
-            <goals>
-              <goal>java</goal>
-            </goals>
-          </execution>
-        </executions>
-          <configuration>
-            <systemProperties>
-              <systemProperty>
-                <key>java.library.path</key>
-                <value>${java.library.path}:${backend.library.path}</value>
-              </systemProperty>
-              <systemProperty>
-                <key>test.hive.testdata</key>
-                <value>${project.basedir}/../testdata/target/AllTypes.txt</value>
-              </systemProperty>
-            </systemProperties>
-          </configuration>
-      </plugin>
-
-      <plugin>
         <groupId>org.jacoco</groupId>
         <artifactId>jacoco-maven-plugin</artifactId>
         <version>0.7.6.201602180812</version>
@@ -732,17 +698,34 @@ under the License.
                     <versionRange>[2.0,)</versionRange>
                     <goals>
                       <goal>copy-dependencies</goal>
+                      <goal>build-classpath</goal>
                     </goals>
                   </pluginExecutionFilter>
                   <action>
                     <ignore></ignore>
                   </action>
                 </pluginExecution>
+                <pluginExecution>
+                  <pluginExecutionFilter>
+                    <groupId>de.jflex</groupId>
+                    <artifactId>maven-jflex-plugin</artifactId>
+                    <versionRange>[1.4.3,)</versionRange>
+                    <goals>
+                      <goal>generate</goal>
+                    </goals>
+                  </pluginExecutionFilter>
+                  <action>
+                    <execute></execute>
+                  </action>
+                </pluginExecution>
               </pluginExecutions>
             </lifecycleMappingMetadata>
           </configuration>
         </plugin>
-        <!-- mvn eclipse:eclipse generates Eclipse .project and .classpath files -->
+        <!-- mvn eclipse:eclipse generates Eclipse .project and .classpath files
+             NOTE: This is a deprecated Maven plugin. It's recommended to use
+             the native Eclipse "import Maven project" functionality (m2e)
+             instead -->
         <plugin>
           <groupId>org.apache.maven.plugins </groupId>
           <artifactId>maven-eclipse-plugin</artifactId>
@@ -1125,6 +1108,25 @@ under the License.
         </plugins>
       </build>
     </profile>
+
+    <!-- Profile which is automatically activated when building from
+         within Eclipse based on the presence of the m2e.version
+         property -->
+    <profile>
+      <id>eclipse-m2e</id>
+      <activation>
+        <property>
+          <name>m2e.version</name>
+        </property>
+      </activation>
+        <!-- By default, we separate Eclipse-built files from Maven-built
+             files. Otherwise, they are both in target/classes, and Eclipse
+             and Maven may clobber each other, complicating attaching to
+             a running process. -->
+        <properties>
+          <buildOutputDirectory>${project.build.directory}/${eclipse.output.directory}</buildOutputDirectory>
+        </properties>
+    </profile>
   </profiles>
 
   <dependencyManagement>
diff --git a/fe/src/main/java/org/apache/impala/util/NativeLibUtil.java b/fe/src/main/java/org/apache/impala/util/NativeLibUtil.java
index 877b36b..ceb1cf6 100644
--- a/fe/src/main/java/org/apache/impala/util/NativeLibUtil.java
+++ b/fe/src/main/java/org/apache/impala/util/NativeLibUtil.java
@@ -18,27 +18,51 @@
 package org.apache.impala.util;
 
 import java.io.File;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.base.Joiner;
 
 public class NativeLibUtil {
+  private final static Logger LOG = LoggerFactory.getLogger(NativeLibUtil.class);
+
   /**
-   * Attempts to load the given library from all paths in java.libary.path.
+   * Attempts to load the given library from all paths in java.libary.path,
+   * as well as the current build directory (assuming we are in a test environment).
    * Throws a RuntimeException if the library was unable to be loaded from
    * any location.
    */
   public static void loadLibrary(String libFileName) {
-    boolean found = false;
-    String javaLibPath = System.getProperty("java.library.path");
-    for (String path: javaLibPath.split(":")) {
+    List<String> candidates = new ArrayList<>(Arrays.asList(
+        System.getProperty("java.library.path").split(":")));
+
+    // Fall back to automatically finding the library in test environments.
+    // This makes it easier to run tests from Eclipse without specially configuring
+    // the Run Configurations.
+    try {
+      String myPath = NativeLibUtil.class.getProtectionDomain()
+          .getCodeSource().getLocation().getPath();
+      if (myPath.toString().endsWith("fe/target/classes/") ||
+          myPath.toString().endsWith("fe/target/eclipse-classes/")) {
+        candidates.add(myPath + "../../../be/build/latest/service/");
+      }
+    } catch (Exception e) {
+      LOG.warn("Unable to get path for NativeLibUtil class", e);
+    }
+
+    for (String path: candidates) {
       File libFile = new File(path + File.separator + libFileName);
       if (libFile.exists()) {
         System.load(libFile.getPath());
-        found = true;
-        break;
+        return;
       }
     }
-    if (!found) {
-      throw new RuntimeException("Failed to load " + libFileName + " from any " +
-          "location in java.library.path (" + javaLibPath + ").");
-    }
+
+    throw new RuntimeException("Failed to load " + libFileName + " from any " +
+        "candidate location:\n" + Joiner.on("\n").join(candidates));
   }
-}
\ No newline at end of file
+}
diff --git a/impala-parent/pom.xml b/impala-parent/pom.xml
index 355bc05..62a6dc5 100644
--- a/impala-parent/pom.xml
+++ b/impala-parent/pom.xml
@@ -30,11 +30,6 @@ under the License.
     <jacoco.skip>true</jacoco.skip>
     <jacoco.data.file>${env.IMPALA_FE_TEST_COVERAGE_DIR}/jacoco.exec</jacoco.data.file>
     <jacoco.report.dir>${env.IMPALA_FE_TEST_COVERAGE_DIR}</jacoco.report.dir>
-    <test.hive.testdata>${project.basedir}/../testdata/target/AllTypes.txt</test.hive.testdata>
-    <backend.library.path>${env.IMPALA_HOME}/be/build/debug/service:${env.IMPALA_HOME}/be/build/release/service</backend.library.path>
-    <beeswax_port>21000</beeswax_port>
-    <impalad>localhost</impalad>
-    <testExecutionMode>reduced</testExecutionMode>
     <hadoop.version>${env.IMPALA_HADOOP_VERSION}</hadoop.version>
     <hive.version>${env.IMPALA_HIVE_VERSION}</hive.version>
     <hive.storage.api.version>2.3.0.${env.IMPALA_HIVE_VERSION}</hive.storage.api.version>


[impala] 02/04: IMPALA-8490: [DOCS] Describe the S3 file handle caching feature


commit 2d605cc107b307c0699449c777d8a0761f8c6257
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Thu May 16 16:48:17 2019 -0700

    IMPALA-8490: [DOCS] Describe the S3 file handle caching feature
    
    Change-Id: I304a0a033475f2289d8a620448d70b90447e4ee1
    Reviewed-on: http://gerrit.cloudera.org:8080/13357
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Sahil Takiar <st...@cloudera.com>
    Reviewed-by: Alex Rodoni <ar...@cloudera.com>
---
 docs/topics/impala_scalability.xml | 177 +++++++++++++++++++++++--------------
 1 file changed, 109 insertions(+), 68 deletions(-)

diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml
index c1264fa..a7a6ca4 100644
--- a/docs/topics/impala_scalability.xml
+++ b/docs/topics/impala_scalability.xml
@@ -178,11 +178,6 @@ Memory Usage: Additional Notes
 
     <conbody>
 
-      <p audience="hidden">
-        Details to fill in in future: Impact of <q>load catalog in background</q> option.
-        Changing timeouts.
-      </p>
-
       <p>
         Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized
         for tables containing relatively few, large data files. Schemas containing thousands of
@@ -193,13 +188,16 @@ Memory Usage: Additional Notes
       <note type="important" rev="TSB-168">
         <p>
           Because of a change in the default heap size for the <cmdname>catalogd</cmdname>
-          daemon in <keyword keyref="impala25_full"/> and higher, the following procedure to
-          increase the <cmdname>catalogd</cmdname> memory limit might be required following an
-          upgrade to <keyword keyref="impala25_full"/> even if not needed previously.
+          daemon in <keyword
+            keyref="impala25_full"/> and higher, the following
+          procedure to increase the <cmdname>catalogd</cmdname> memory limit might be required
+          following an upgrade to <keyword keyref="impala25_full"/> even if not needed
+          previously.
         </p>
       </note>
 
-      <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/>
+      <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"
+      />
 
     </conbody>
 
@@ -343,8 +341,11 @@ Memory Usage: Additional Notes
 
       <p>
         The buffer pool feature includes some query options that you can fine-tune:
-        <xref keyref="buffer_pool_limit"/>, <xref keyref="default_spillable_buffer_size"/>,
-        <xref keyref="max_row_size"/>, and <xref keyref="min_spillable_buffer_size"/>.
+        <xref keyref="buffer_pool_limit"/>,
+        <xref
+          keyref="default_spillable_buffer_size"/>,
+        <xref keyref="max_row_size"
+        />, and <xref keyref="min_spillable_buffer_size"/>.
       </p>
 
       <p>
@@ -368,7 +369,7 @@ Memory Usage: Additional Notes
 
     <conbody>
 
-      <p></p>
+      <p/>
 
     </conbody>
 
@@ -380,7 +381,7 @@ Memory Usage: Additional Notes
 
     <conbody>
 
-      <p></p>
+      <p/>
 
     </conbody>
 
@@ -410,9 +411,10 @@ Memory Usage: Additional Notes
       <note rev="2.10.0 IMPALA-3200">
         <p>
           In <keyword keyref="impala210"/> and higher, also see
-          <xref keyref="scalability_buffer_pool"/> for changes to Impala memory allocation that
-          might change the details of which queries spill to disk, and how much memory and disk
-          space is involved in the spilling operation.
+          <xref
+            keyref="scalability_buffer_pool"/> for changes to Impala memory
+          allocation that might change the details of which queries spill to disk, and how much
+          memory and disk space is involved in the spilling operation.
         </p>
       </note>
 
@@ -468,13 +470,15 @@ Memory Usage: Additional Notes
         -->
       </ul>
 
-      <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>
+      <p
+        conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>
 
       <p>
         <b>How Impala handles scratch disk space for spilling:</b>
       </p>
 
-      <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
+      <p rev="obwl"
+        conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
 
       <p>
         <b>Memory usage for SQL operators:</b>
@@ -547,8 +551,9 @@ Memory Usage: Additional Notes
         Impala 1.4. This feature was extended to cover join queries, aggregation functions, and
         analytic functions in Impala 2.0. The size of the memory work area required by each
         operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
-        <ph rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take advantage of
-        the Impala buffer pool feature and be more predictable and stable in
+        <ph
+          rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take
+        advantage of the Impala buffer pool feature and be more predictable and stable in
         <keyword keyref="impala210_full"/>.</ph>
       </p>
 
@@ -571,9 +576,12 @@ Memory Usage: Additional Notes
               <cmdname>impala-shell</cmdname> interpreter. This data shows the memory usage for
               each host and in total across the cluster. The <codeph>WriteIoBytes</codeph>
               counter reports how much data was written to disk for each operator during the
-              query. (In <keyword keyref="impala29_full"/>, the counter was named
-              <codeph>ScratchBytesWritten</codeph>; in <keyword keyref="impala28_full"/> and
-              earlier, it was named <codeph>BytesWritten</codeph>.)
+              query. (In <keyword
+                keyref="impala29_full"/>, the counter was
+              named <codeph>ScratchBytesWritten</codeph>; in
+              <keyword
+                keyref="impala28_full"/> and earlier, it was named
+              <codeph>BytesWritten</codeph>.)
             </li>
 
             <li>
@@ -601,12 +609,16 @@ Memory Usage: Additional Notes
               available to Impala and reduce the amount of memory required on each node.
             </li>
 
-            <li> Add more memory to the hosts running Impala daemons. </li>
+            <li>
+              Add more memory to the hosts running Impala daemons.
+            </li>
 
             <li>
               On a cluster with resources shared between Impala and other Hadoop components, use
               resource management features to allocate more memory for Impala. See
-              <xref href="impala_resource_management.xml#resource_management"/> for details.
+              <xref
+                href="impala_resource_management.xml#resource_management"/>
+              for details.
             </li>
 
             <li>
@@ -614,8 +626,9 @@ Memory Usage: Additional Notes
               memory-intensive ones, consider using the Impala admission control feature to
               lower the limit on the number of concurrent queries. By spacing out the most
               resource-intensive queries, you can avoid spikes in memory usage and improve
-              overall response times. See <xref href="impala_admission.xml#admission_control"/>
-              for details.
+              overall response times. See
+              <xref
+                href="impala_admission.xml#admission_control"/> for details.
             </li>
 
             <li>
@@ -635,7 +648,9 @@ Memory Usage: Additional Notes
                 <li>
                   Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy
                   being used for the most resource-intensive queries. See
-                  <xref href="impala_explain_plan.xml#perf_explain"/> for details.
+                  <xref href="impala_explain_plan.xml#perf_explain"
+                  /> for
+                  details.
                 </li>
 
                 <li>
@@ -643,7 +658,8 @@ Memory Usage: Additional Notes
                   available, or if it is impractical to keep the statistics up to date for huge
                   or rapidly changing tables, add hints to the most resource-intensive queries
                   to select the right execution strategy. See
-                  <xref href="impala_hints.xml#hints"/> for details.
+                  <xref
+                    href="impala_hints.xml#hints"/> for details.
                 </li>
               </ul>
             </li>
@@ -652,7 +668,9 @@ Memory Usage: Additional Notes
               If your queries experience substantial performance overhead due to spilling,
               enable the <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option
               prevents queries whose memory usage is likely to be exorbitant from spilling to
-              disk. See <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/>
+              disk. See
+              <xref
+                href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/>
               for details. As you tune problematic queries using the preceding steps, fewer and
               fewer will be cancelled by this option setting.
             </li>
@@ -952,8 +970,9 @@ these tables, hint the plan or disable this behavior via query options to enable
 
       <p>
         You can use the HDFS caching feature, described in
-        <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to reduce I/O and
-        memory-to-memory copying for frequently accessed tables or partitions.
+        <xref
+          href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to
+        reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
       </p>
 
       <p>
@@ -969,8 +988,10 @@ these tables, hint the plan or disable this behavior via query options to enable
         <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that
         use HDFS caching. This clause allows more than one host to cache the relevant data
         blocks, so the CPU load can be shared, reducing the load on any one host. See
-        <xref href="impala_create_table.xml#create_table"/> and
-        <xref href="impala_alter_table.xml#alter_table"/> for details.
+        <xref
+          href="impala_create_table.xml#create_table"/> and
+        <xref
+          href="impala_alter_table.xml#alter_table"/> for details.
       </p>
 
       <p>
@@ -993,40 +1014,52 @@ these tables, hint the plan or disable this behavior via query options to enable
 
   <concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623">
 
-    <title>Scalability Considerations for NameNode Traffic with File Handle Caching</title>
+    <title>Scalability Considerations for File Handle Caching</title>
 
     <conbody>
 
       <p>
-        One scalability aspect that affects heavily loaded clusters is the load on the HDFS
-        NameNode from looking up the details as each HDFS file is opened. Impala queries often
-        access many different HDFS files. For example, a query that does a full table scan on a
-        partitioned table may need to read thousands of partitions, each partition containing
-        multiple data files. Accessing each column of a Parquet file also involves a separate
-        <q>open</q> call, further increasing the load on the NameNode. High NameNode overhead
-        can add startup time (that is, increase latency) to Impala queries, and reduce overall
+        One scalability aspect that affects heavily loaded clusters is the load on the metadata
+        layer from looking up the details as each file is opened. On HDFS, that can lead to
+        increased load on the NameNode, and on S3, this can lead to an excessive number of S3
+        metadata requests. For example, a query that does a full table scan on a partitioned
+        table may need to read thousands of partitions, each partition containing multiple data
+        files. Accessing each column of a Parquet file also involves a separate <q>open</q>
+        call, further increasing the load on the NameNode. High NameNode overhead can add
+        startup time (that is, increase latency) to Impala queries, and reduce overall
         throughput for non-Impala workloads that also require accessing HDFS files.
       </p>
 
       <p>
-        In <keyword keyref="impala210_full"/> and higher, you can reduce NameNode overhead by
-        enabling a caching feature for HDFS file handles. Data files that are accessed by
-        different queries, or even multiple times within the same query, can be accessed without
-        a new <q>open</q> call and without fetching the file details again from the NameNode.
+        You can reduce the number of calls made to your file system's metadata layer by enabling
+        the file handle caching feature. Data files that are accessed by different queries, or
+        even multiple times within the same query, can be accessed without a new <q>open</q>
+        call and without fetching the file details multiple times.
       </p>
 
       <p>
-        In Impala 3.2 and higher, file handle caching also applies to remote HDFS file handles.
-        This is controlled by the <codeph>cache_remote_file_handles</codeph> flag for an
-        <codeph>impalad</codeph>. It is recommended that you use the default value of
-        <codeph>true</codeph> as this caching prevents your NameNode from overloading when your
-        cluster has many remote HDFS reads.
-      </p>
+        Impala supports file handle caching for the following file systems:
+        <ul>
+          <li>
+            HDFS in <keyword keyref="impala210_full"/> and higher
+            <p>
+              In Impala 3.2 and higher, file handle caching also applies to remote HDFS file
+              handles. This is controlled by the <codeph>cache_remote_file_handles</codeph> flag
+              for an <codeph>impalad</codeph>. It is recommended that you use the default value
+              of <codeph>true</codeph> as this caching prevents your NameNode from overloading
+              when your cluster has many remote HDFS reads.
+            </p>
+          </li>
 
-      <p>
-        Because this feature only involves HDFS data files, it does not apply to non-HDFS
-        tables, such as Kudu or HBase tables, or tables that store their data on cloud services
-        such as S3 or ADLS.
+          <li>
+            S3 in <keyword keyref="impala33_full"/> and higher
+            <p>
+              The <codeph>cache_s3_file_handles</codeph> <codeph>impalad</codeph> flag controls
+              the S3 file handle caching. The feature is enabled by default with the flag set to
+              <codeph>true</codeph>.
+            </p>
+          </li>
+        </ul>
       </p>
 
       <p>
@@ -1041,17 +1074,17 @@ these tables, hint the plan or disable this behavior via query options to enable
       </p>
 
       <p>
-        If a manual HDFS operation moves a file to the HDFS Trashcan while the file handle is
-        cached, Impala still accesses the contents of that file. This is a change from prior
-        behavior. Previously, accessing a file that was in the trashcan would cause an error.
-        This behavior only applies to non-Impala methods of removing HDFS files, not the Impala
-        mechanisms such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>.
+        If a manual operation moves a file to the trashcan while the file handle is cached,
+        Impala still accesses the contents of that file. This is a change from prior behavior.
+        Previously, accessing a file that was in the trashcan would cause an error. This
+        behavior only applies to non-Impala methods of removing files, not the Impala mechanisms
+        such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>.
       </p>
 
       <p>
-        If files are removed, replaced, or appended by HDFS operations outside of Impala, the
-        way to bring the file information up to date is to run the <codeph>REFRESH</codeph>
-        statement on the table.
+        If files are removed, replaced, or appended by operations outside of Impala, the way to
+        bring the file information up to date is to run the <codeph>REFRESH</codeph> statement
+        on the table.
       </p>
 
       <p>
@@ -1071,12 +1104,20 @@ these tables, hint the plan or disable this behavior via query options to enable
 
       <p>
         To see metrics about file handle caching for each <cmdname>impalad</cmdname> instance,
-        examine the <uicontrol>/metrics</uicontrol> page in the Impala Web UI, in particular the
-        fields <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>,
-        <uicontrol>impala-server.io.mgr.cached-file-handles-hit-count</uicontrol>, and
-        <uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol>.
+        examine the following fields on the <uicontrol>/metrics</uicontrol> page in the Impala
+        Web UI:
       </p>
 
+      <ul>
+        <li>
+          <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>
+        </li>
+
+        <li>
+          <uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol>
+        </li>
+      </ul>
+
     </conbody>
 
   </concept>


[impala] 03/04: IMPALA-8116: [DOCS] A new doc for Impala Scaling Limits


commit e21764b912b3118ee62defc07a568aa7e84790e5
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Tue May 7 16:27:03 2019 -0700

    IMPALA-8116: [DOCS] A new doc for Impala Scaling Limits
    
    - Listed the known/tested SCALING Limits.
    - Unknown limits are marked hidden for now. When the numbers
    are available, will remove the hidden tag.
    
    Change-Id: Ie6df672e5de1fb2d34f6b78524e8f20e85ea34fb
    Reviewed-on: http://gerrit.cloudera.org:8080/13277
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Tim Armstrong <ta...@cloudera.com>
---
 docs/impala.ditamap                   |   1 +
 docs/topics/impala_scaling_limits.xml | 364 ++++++++++++++++++++++++++++++++++
 2 files changed, 365 insertions(+)

diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 2468e4c..ed69762 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -297,6 +297,7 @@ under the License.
     <topicref audience="hidden" href="topics/impala_perf_ddl.xml"/>
   </topicref>
   <topicref href="topics/impala_scalability.xml">
+    <topicref href="topics/impala_scaling_limits.xml"/>
     <topicref href="topics/impala_dedicated_coordinator.xml"/>
     <topicref href="topics/impala_metadata.xml"/>
   </topicref>
diff --git a/docs/topics/impala_scaling_limits.xml b/docs/topics/impala_scaling_limits.xml
new file mode 100644
index 0000000..ba82406
--- /dev/null
+++ b/docs/topics/impala_scaling_limits.xml
@@ -0,0 +1,364 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="impala_scaling_limits">
+
+  <title>Scaling Limits and Guidelines</title>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Scalability"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      This topic lists the <i>scalability</i> limitations in Impala. For a given functional
+      feature, it is recommended that you stay within these limits to achieve optimal
+      scalability and performance. For example, while you might be able to create a table with
+      2000 columns, you will experience performance problems while querying the table. This
+      topic does not cover functional limitations in Impala.
+    </p>
+
+    <p>
+      Unless noted otherwise, the limits were tested and certified.
+    </p>
+
+    <p>
+      The limits noted as "<i>generally safe</i>" are not certified, but are considered safe
+      in typical environments. A safe range is not a hard limit, as unforeseen errors or
+      issues in your particular environment can affect the range.
+    </p>
+
+    <p outputclass="toc inpage"/>
+
+  </conbody>
+
+  <concept id="deployment_limits">
+
+    <title>Deployment Limits</title>
+
+    <conbody>
+
+      <ul>
+        <li>
+          Number of Impalad Executors
+          <ul>
+            <li>
+              80 nodes in CDH 5.14 and lower
+            </li>
+
+            <li>
+              150 nodes in CDH 5.15 and higher
+            </li>
+          </ul>
+        </li>
+
+        <li>
+          Number of Impalad Coordinators: at least 1 coordinator for every 50 executors
+          <p>
+            See
+            <xref
+              href="https://www.cloudera.com/documentation/enterprise/6/6.2/topics/impala_dedicated_coordinator.html#concept_vhv_4b1_n2b"
+              format="html" scope="external">Dedicated
+            Coordinators</xref> for details.
+          </p>
+        </li>
+
+        <li audience="hidden">
+          Max memory
+        </li>
+
+        <li audience="hidden">
+          Max number of CPU cores
+        </li>
+
+        <li audience="hidden">
+          Max number of disks
+        </li>
+      </ul>
+
+      <ul>
+        <li>
+          The number of Impala clusters per deployment
+          <ul>
+            <li>
+              1 Impala cluster in Impala 3.1 and lower
+            </li>
+
+            <li>
+              Running multiple clusters in Impala 3.2 and higher is <i>generally safe</i>.
+            </li>
+          </ul>
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="data_storage_limits">
+
+    <title>Data Storage Limits</title>
+
+    <conbody>
+
+      <p>
+        There are no hard limits for the following, but you will experience gradual performance
+        degradation as you increase these numbers.
+      </p>
+
+      <ul>
+        <li>
+          Number of databases
+        </li>
+
+        <li>
+          Number of tables - total, per database
+        </li>
+
+        <li>
+          Number of partitions - total, per table
+        </li>
+
+        <li>
+          Number of files - total, per table, per partition
+        </li>
+
+        <li>
+          Number of views - total, per database
+        </li>
+
+        <li>
+          Number of user-defined functions - total, per database
+        </li>
+
+        <li>
+          Parquet
+          <ul>
+            <li>
+              Number of columns per row group
+            </li>
+
+            <li>
+              Number of row groups per block
+            </li>
+
+            <li>
+              Number of HDFS blocks per file
+            </li>
+          </ul>
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="schema_design_limits">
+
+    <title>Schema Design Limits</title>
+
+    <conbody>
+
+      <ul>
+        <li>
+          Number of columns
+          <ul>
+            <li>
+              300 for Kudu tables
+              <p>
+                See
+                <xref
+                  href="https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_limitations.html"
+                  format="html" scope="external">Kudu
+                Usage Limitations</xref> for more information.
+              </p>
+            </li>
+
+            <li>
+              1000 for other types of tables
+            </li>
+          </ul>
+        </li>
+
+        <li audience="hidden">
+          Table and column name length
+        </li>
+
+        <li audience="hidden">
+          Maximum cell size
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="security_limits">
+
+    <title>Security Limits</title>
+
+    <conbody>
+
+      <ul>
+        <li>
+          Number of roles: 10,000 for Sentry
+        </li>
+
+        <li audience="hidden">
+          Number of columns used in column level ACL
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="ddl_limits" audience="hidden">
+
+    <title>Ingestion and DDL Limits</title>
+
+    <conbody>
+
+      <ul>
+        <li>
+          Number of DDL operations per minute
+        </li>
+
+        <li>
+          Number of concurrent DDL operations
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="query_compile_limits">
+
+    <title>Query Limits - Compile Time</title>
+
+    <conbody>
+
+      <ul>
+        <li>
+          Maximum number of columns in a query, whether included in a <codeph>SELECT</codeph>
+          list, an <codeph>INSERT</codeph> statement, or an expression: no limit
+        </li>
+
+        <li>
+          Number of tables referenced: no limit
+        </li>
+
+        <li>
+          Number of plan nodes: no limit
+        </li>
+
+        <li>
+          Number of plan fragments: no limit
+        </li>
+
+        <li>
+          Depth of expression tree: 1000 hard limit
+        </li>
+
+        <li>
+          Width of expression tree: 10,000 hard limit
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="query_runtime_limits">
+
+    <title>Query Limits - Runtime</title>
+
+    <conbody>
+
+      <ul>
+        <li audience="hidden">
+          Number of fragments and fragment instances
+        </li>
+
+        <li>
+          Codegen
+          <ul>
+            <li>
+              Very deeply nested expressions within queries can exceed internal Impala limits,
+              leading to excessive memory usage. Setting the query option
+              <codeph>disable_codegen=true</codeph> may reduce the impact, at a cost of longer
+              query runtime.
+            </li>
+          </ul>
+        </li>
+
+        <li audience="hidden">
+          Runtime Filter
+          <ul>
+            <li>
+              Max #filter
+            </li>
+
+            <li>
+              Max filter size
+            </li>
+          </ul>
+        </li>
+
+        <li audience="hidden">
+          Query Operators
+          <ul>
+            <li>
+              Scan
+            </li>
+
+            <li>
+              Join
+            </li>
+
+            <li>
+              Exchange
+            </li>
+
+            <li>
+              Agg
+            </li>
+
+            <li>
+              Sort
+            </li>
+
+            <li>
+              Merge
+            </li>
+          </ul>
+        </li>
+      </ul>
+
+    </conbody>
+
+  </concept>
+
+</concept>
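The dedicated-coordinator guideline in the deployment limits above (1 coordinator for every 50 executors) amounts to a quick sizing calculation. The sketch below is our own illustration, not an Impala API; the function name and constant are hypothetical.

```python
import math

# Sizing helper based on the documented guideline: at least 1 dedicated
# coordinator for every 50 executors. Names here are illustrative only.
EXECUTORS_PER_COORDINATOR = 50

def min_coordinators(num_executors):
    """Return the minimum number of coordinators for a given executor count."""
    return max(1, math.ceil(num_executors / EXECUTORS_PER_COORDINATOR))

print(min_coordinators(150))  # a 150-executor cluster needs 3 coordinators
```

For a cluster at the certified 150-node executor limit, this works out to 3 dedicated coordinators.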

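The compile-time hard limits above (expression tree depth of 1000, width of 10,000) can be illustrated with a rough client-side check before submitting a generated query. The helpers below, and the use of parenthesis nesting as a proxy for expression depth, are our own sketch, not part of Impala.

```python
# Illustrative only: build a left-nested OR chain, as a query generator might,
# and estimate its expression depth against the documented 1000 hard limit.
MAX_EXPR_DEPTH = 1000  # hard limit from the docs

def nested_or(column, values):
    """Build a left-nested OR chain: ((c = v1 OR c = v2) OR c = v3) ..."""
    expr = f"{column} = {values[0]}"
    for v in values[1:]:
        expr = f"({expr} OR {column} = {v})"
    return expr

def paren_depth(expr):
    """Maximum parenthesis nesting depth, a rough proxy for tree depth."""
    depth = best = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth -= 1
    return best

expr = nested_or("col", list(range(5)))
print(expr)                                # ((((col = 0 OR col = 1) OR ...
print(paren_depth(expr) <= MAX_EXPR_DEPTH) # True: depth 4, well under 1000
```

A generator that rewrites such chains as `col IN (...)` keeps both depth and width far below the limits.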

[impala] 01/04: [DOCS] Added back the note about Deflate not supported for text files.

Posted by to...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

todd pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit aaf1cf0e9dc473b8ddc5034a96427de87e12f085
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Wed May 15 17:22:36 2019 -0700

    [DOCS] Added back the note about Deflate not supported for text files.
    
    - The text was removed in impala-7107. Putting it back per Sahil's
    request.
    
    Change-Id: If44d3ad0653d73de030e8928077760a15ea18877
    Reviewed-on: http://gerrit.cloudera.org:8080/13350
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Sahil Takiar <st...@cloudera.com>
    Reviewed-by: Alex Rodoni <ar...@cloudera.com>
---
 docs/topics/impala_file_formats.xml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/topics/impala_file_formats.xml b/docs/topics/impala_file_formats.xml
index 7670ca3..b5f0d77 100644
--- a/docs/topics/impala_file_formats.xml
+++ b/docs/topics/impala_file_formats.xml
@@ -279,7 +279,9 @@ under the License.
         </dt>
 
         <dd>
-          <p></p>
+          <p>
+            Not supported for text files.
+          </p>
         </dd>
 
       </dlentry>