Posted to commits@impala.apache.org by ta...@apache.org on 2020/04/16 15:46:59 UTC

[impala] branch master updated (34018f6 -> dc410a2)

This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git.


    from 34018f6  IMPALA-9629: Add CentOS 8.1 support to bootstrap_system.sh
     new e863bac  IMPALA-9617: Skip tests that use Hive on non-HDFS filesystems
     new 21aa514  IMPALA-9616 [DOC]: Document spill to disk startup options
     new c97191b  IMPALA-9626: Use Python from the toolchain for Impala
     new dc410a2  IMPALA-9596: deflake test_tpch_mem_limit_single_node

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 be/src/exec/hdfs-orc-scanner.cc                    |  5 +--
 be/src/runtime/collection-value-builder-test.cc    |  5 ++-
 be/src/runtime/collection-value-builder.h          | 12 +++++-
 be/src/runtime/row-batch-serialize-test.cc         | 41 ++++++++++++--------
 bin/bootstrap_toolchain.py                         | 44 +++++++++++++++-------
 bin/impala-config.sh                               |  2 +
 bin/set-pythonpath.sh                              |  6 ++-
 docs/impala.ditamap                                |  5 ++-
 docs/topics/impala_disk_space.xml                  | 29 +++++++++++++-
 infra/python/bootstrap_virtualenv.py               | 35 +++++++++--------
 .../QueryTest/nested-types-tpch-errors.test        | 17 +++++++++
 .../nested-types-tpch-mem-limit-single-node.test   | 18 ---------
 tests/conftest.py                                  | 10 +++++
 tests/query_test/test_mt_dop.py                    |  3 ++
 tests/query_test/test_nested_types.py              | 24 ++++++------
 tests/query_test/test_scanners.py                  |  4 +-
 tests/query_test/test_scanners_fuzz.py             |  4 +-
 17 files changed, 177 insertions(+), 87 deletions(-)
 create mode 100644 testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-errors.test
 delete mode 100644 testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test


[impala] 03/04: IMPALA-9626: Use Python from the toolchain for Impala

Posted by ta...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit c97191b6a5c86f63af2a23d72c290d9f13387bd6
Author: Laszlo Gaal <la...@cloudera.com>
AuthorDate: Wed Mar 18 18:03:24 2020 +0100

    IMPALA-9626: Use Python from the toolchain for Impala
    
    Historically Impala used the Python 2 version that was available on
    the hosting platform, as long as that version was at least v2.6.
    This caused constant headaches, as all Python syntax had to be kept
    compatible with Python 2.6 (for CentOS 6). It also caused a recent
    problem on CentOS 8: there the system Python was compiled with the
    system's GCC version (v8.3), which is much more recent than the
    standard Impala compiler version (GCC 4.9.2). When the Impala
    virtualenv was built, the system Python supplied C compiler switches
    for modules containing native code that were unknown to the Impala
    version of GCC, thus breaking virtualenv installation.
    
    This patch changes the Impala virtualenv to always use the Python 2
    version from the toolchain, which is built with the toolchain compiler.
    
    This ensures that
    - Impala always has a known Python 2.7 version for all its scripts,
    - virtualenv modules based on native code will always be installable, as
      the Python environment and the modules are built with the same compiler
      version.
    
    Additional changes:
    - Add an auto-use fixture to conftest.py to check that the tests are
      being run with Python 2.7.x
    - Make bootstrap_toolchain.py independent of the Impala virtualenv:
      remove the dependency on the "sh" library
    
    Tests:
    - Passed core-mode tests on CentOS 7.4
    - Passed core-mode tests in Docker-based mode for centos:7
      and ubuntu:16.04
    
    Most of the content in this patch was developed earlier, but not
    published, by Tim Armstrong.
    
    Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
    Reviewed-on: http://gerrit.cloudera.org:8080/15624
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 bin/bootstrap_toolchain.py           | 44 +++++++++++++++++++++++++-----------
 bin/impala-config.sh                 |  2 ++
 bin/set-pythonpath.sh                |  6 ++++-
 infra/python/bootstrap_virtualenv.py | 35 +++++++++++++++-------------
 tests/conftest.py                    | 10 ++++++++
 5 files changed, 67 insertions(+), 30 deletions(-)

diff --git a/bin/bootstrap_toolchain.py b/bin/bootstrap_toolchain.py
index 69812e2..82362de 100755
--- a/bin/bootstrap_toolchain.py
+++ b/bin/bootstrap_toolchain.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env impala-python
+#!/usr/bin/env python
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements.  See the NOTICE file
 # distributed with this work for additional information
@@ -58,18 +58,12 @@
 #
 # The script is directly executable, and it takes no parameters:
 #     ./bootstrap_toolchain.py
-# It should NOT be run via 'python bootstrap_toolchain.py', as it relies on a specific
-# python environment.
 import logging
 import glob
 import multiprocessing.pool
 import os
 import random
 import re
-# TODO: This file should be runnable without using impala-python, and system python
-# does not have 'sh' available. Rework code to avoid importing sh (and anything else
-# that gets in the way).
-import sh
 import shutil
 import subprocess
 import sys
@@ -107,6 +101,26 @@ OS_MAPPING = [
 ]
 
 
+def check_output(cmd_args):
+  """Run the command and return the output. Raise an exception if the command returns
+     a non-zero return code. Similar to subprocess.check_output() which is only provided
+     in python 2.7.
+  """
+  process = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+  stdout, _ = process.communicate()
+  if process.wait() != 0:
+    raise Exception("Command with args '%s' failed with exit code %s:\n%s"
+        % (cmd_args, process.returncode, stdout))
+  return stdout
+
+
+def get_toolchain_compiler():
+  """Return the <name>-<version> string for the compiler package to use for the
+  toolchain."""
+  # Currently we always use GCC.
+  return "gcc-{0}".format(os.environ["IMPALA_GCC_VERSION"])
+
+
 def wget_and_unpack_package(download_path, file_name, destination, wget_no_clobber):
   if not download_path.endswith("/" + file_name):
     raise Exception("URL {0} does not match with expected file_name {1}"
@@ -117,7 +131,10 @@ def wget_and_unpack_package(download_path, file_name, destination, wget_no_clobb
       download_path, destination, file_name, attempt))
     # --no-clobber avoids downloading the file if a file with the name already exists
     try:
-      sh.wget(download_path, directory_prefix=destination, no_clobber=wget_no_clobber)
+      cmd = ["wget", download_path, "--directory-prefix={0}".format(destination)]
+      if wget_no_clobber:
+        cmd.append("--no-clobber")
+      check_output(cmd)
       break
     except Exception, e:
       if attempt == NUM_ATTEMPTS:
@@ -125,8 +142,9 @@ def wget_and_unpack_package(download_path, file_name, destination, wget_no_clobb
       logging.error("Download failed; retrying after sleep: " + str(e))
       time.sleep(10 + random.random() * 5)  # Sleep between 10 and 15 seconds.
   logging.info("Extracting {0}".format(file_name))
-  sh.tar(z=True, x=True, f=os.path.join(destination, file_name), directory=destination)
-  sh.rm(os.path.join(destination, file_name))
+  check_output(["tar", "xzf", os.path.join(destination, file_name),
+                "--directory={0}".format(destination)])
+  os.unlink(os.path.join(destination, file_name))
 
 
 class DownloadUnpackTarball(object):
@@ -241,7 +259,7 @@ class ToolchainPackage(EnvVersionedPackage):
       logging.error("Impala environment not set up correctly, make sure "
           "$IMPALA_TOOLCHAIN is set.")
       sys.exit(1)
-    compiler = "gcc-{0}".format(os.environ["IMPALA_GCC_VERSION"])
+    compiler = get_toolchain_compiler()
     label = get_platform_release_label(release=platform_release).toolchain
     toolchain_build_id = os.environ["IMPALA_TOOLCHAIN_BUILD_ID"]
     toolchain_host = os.environ["IMPALA_TOOLCHAIN_HOST"]
@@ -409,7 +427,8 @@ def get_platform_release_label(release=None):
     if lsb_release_cache:
       release = lsb_release_cache
     else:
-      release = "".join(map(lambda x: x.lower(), sh.lsb_release("-irs").split()))
+      lsb_release = check_output(["lsb_release", "-irs"])
+      release = "".join(map(lambda x: x.lower(), lsb_release.split()))
       # Only need to check against the major release if RHEL or CentOS
       for platform in ['centos', 'redhatenterpriseserver']:
         if platform in release:
@@ -419,7 +438,6 @@ def get_platform_release_label(release=None):
   for mapping in OS_MAPPING:
     if re.search(mapping.lsb_release, release):
       return mapping
-
   raise Exception("Could not find package label for OS version: {0}.".format(release))
 
 
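A minimal sketch of the sh-to-subprocess conversion above, using only the
Python 2.7 standard library (the URL and destination directory here are
hypothetical, not values from the patch):

    import os
    import subprocess

    def check_output(cmd_args):
      # Run the command, capturing stdout and stderr together; raise on a
      # non-zero exit code (mirrors the helper added in the diff above).
      process = subprocess.Popen(cmd_args, stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT)
      stdout, _ = process.communicate()
      if process.returncode != 0:
        raise Exception("Command '%s' failed with exit code %s:\n%s"
            % (cmd_args, process.returncode, stdout))
      return stdout

    # The former sh.wget()/sh.tar() calls become plain argv lists:
    destination = "/tmp/toolchain-scratch"
    check_output(["wget", "http://example.com/pkg.tar.gz",
                  "--directory-prefix={0}".format(destination), "--no-clobber"])
    check_output(["tar", "xzf", os.path.join(destination, "pkg.tar.gz"),
                  "--directory={0}".format(destination)])
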
diff --git a/bin/impala-config.sh b/bin/impala-config.sh
index 49278db..7dbbefd 100755
--- a/bin/impala-config.sh
+++ b/bin/impala-config.sh
@@ -135,6 +135,8 @@ export IMPALA_PROTOBUF_VERSION=3.5.1
 unset IMPALA_PROTOBUF_URL
 export IMPALA_POSTGRES_JDBC_DRIVER_VERSION=42.2.5
 unset IMPALA_POSTGRES_JDBC_DRIVER_URL
+export IMPALA_PYTHON_VERSION=2.7.16
+unset IMPALA_PYTHON_URL
 export IMPALA_RAPIDJSON_VERSION=1.1.0
 unset IMPALA_RAPIDJSON_URL
 export IMPALA_RE2_VERSION=20190301
diff --git a/bin/set-pythonpath.sh b/bin/set-pythonpath.sh
index 7bf8bf7..6b19b20 100755
--- a/bin/set-pythonpath.sh
+++ b/bin/set-pythonpath.sh
@@ -22,7 +22,9 @@
 # Setting USE_THRIFT11_GEN_PY will add Thrift 11 Python generated code rather than the
 # default Thrift Python code.
 # Used to allow importing testdata, test, etc modules from other scripts.
-export PYTHONPATH=${IMPALA_HOME}
+
+# ${IMPALA_HOME}/bin has bootstrap_toolchain.py, required by bootstrap_virtualenv.py
+export PYTHONPATH=${IMPALA_HOME}:${IMPALA_HOME}/bin
 
 # Generated Thrift files are used by tests and other scripts.
 if [ -n "${USE_THRIFT11_GEN_PY:-}" ]; then
@@ -31,6 +33,8 @@ else
   PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/shell/gen-py
 fi
 
+PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/infra/python/env/lib
+
 # There should be just a single version of python that created the
 # site-packages directory. We find it by performing shell independent expansion
 # of the following pattern:
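
The comment added in this diff is the key detail: bootstrap_virtualenv.py now
does "from bootstrap_toolchain import ToolchainPackage", and that module lives
in ${IMPALA_HOME}/bin. A sketch of the equivalent in-process setup, assuming
IMPALA_HOME is set in the environment:

    import os
    import sys

    # Put $IMPALA_HOME/bin on the module search path so bootstrap_toolchain
    # resolves, just as set-pythonpath.sh now arranges via PYTHONPATH.
    sys.path.insert(0, os.path.join(os.environ["IMPALA_HOME"], "bin"))
    from bootstrap_toolchain import ToolchainPackage
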
diff --git a/infra/python/bootstrap_virtualenv.py b/infra/python/bootstrap_virtualenv.py
index 27c527f..cccdfe0 100644
--- a/infra/python/bootstrap_virtualenv.py
+++ b/infra/python/bootstrap_virtualenv.py
@@ -46,6 +46,7 @@ import tarfile
 import tempfile
 import textwrap
 import urllib
+from bootstrap_toolchain import ToolchainPackage
 
 LOG = logging.getLogger(os.path.splitext(os.path.basename(__file__))[0])
 
@@ -83,7 +84,7 @@ def create_virtualenv():
   for member in file.getmembers():
     file.extract(member, build_dir)
   file.close()
-  python_cmd = detect_python_cmd()
+  python_cmd = download_toolchain_python()
   exec_cmd([python_cmd, find_file(build_dir, "virtualenv*", "virtualenv.py"), "--quiet",
       "--python", python_cmd, ENV_DIR])
   shutil.rmtree(build_dir)
@@ -189,21 +190,23 @@ def find_file(*paths):
   return files[0]
 
 
-def detect_python_cmd():
-  '''Returns the system command that provides python 2.6 or greater.'''
-  paths = os.getenv("PATH").split(os.path.pathsep)
-  for cmd in ("python", "python27", "python2.7", "python-27", "python-2.7", "python26",
-      "python2.6", "python-26", "python-2.6"):
-    for path in paths:
-      cmd_path = os.path.join(path, cmd)
-      if not os.path.exists(cmd_path) or not os.access(cmd_path, os.X_OK):
-        continue
-      exit = subprocess.call([cmd_path, "-c", textwrap.dedent("""
-          import sys
-          sys.exit(int(sys.version_info[:2] < (2, 6)))""")])
-      if exit == 0:
-        return cmd_path
-  raise Exception("Could not find minimum required python version 2.6")
+def download_toolchain_python():
+  '''Grabs the Python implementation from the Impala toolchain, using the machinery from
+     bin/bootstrap_toolchain.py
+  '''
+
+  toolchain_root = os.environ.get("IMPALA_TOOLCHAIN")
+  if not toolchain_root:
+    raise Exception(
+        "Impala environment not set up correctly, make sure $IMPALA_TOOLCHAIN is set.")
+
+  package = ToolchainPackage("python")
+  package.download()
+  python_cmd = os.path.join(package.pkg_directory(), "bin/python")
+  if not os.path.exists(python_cmd):
+    raise Exception("Unexpected error bootstrapping python from toolchain: {0} does not "
+                    "exist".format(python_cmd))
+  return python_cmd
 
 
 def install_deps():
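
A sketch of what the new bootstrap path yields at runtime, assuming the
download succeeded and download_toolchain_python() (above) is in scope; the
package layout under $IMPALA_TOOLCHAIN is an assumption for illustration:

    import subprocess

    # Returns a path such as $IMPALA_TOOLCHAIN/python-2.7.16/bin/python.
    python_cmd = download_toolchain_python()
    # The virtualenv is then built with this interpreter, so every script
    # sees a known 2.7.x regardless of the system Python.
    version = subprocess.check_output([python_cmd, "-c",
        "import sys; print('.'.join(map(str, sys.version_info[:3])))"])
    assert version.strip().startswith("2.7")
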
diff --git a/tests/conftest.py b/tests/conftest.py
index 21438c7..d84f8e3 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -26,6 +26,7 @@ import contextlib
 import logging
 import os
 import pytest
+import sys
 
 import tests.common
 from impala_py_lib.helpers import find_all_files, is_core_dump
@@ -609,6 +610,15 @@ def cluster_properties():
   yield cluster_properties
 
 
+@pytest.fixture(autouse=True, scope='session')
+def validate_python_version():
+  """Check the Python runtime version before running any tests. Since Impala switched
+     to the toolchain Python, which is at least v2.7, the tests will not run on a version
+     below that.
+  """
+  assert sys.version_info > (2, 7), "Tests only support Python 2.7+"
+
+
 @pytest.hookimpl(trylast=True)
 def pytest_collection_modifyitems(items, config, session):
   """Hook to handle --shard_tests command line option.

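One subtlety in the fixture above: sys.version_info is compared as a tuple,
so any concrete 2.7.x interpreter passes the strict ">" check because its
full five-element version tuple sorts after the two-element (2, 7). A quick
illustration:

    # On CPython 2.7.16, sys.version_info == (2, 7, 16, 'final', 0).
    # Tuple comparison is elementwise, and a longer tuple that matches on
    # the common prefix sorts after the shorter one:
    assert (2, 7, 16, 'final', 0) > (2, 7)
    # A 2.6 interpreter fails the check at the second element:
    assert not ((2, 6, 9, 'final', 0) > (2, 7))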

[impala] 04/04: IMPALA-9596: deflake test_tpch_mem_limit_single_node

Posted by ta...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit dc410a2cf47bcf06a0f4563d05a9d0a339af5fb2
Author: Tim Armstrong <ta...@cloudera.com>
AuthorDate: Thu Apr 9 16:11:42 2020 -0700

    IMPALA-9596: deflake test_tpch_mem_limit_single_node
    
    This changes the test to use a debug action instead of
    trying to hit the memory limit in the right spot, which
    has tended to be flaky. This still exercises the error
    handling code in the scanner, which was the original
    point of the test (see IMPALA-2376).
    
    This revealed an actual bug in the ORC scanner, where
    it was not returning the error directly from
    AssembleCollection(). Before I fixed that, the scanner
    got stuck in an infinite loop when running the test.
    
    Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
    Reviewed-on: http://gerrit.cloudera.org:8080/15700
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Gabor Kaszab <ga...@cloudera.com>
---
 be/src/exec/hdfs-orc-scanner.cc                    |  5 ++-
 be/src/runtime/collection-value-builder-test.cc    |  5 ++-
 be/src/runtime/collection-value-builder.h          | 12 ++++++-
 be/src/runtime/row-batch-serialize-test.cc         | 41 +++++++++++++---------
 .../QueryTest/nested-types-tpch-errors.test        | 17 +++++++++
 .../nested-types-tpch-mem-limit-single-node.test   | 18 ----------
 tests/query_test/test_nested_types.py              | 17 +++------
 7 files changed, 63 insertions(+), 52 deletions(-)

diff --git a/be/src/exec/hdfs-orc-scanner.cc b/be/src/exec/hdfs-orc-scanner.cc
index 05e1198..0b1a599 100644
--- a/be/src/exec/hdfs-orc-scanner.cc
+++ b/be/src/exec/hdfs-orc-scanner.cc
@@ -813,9 +813,8 @@ Status HdfsOrcScanner::AssembleCollection(
 
     int64_t num_rows;
     // We're assembling item tuples into an CollectionValue
-    parse_status_ =
-        GetCollectionMemory(coll_value_builder, &pool, &tuple, &row, &num_rows);
-    if (UNLIKELY(!parse_status_.ok())) break;
+    RETURN_IF_ERROR(
+        GetCollectionMemory(coll_value_builder, &pool, &tuple, &row, &num_rows));
     // 'num_rows' can be very high if we're writing to a large CollectionValue. Limit
     // the number of rows we read at one time so we don't spend too long in the
     // 'num_rows' loop below before checking for cancellation or limit reached.
diff --git a/be/src/runtime/collection-value-builder-test.cc b/be/src/runtime/collection-value-builder-test.cc
index af710ce..c1bde10 100644
--- a/be/src/runtime/collection-value-builder-test.cc
+++ b/be/src/runtime/collection-value-builder-test.cc
@@ -33,6 +33,9 @@ static scoped_ptr<Frontend> fe;
 TEST(CollectionValueBuilderTest, MaxBufferSize) {
   TestEnv test_env;
   ASSERT_OK(test_env.Init());
+  TQueryOptions opts;
+  RuntimeState* runtime_state;
+  ASSERT_OK(test_env.CreateQueryState(1234, &opts, &runtime_state));
   ObjectPool obj_pool;
   DescriptorTblBuilder builder(fe.get(), &obj_pool);
   builder.DeclareTuple() << TYPE_TINYINT << TYPE_TINYINT << TYPE_TINYINT;
@@ -51,7 +54,7 @@ TEST(CollectionValueBuilderTest, MaxBufferSize) {
   MemTracker tracker(mem_limit);
   MemPool pool(&tracker);
   CollectionValueBuilder coll_value_builder(
-      &coll_value, tuple_desc, &pool, NULL, initial_capacity);
+      &coll_value, tuple_desc, &pool, runtime_state, initial_capacity);
   EXPECT_EQ(tracker.consumption(), initial_capacity * 4);
 
   // Attempt to double the buffer so it goes over 32-bit INT_MAX.
diff --git a/be/src/runtime/collection-value-builder.h b/be/src/runtime/collection-value-builder.h
index ba9ddd6..e7a47bb 100644
--- a/be/src/runtime/collection-value-builder.h
+++ b/be/src/runtime/collection-value-builder.h
@@ -20,6 +20,7 @@
 
 #include "runtime/collection-value.h"
 #include "runtime/mem-tracker.h"
+#include "runtime/runtime-state.h"
 #include "runtime/tuple.h"
 #include "util/debug-util.h"
 #include "util/ubsan.h"
@@ -40,7 +41,8 @@ class CollectionValueBuilder {
     : coll_value_(coll_value),
       tuple_desc_(tuple_desc),
       pool_(pool),
-      state_(state) {
+      state_(state),
+      have_debug_action_(!state->query_options().debug_action.empty()) {
     buffer_size_ = initial_tuple_capacity * tuple_desc_.byte_size();
     coll_value_->ptr = pool_->TryAllocate(buffer_size_);
     if (coll_value_->ptr == NULL) buffer_size_ = 0;
@@ -60,6 +62,10 @@ class CollectionValueBuilder {
       int64_t bytes_written = coll_value_->ByteSize(tuple_desc_);
       DCHECK_GE(buffer_size_, bytes_written);
       if (buffer_size_ == bytes_written) {
+        if (UNLIKELY(have_debug_action_)) {
+          RETURN_IF_ERROR(
+              DebugAction(state_->query_options(), "SCANNER_COLLECTION_ALLOC"));
+        }
         // Double tuple buffer
         int64_t new_buffer_size =
             std::max<int64_t>(buffer_size_ * 2, tuple_desc_.byte_size());
@@ -107,6 +113,10 @@ class CollectionValueBuilder {
   /// May be NULL. If non-NULL, used to log memory limit errors.
   RuntimeState* state_;
 
+  /// Whether 'state_' has a debug action set. Used to reduce overhead of
+  /// the check that is run once per collection.
+  const bool have_debug_action_;
+
   /// The current size of coll_value_'s buffer in bytes, including any unused space
   /// (i.e. buffer_size_ is equal to or larger than coll_value_->ByteSize()).
   int64_t buffer_size_;
diff --git a/be/src/runtime/row-batch-serialize-test.cc b/be/src/runtime/row-batch-serialize-test.cc
index fcac615..99a09ea 100644
--- a/be/src/runtime/row-batch-serialize-test.cc
+++ b/be/src/runtime/row-batch-serialize-test.cc
@@ -25,6 +25,7 @@
 #include "runtime/raw-value.h"
 #include "runtime/raw-value.inline.h"
 #include "runtime/row-batch.h"
+#include "runtime/test-env.h"
 #include "runtime/tuple-row.h"
 #include "service/fe-support.h"
 #include "service/frontend.h"
@@ -47,20 +48,28 @@ class RowBatchSerializeTest : public testing::Test {
   ObjectPool pool_;
   scoped_ptr<MemTracker> tracker_;
 
-  // For computing tuple mem layouts.
-  scoped_ptr<Frontend> fe_;
+  scoped_ptr<TestEnv> test_env_;
+  RuntimeState* runtime_state_ = nullptr;
+
+  TQueryOptions dummy_query_opts_;
 
   virtual void SetUp() {
-    fe_.reset(new Frontend());
+    test_env_.reset(new TestEnv);
+    ASSERT_OK(test_env_->Init());
     tracker_.reset(new MemTracker());
+    ASSERT_OK(test_env_->CreateQueryState(1234, &dummy_query_opts_, &runtime_state_));
   }
 
   virtual void TearDown() {
     pool_.Clear();
     tracker_.reset();
-    fe_.reset();
+    test_env_.reset();
+    runtime_state_ = nullptr;
   }
 
+  /// Helper to get frontend from 'test_env_'.
+  Frontend* frontend() const { return test_env_->exec_env()->frontend(); }
+
   // Serializes and deserializes 'batch', then checks that the deserialized batch is valid
   // and has the same contents as 'batch'. If serialization returns an error (e.g. if the
   // row batch is too large to serialize), this will return that error.
@@ -104,7 +113,7 @@ class RowBatchSerializeTest : public testing::Test {
     // tuple: (int, string, string, string)
     // This uses three strings so that this test can reach INT_MAX+1 without any
     // single string exceeding the 1GB limit on string length (see string-value.h).
-    DescriptorTblBuilder builder(fe_.get(), &pool_);
+    DescriptorTblBuilder builder(frontend(), &pool_);
     builder.DeclareTuple() << TYPE_INT << TYPE_STRING << TYPE_STRING << TYPE_STRING;
     DescriptorTbl* desc_tbl = builder.Build();
 
@@ -253,7 +262,7 @@ class RowBatchSerializeTest : public testing::Test {
         const TupleDescriptor* item_desc = slot_desc.collection_item_descriptor();
         int array_len = rand() % (MAX_ARRAY_LEN + 1);
         CollectionValue cv;
-        CollectionValueBuilder builder(&cv, *item_desc, pool, NULL, array_len);
+        CollectionValueBuilder builder(&cv, *item_desc, pool, runtime_state_, array_len);
         Tuple* tuple_mem;
         int n;
         EXPECT_OK(builder.GetFreeMemory(&tuple_mem, &n));
@@ -380,7 +389,7 @@ class RowBatchSerializeTest : public testing::Test {
 
 TEST_F(RowBatchSerializeTest, Basic) {
   // tuple: (int)
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT;
   DescriptorTbl* desc_tbl = builder.Build();
 
@@ -395,7 +404,7 @@ TEST_F(RowBatchSerializeTest, Basic) {
 
 TEST_F(RowBatchSerializeTest, String) {
   // tuple: (int, string)
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT << TYPE_STRING;
   DescriptorTbl* desc_tbl = builder.Build();
 
@@ -441,7 +450,7 @@ TEST_F(RowBatchSerializeTest, BasicArray) {
   array_type.type = TYPE_ARRAY;
   array_type.children.push_back(TYPE_INT);
 
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT << TYPE_STRING << array_type;
   DescriptorTbl* desc_tbl = builder.Build();
 
@@ -469,7 +478,7 @@ TEST_F(RowBatchSerializeTest, StringArray) {
   array_type.type = TYPE_ARRAY;
   array_type.children.push_back(struct_type);
 
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT << TYPE_STRING << array_type;
   DescriptorTbl* desc_tbl = builder.Build();
 
@@ -510,7 +519,7 @@ TEST_F(RowBatchSerializeTest, NestedArrays) {
   array_type.type = TYPE_ARRAY;
   array_type.children.push_back(struct_type);
 
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << array_type;
   DescriptorTbl* desc_tbl = builder.Build();
 
@@ -534,7 +543,7 @@ TEST_F(RowBatchSerializeTest, DupCorrectnessFull) {
 
 void RowBatchSerializeTest::TestDupCorrectness(bool full_dedup) {
   // tuples: (int), (string)
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT;
   builder.DeclareTuple() << TYPE_STRING;
   DescriptorTbl* desc_tbl = builder.Build();
@@ -575,7 +584,7 @@ TEST_F(RowBatchSerializeTest, DupRemovalFull) {
 // Test that tuple deduplication results in the expected reduction in serialized size.
 void RowBatchSerializeTest::TestDupRemoval(bool full_dedup) {
   // tuples: (int, string)
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT << TYPE_STRING;
   DescriptorTbl* desc_tbl = builder.Build();
 
@@ -614,7 +623,7 @@ TEST_F(RowBatchSerializeTest, ConsecutiveNullsFull) {
 // Test that deduplication handles NULL tuples correctly.
 void RowBatchSerializeTest::TestConsecutiveNulls(bool full_dedup) {
   // tuples: (int)
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT;
   DescriptorTbl* desc_tbl = builder.Build();
   vector<bool> nullable_tuples(1, true);
@@ -642,7 +651,7 @@ TEST_F(RowBatchSerializeTest, ZeroLengthTuplesDedup) {
 
 void RowBatchSerializeTest::TestZeroLengthTuple(bool full_dedup) {
   // tuples: (int), (string), ()
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT;
   builder.DeclareTuple() << TYPE_STRING;
   builder.DeclareTuple();
@@ -669,7 +678,7 @@ TEST_F(RowBatchSerializeTest, DedupPathologicalFull) {
   ColumnType array_type;
   array_type.type = TYPE_ARRAY;
   array_type.children.push_back(TYPE_STRING);
-  DescriptorTblBuilder builder(fe_.get(), &pool_);
+  DescriptorTblBuilder builder(frontend(), &pool_);
   builder.DeclareTuple() << TYPE_INT;
   builder.DeclareTuple() << TYPE_INT;
   builder.DeclareTuple() << array_type;
diff --git a/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-errors.test b/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-errors.test
new file mode 100644
index 0000000..2f114d4
--- /dev/null
+++ b/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-errors.test
@@ -0,0 +1,17 @@
+====
+---- QUERY
+# IMPALA-2376: test error handling when hitting memory limit during allocation of
+# a collection in the scanner. Use debug action to make the failure deterministic
+# (when setting the real mem_limit, it tends to be non-deterministic where in query
+# execution the error is hit).
+set debug_action="SCANNER_COLLECTION_ALLOC:FAIL@1.0";
+select max(cnt1), max(cnt2), max(cnt3), max(cnt4), max(cnt5)
+from customer c,
+  (select count(l_returnflag) cnt1, count(l_partkey) cnt2, count(l_suppkey) cnt3,
+          count(l_linenumber) cnt4, count(l_quantity) cnt5
+   from c.c_orders.o_lineitems) v;
+---- TYPES
+BIGINT
+---- CATCH
+Debug Action: SCANNER_COLLECTION_ALLOC:FAIL@1.0
+====
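
The debug_action value above follows the LABEL:ACTION@probability format, so
SCANNER_COLLECTION_ALLOC:FAIL@1.0 fails the labeled allocation check every
time it is reached. A sketch of driving the same option from a standalone
client, assuming a local impalad and the impyla package (host, port, and
table names here are illustrative):

    from impala.dbapi import connect

    conn = connect(host="localhost", port=21050)
    cur = conn.cursor()
    # Make the collection-buffer growth path fail deterministically.
    cur.execute('set debug_action="SCANNER_COLLECTION_ALLOC:FAIL@1.0"')
    try:
        cur.execute("select count(*) from tpch_nested_parquet.customer c, "
                    "c.c_orders.o_lineitems")
    except Exception as e:
        assert "SCANNER_COLLECTION_ALLOC" in str(e)
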
diff --git a/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test b/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
deleted file mode 100644
index 46d2cf8..0000000
--- a/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
+++ /dev/null
@@ -1,18 +0,0 @@
-====
----- QUERY
-# IMPALA-2376: run scan that constructs large collection and set memory limit low enough
-# to get the below query to consistently fail when allocating a large collection. Set
-# num_nodes to 1 in the python test and mt_dop to 1 here in order to make the query as
-# deterministic as possible. mem_limit is tuned for a 3-node HDFS minicluster.
-set buffer_pool_limit=24m;
-set mt_dop=1;
-select max(cnt1), max(cnt2), max(cnt3), max(cnt4), max(cnt5)
-from customer c,
-  (select count(l_returnflag) cnt1, count(l_partkey) cnt2, count(l_suppkey) cnt3,
-          count(l_linenumber) cnt4, count(l_quantity) cnt5
-   from c.c_orders.o_lineitems) v;
----- TYPES
-BIGINT
----- CATCH
-row_regex: .*Memory limit exceeded: Failed to allocate [0-9]+ bytes for collection 'tpch_nested_.*.customer.c_orders.item.o_lineitems'.*
-====
diff --git a/tests/query_test/test_nested_types.py b/tests/query_test/test_nested_types.py
index c95e082..1f9e1a4 100644
--- a/tests/query_test/test_nested_types.py
+++ b/tests/query_test/test_nested_types.py
@@ -144,20 +144,11 @@ class TestNestedTypesNoMtDop(ImpalaTestSuite):
     self.run_test_case('QueryTest/nested-types-tpch-mem-limit', vector,
                        use_db='tpch_nested' + db_suffix)
 
-  @SkipIfNotHdfsMinicluster.tuned_for_minicluster
-  def test_tpch_mem_limit_single_node(self, vector):
-    """Queries over the larger nested TPCH dataset with memory limits tuned for
-    a 3-node HDFS minicluster with num_nodes=1."""
-    new_vector = deepcopy(vector)
-    new_vector.get_value('exec_option')['num_nodes'] = 1
-    if vector.get_value('table_format').file_format == 'orc':
-      # IMPALA-8336: lower memory limit for ORC
-      new_vector.get_value('exec_option')['mem_limit'] = '20M'
-    else:
-      new_vector.get_value('exec_option')['mem_limit'] = '28M'
+  def test_tpch_errors(self, vector):
+    """Queries that test error handling on the TPC-H nested data set."""
     db_suffix = vector.get_value('table_format').db_suffix()
-    self.run_test_case('QueryTest/nested-types-tpch-mem-limit-single-node',
-                       new_vector, use_db='tpch_nested' + db_suffix)
+    self.run_test_case('QueryTest/nested-types-tpch-errors',
+                       vector, use_db='tpch_nested' + db_suffix)
 
   @SkipIfEC.fix_later
   def test_parquet_stats(self, vector):


[impala] 01/04: IMPALA-9617: Skip tests that use Hive on non-HDFS filesystems

Posted by ta...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit e863bac074f70b9d063cd7e53e46b3065b99edcb
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Wed Apr 15 11:38:23 2020 +0200

    IMPALA-9617: Skip tests that use Hive on non-HDFS filesystems
    
    Some tests are flaky due to timeouts in Hive queries on non-HDFS
    filesystems. Until IMPALA-9365 is resolved, we only run these tests
    when the target filesystem is HDFS.
    
    Change-Id: I50fe92801e6e0f0ad8e169ec91ca4a8530088b7f
    Reviewed-on: http://gerrit.cloudera.org:8080/15736
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 tests/query_test/test_mt_dop.py        | 3 +++
 tests/query_test/test_nested_types.py  | 7 +++++++
 tests/query_test/test_scanners.py      | 4 +++-
 tests/query_test/test_scanners_fuzz.py | 4 +++-
 4 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/tests/query_test/test_mt_dop.py b/tests/query_test/test_mt_dop.py
index da32e6f..98a3a3c 100644
--- a/tests/query_test/test_mt_dop.py
+++ b/tests/query_test/test_mt_dop.py
@@ -26,6 +26,7 @@ from tests.common.impala_test_suite import ImpalaTestSuite
 from tests.common.kudu_test_suite import KuduTestSuite
 from tests.common.skip import SkipIfABFS, SkipIfEC, SkipIfNotHdfsMinicluster
 from tests.common.test_vector import ImpalaTestDimension
+from tests.util.filesystem_utils import IS_HDFS
 
 WAIT_TIME_MS = build_flavor_timeout(60000, slow_build_timeout=100000)
 
@@ -68,6 +69,8 @@ class TestMtDop(ImpalaTestSuite):
         "create external table %s like functional_hbase.alltypes" % fq_table_name)
       expected_results = "Updated 1 partition(s) and 13 column(s)."
     elif HIVE_MAJOR_VERSION == 3 and file_format == 'orc':
+      # TODO: Enable this test on non-HDFS filesystems once IMPALA-9365 is resolved.
+      if not IS_HDFS: pytest.skip()
       self.run_stmt_in_hive(
           "create table %s like functional_orc_def.alltypes" % fq_table_name)
       self.run_stmt_in_hive(
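
The guard added above is the standard pytest idiom for an environment-
dependent skip. A self-contained sketch of the pattern (IS_HDFS here is a
stand-in for tests.util.filesystem_utils.IS_HDFS; deriving it from a
TARGET_FILESYSTEM environment variable is an assumption for illustration):

    import os

    import pytest

    IS_HDFS = os.getenv("TARGET_FILESYSTEM", "hdfs") == "hdfs"

    def test_orc_table_created_by_hive():
      # Hive queries can time out on non-HDFS filesystems (IMPALA-9365),
      # so exercise the Hive interop path only against HDFS.
      if not IS_HDFS:
        pytest.skip("waiting on IMPALA-9365 for non-HDFS filesystems")
      # ... the rest of the test would run Hive statements here ...
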
diff --git a/tests/query_test/test_nested_types.py b/tests/query_test/test_nested_types.py
index dba4dee..c95e082 100644
--- a/tests/query_test/test_nested_types.py
+++ b/tests/query_test/test_nested_types.py
@@ -220,6 +220,13 @@ class TestNestedTypesNoMtDop(ImpalaTestSuite):
     self.run_test_case('QueryTest/nested-types-basic-partitioned', vector,
         unique_database)
 
+  # Skip this test on non-HDFS filesystems, because the test contains Hive
+  # queries that hang in some cases due to IMPALA-9365.
+  @SkipIfABFS.hive
+  @SkipIfADLS.hive
+  @SkipIfIsilon.hive
+  @SkipIfLocal.hive
+  @SkipIfS3.hive
   @SkipIfHive2.acid
   def test_partitioned_table_acid(self, vector, unique_database):
     """IMPALA-6370: Test that a partitioned table with nested types can be scanned."""
diff --git a/tests/query_test/test_scanners.py b/tests/query_test/test_scanners.py
index 9287071..62255ad 100644
--- a/tests/query_test/test_scanners.py
+++ b/tests/query_test/test_scanners.py
@@ -55,7 +55,7 @@ from tests.common.test_result_verifier import (
     QueryTestResult,
     parse_result_rows)
 from tests.common.test_vector import ImpalaTestDimension
-from tests.util.filesystem_utils import WAREHOUSE, get_fs_path
+from tests.util.filesystem_utils import IS_HDFS, WAREHOUSE, get_fs_path
 from tests.util.hdfs_util import NAMENODE
 from tests.util.get_parquet_metadata import get_parquet_metadata
 from tests.util.parse_util import get_bytes_summary_stats_counter
@@ -203,6 +203,8 @@ class TestUnmatchedSchema(ImpalaTestSuite):
     self._drop_test_table(vector)
     file_format = vector.get_value('table_format').file_format
     if file_format == 'orc':
+      # TODO: Enable this test on non-HDFS filesystems once IMPALA-9365 is resolved.
+      if not IS_HDFS: pytest.skip()
       db_name = "functional" + vector.get_value('table_format').db_suffix()
       self.run_stmt_in_hive(
           "create table %s.jointbl_test like functional.jointbl "
diff --git a/tests/query_test/test_scanners_fuzz.py b/tests/query_test/test_scanners_fuzz.py
index a4b16c1..73d734b 100644
--- a/tests/query_test/test_scanners_fuzz.py
+++ b/tests/query_test/test_scanners_fuzz.py
@@ -28,7 +28,7 @@ from subprocess import check_call
 from tests.common.environ import HIVE_MAJOR_VERSION
 from tests.common.test_dimensions import create_exec_option_dimension_from_dict
 from tests.common.impala_test_suite import ImpalaTestSuite, LOG
-from tests.util.filesystem_utils import WAREHOUSE, get_fs_path
+from tests.util.filesystem_utils import IS_HDFS, WAREHOUSE, get_fs_path
 from tests.util.test_file_parser import QueryTestSectionReader
 
 # Random fuzz testing of HDFS scanners. Existing tables for any HDFS file format
@@ -176,6 +176,8 @@ class TestScannersFuzzing(ImpalaTestSuite):
 
     table_format = vector.get_value('table_format')
     if HIVE_MAJOR_VERSION == 3 and table_format.file_format == 'orc':
+      # TODO: Enable this test on non-HDFS filesystems once IMPALA-9365 is resolved.
+      if not IS_HDFS: pytest.skip()
       self.run_stmt_in_hive("create table %s.%s like %s.%s" % (fuzz_db, fuzz_table,
           src_db, src_table))
       self.run_stmt_in_hive("insert into %s.%s select * from %s.%s" % (fuzz_db,


[impala] 02/04: IMPALA-9616 [DOC]: Document spill to disk startup options

Posted by ta...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 21aa51435328fff2343430d26417f67c5586b7e0
Author: Kris Hahn <kh...@cloudera.com>
AuthorDate: Wed Apr 8 17:37:06 2020 -0700

    IMPALA-9616 [DOC]: Document spill to disk startup options
    
    Documented the startup options per review comments:
    --Covered the spill-to-disk compression support
    --Noted that disk_spill_punch_holes is required with compression
    Included examples for review, plus minor edits.
    Change-Id: I3694fe97d74697777a8d50288b406b8eca0aa9fb
    Reviewed-on: http://gerrit.cloudera.org:8080/15692
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Tim Armstrong <ta...@cloudera.com>
---
 docs/impala.ditamap               |  5 +++--
 docs/topics/impala_disk_space.xml | 29 ++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index c5b3191..9407167 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -194,7 +194,9 @@ under the License.
           <topicref href="topics/impala_enable_expr_rewrites.xml"/>
           <topicref href="topics/impala_exec_single_node_rows_threshold.xml"/>
           <topicref href="topics/impala_exec_time_limit_s.xml"/>
-          <topicref href="topics/impala_explain_level.xml"/>
+          <topicref href="topics/impala_explain_level.xml">
+            <topicref rev="2.5.0" href="topics/impala_max_num_runtime_filters.xml"/>
+          </topicref>
           <topicref href="topics/impala_fetch_rows_timeout_ms.xml"/>
           <topicref href="topics/impala_hbase_cache_blocks.xml"/>
           <topicref href="topics/impala_hbase_caching.xml"/>
@@ -204,7 +206,6 @@ under the License.
           <topicref href="topics/impala_live_summary.xml"/>
           <topicref href="topics/impala_max_errors.xml"/>
           <topicref rev="3.1 IMPALA-6847" href="topics/impala_max_mem_estimate_for_admission.xml"/>
-          <topicref rev="2.5.0" href="topics/impala_max_num_runtime_filters.xml"/>
           <topicref href="topics/impala_max_result_spooling_mem.xml"/>
           <topicref rev="2.10.0 IMPALA-3200" href="topics/impala_max_row_size.xml"/>
           <topicref href="topics/impala_max_scan_range_length.xml"/>
diff --git a/docs/topics/impala_disk_space.xml b/docs/topics/impala_disk_space.xml
index 904e1d4..d1c4ca4 100644
--- a/docs/topics/impala_disk_space.xml
+++ b/docs/topics/impala_disk_space.xml
@@ -276,7 +276,34 @@ under the License.
       </p>
 
     </section>
-
+    <section>
+      <title>Increasing Scratch Capacity</title>
+      <p> You can compress the data spilled to disk to increase the effective scratch capacity;
+        compression typically more than doubles the effective capacity and reduces the amount of
+        data written to disk. Use the --disk_spill_compression_codec and --disk_spill_punch_holes
+        startup options. The --disk_spill_compression_codec option takes any value supported by
+        the COMPRESSION_CODEC query option; the value is not case-sensitive. A value of
+        <codeph>ZSTD</codeph> or <codeph>LZ4</codeph> is recommended (the default is NONE).</p>
+      <p>For example:</p>
+<codeblock>--disk_spill_compression_codec=LZ4
+--disk_spill_punch_holes=true
+</codeblock>
+      <p>
+        If you set <codeph>--disk_spill_compression_codec</codeph> to a value other than <codeph>NONE</codeph>, you must set <codeph>--disk_spill_punch_holes</codeph> to true.
+      </p>
+      <p>
+        The hole punching feature supported by many filesystems is used to reclaim space in scratch files during execution
+        of a query that spills to disk. This results in lower scratch space requirements in many cases, especially when
+        combined with disk spill compression. When this option is not enabled, scratch space is still recycled by a query,
+        but less effectively in many cases.
+      </p>
+      <p> You can specify a compression level for <codeph>ZSTD</codeph> only. For example: </p>
+<codeblock>--disk_spill_compression_codec=ZSTD:10
+--disk_spill_punch_holes=true
+</codeblock>
+      <p> Compression levels from 1 up to 22 (default 3) are supported for <codeph>ZSTD</codeph>.
+        Lower compression levels are faster, at the cost of a lower compression ratio.</p>
+    </section>
   </conbody>
 
 </concept>
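
For readers wondering what the hole punching feature means at the filesystem
level: on Linux, fallocate(2) with FALLOC_FL_PUNCH_HOLE deallocates a byte
range in place, so scratch data that has already been read back can release
its disk blocks while the file stays open at the same size. This is an
illustration of the underlying OS feature, not Impala's scratch-file code;
the path below is hypothetical:

    import ctypes
    import ctypes.util
    import os

    FALLOC_FL_KEEP_SIZE = 0x01   # keep the apparent file size unchanged
    FALLOC_FL_PUNCH_HOLE = 0x02  # deallocate the given byte range

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

    def punch_hole(fd, offset, length):
      # Release the physical blocks backing [offset, offset + length).
      ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           ctypes.c_int64(offset), ctypes.c_int64(length))
      if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

    # Write 1 MiB of scratch data, then reclaim its blocks: st_size is
    # unchanged afterwards, but st_blocks drops to (near) zero.
    fd = os.open("/tmp/scratch.demo", os.O_CREAT | os.O_RDWR, 0o600)
    os.write(fd, b"x" * (1 << 20))
    punch_hole(fd, 0, 1 << 20)
    os.close(fd)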