Posted to commits@impala.apache.org by jo...@apache.org on 2018/04/16 18:27:47 UTC

[1/7] impala git commit: Fix test_query_concurrency exception handling.

Repository: impala
Updated Branches:
  refs/heads/master ce09269fd -> 5960d1b36


Fix test_query_concurrency exception handling.

Fixes use of an undefined variable.

I saw the following message in a build failure, which
clearly wasn't intended:

  MainThread: Debug webpage not yet available.
  Exception in thread Thread-862:
  Traceback (most recent call last):
    File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
      self.run()
    File "/usr/lib64/python2.7/threading.py", line 764, in run
      self.__target(*self.__args, **self.__kwargs)
    File "/data/jenkins/workspace/impala-asf-2.x-exhaustive-rhel7/repos/Impala/tests/custom_cluster/test_query_concurrency.py", line 58, in poll_query_page
      except e:
  NameError: global name 'e' is not defined
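
In Python, the name after "except" is evaluated as the exception class to
catch, so "except e:" looks up a variable named e at the moment the exception
fires and raises NameError if it was never bound. A minimal standalone sketch
of the broken and fixed patterns (illustration only, not the actual test code):

    def poll_broken():
        try:
            raise IOError("debug webpage not up")
        except e:  # NameError: 'e' is evaluated here as a class name
            pass

    def poll_fixed():
        try:
            raise IOError("debug webpage not up")
        except Exception:  # matches the fix applied below
            pass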

Change-Id: If507409b8945b16a9510bb6195343eed7d8538fc
Reviewed-on: http://gerrit.cloudera.org:8080/10049
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/bc6c3c74
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/bc6c3c74
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/bc6c3c74

Branch: refs/heads/master
Commit: bc6c3c7447592ed2f17de41d8002207e7aee3d57
Parents: ce09269
Author: Philip Zeyliger <ph...@cloudera.com>
Authored: Thu Apr 12 13:00:06 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Fri Apr 13 23:51:36 2018 +0000

----------------------------------------------------------------------
 tests/custom_cluster/test_query_concurrency.py | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/bc6c3c74/tests/custom_cluster/test_query_concurrency.py
----------------------------------------------------------------------
diff --git a/tests/custom_cluster/test_query_concurrency.py b/tests/custom_cluster/test_query_concurrency.py
index 53bc72b..63cb173 100644
--- a/tests/custom_cluster/test_query_concurrency.py
+++ b/tests/custom_cluster/test_query_concurrency.py
@@ -16,13 +16,9 @@
 # under the License.
 
 import pytest
-import requests
 import time
-from time import localtime, strftime
 from threading import Thread
-from tests.beeswax.impala_beeswax import ImpalaBeeswaxException
 from tests.common.custom_cluster_test_suite import CustomClusterTestSuite
-from tests.common.impala_cluster import ImpalaCluster
 from tests.common.skip import SkipIfBuildType
 
 @SkipIfBuildType.not_dev_build
@@ -55,7 +51,7 @@ class TestQueryConcurrency(CustomClusterTestSuite):
     while time.time() - start < self.POLLING_TIMEOUT_S:
       try:
         impalad.service.read_debug_webpage("query_plan?query_id=" + query_id)
-      except e:
+      except Exception:
         pass
       time.sleep(1)
 


[7/7] impala git commit: IMPALA-6514: [DOCS] impala-shell option for load balancer and Kerberos

Posted by jo...@apache.org.
IMPALA-6514: [DOCS] impala-shell option for load balancer and Kerberos

Change-Id: I50d2063bfbe4838692777e2019ee3f3a991dfc21
Reviewed-on: http://gerrit.cloudera.org:8080/10047
Reviewed-by: Vincent Tran <vt...@cloudera.com>
Reviewed-by: Alex Rodoni <ar...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/5960d1b3
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/5960d1b3
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/5960d1b3

Branch: refs/heads/master
Commit: 5960d1b364a661a81c4513a33b6e9470282de162
Parents: e53bf27
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Apr 12 11:55:18 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Mon Apr 16 01:50:14 2018 +0000

----------------------------------------------------------------------
 docs/topics/impala_proxy.xml         | 40 +++++++++++++++++++++++++++----
 docs/topics/impala_shell_options.xml | 29 ++++++++++++++++++++++
 2 files changed, 64 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/5960d1b3/docs/topics/impala_proxy.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_proxy.xml b/docs/topics/impala_proxy.xml
index 1f5bb4b..588fada 100644
--- a/docs/topics/impala_proxy.xml
+++ b/docs/topics/impala_proxy.xml
@@ -238,11 +238,41 @@ under the License.
         verify that the host they are connecting to is the same one that is
         actually processing the request, to prevent man-in-the-middle attacks.
       </p>
-      <note>
-          Once you enable a proxy server in a Kerberized cluster, users will not
-          be able to connect to individual impala daemons directly from impala
-          shell.
-      </note>
+      <p>
+        In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower
+        versions, once you enable a proxy server in a Kerberized cluster, users
+        will not be able to connect to individual impala daemons directly from
+        impala-shell.
+      </p>
+
+      <p>
+        In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher,
+        if you enable a proxy server in a Kerberized cluster, users have an
+        option to connect to Impala daemons directly from
+          <cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> /
+          <codeph>--kerberos_host_fqdn</codeph> option when you start
+          <cmdname>impala-shell</cmdname>. This option can be used for testing or
+        troubleshooting purposes, but is not recommended for live production
+        environments as it defeats the purpose of a load balancer/proxy.
+      </p>
+
+      <p>
+        Example:
+<codeblock>
+impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
+</codeblock>
+      </p>
+
+      <p>
+        Alternatively, using the full option names:
+<codeblock>impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com</codeblock>
+      </p>
+      <p>
+        See <xref href="impala_shell_options.xml#shell_options"/> for
+        information about the option.
+      </p>
+
       <p>
         To clarify that the load-balancing proxy server is legitimate, perform
         these extra Kerberos setup steps:

http://git-wip-us.apache.org/repos/asf/impala/blob/5960d1b3/docs/topics/impala_shell_options.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_shell_options.xml b/docs/topics/impala_shell_options.xml
index d0407c9..73e2711 100644
--- a/docs/topics/impala_shell_options.xml
+++ b/docs/topics/impala_shell_options.xml
@@ -106,6 +106,35 @@ under the License.
             <row>
               <entry>
                 <p>
+                  -b or
+                </p>
+                <p>
+                  --kerberos_host_fqdn
+                </p>
+              </entry>
+              <entry>
+                <p>
+                  kerberos_host_fqdn=
+                </p>
+                <p>
+                  <varname>load-balancer-hostname</varname>
+                </p>
+              </entry>
+              <entry>
+                <p>
+                  If set, the setting overrides the expected hostname of the
+                  Impala daemon's Kerberos service principal.
+                    <cmdname>impala-shell</cmdname> will check that the server's
+                  principal matches this hostname. This may be used when
+                    <codeph>impalad</codeph> is configured to be accessed via a
+                  load-balancer, but it is desired for impala-shell to talk to a
+                  specific <codeph>impalad</codeph> directly.
+                </p>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <p>
                   --print_header
                 </p>
               </entry>


[2/7] impala git commit: IMPALA-6372: Go parallel for Hive dataload

Posted by jo...@apache.org.
IMPALA-6372: Go parallel for Hive dataload

This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to go parallel on these
separate Hive SQL files. For correctness, the text
version of all tables must be loaded before any
of the other file formats.
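
A condensed sketch of how the generated Hive files can be split so the text
load runs first (hypothetical sql_files list; the real matching logic appears
in the load-data.py diff below):

    hive_text_files = []
    hive_nontext_files = []
    for f in sql_files:
        # Text tables use the text-none-none format suffix and must load
        # first, since the other formats are populated by inserts from them.
        if 'text-none-none' in f:
            hive_text_files.append(f)
        else:
            hive_nontext_files.append(f)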

load-data.py also runs the DDLs that create the tables in
Impala in parallel. Previously, minor dependencies required
the text tables to be created before the other table
formats. This change updates the definitions of some tables
in testdata/datasets/functional/functional_schema_template.sql
to remove those dependencies, so the DDLs for the text
tables can now run in parallel with the other file formats.

To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.
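
The pattern is roughly the following sketch, simplified from the
exec_query_files_parallel function in the diff below: a fixed-size pool runs
one SQL file per worker, and imap_unordered lets the driver abort as soon as
any file reports failure:

    import sys
    from multiprocessing.pool import ThreadPool

    def run_files_parallel(run_one_file, sql_files, num_processes):
        # run_one_file returns True on success, False on failure.
        pool = ThreadPool(processes=num_processes)
        for success in pool.imap_unordered(run_one_file, sql_files):
            if not success:
                pool.terminate()
                sys.exit(1)
        pool.close()
        pool.join()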

This also modifies the locations that issue INVALIDATE
METADATA to use REFRESH where possible, and eliminates
global invalidates.
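
The choice of statement reduces to whether the table was created or loaded
through Hive, as in this sketch mirroring the generate-schema-statements.py
change below:

    def metadata_sync_stmt(db, table, loaded_via_hive):
        # Tables created in Hive are unknown to Impala and need a full
        # invalidate; for everything else a cheaper refresh is enough.
        if loaded_via_hive:
            return "INVALIDATE METADATA {0}.{1};".format(db, table)
        return "REFRESH {0}.{1};".format(db, table)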

For debuggability, each SQL execution writes its output to
a separate log file rather than to standard out. If an
error occurs, the script points to the relevant log file.
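
Concretely, each command's stdout and stderr can be redirected to a per-file
log, as in this sketch of the exec_cmd change in the diff:

    import subprocess

    def exec_cmd_logged(cmd, out_file):
        # Send both stdout and stderr to the per-file log; the caller
        # reports out_file's path if the command fails.
        with open(out_file, 'w') as f:
            return subprocess.call(cmd, shell=True, stdout=f, stderr=f) == 0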

This saves about 10-15 minutes on dataload (including
for GVO).

Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/d481cd48
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/d481cd48
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/d481cd48

Branch: refs/heads/master
Commit: d481cd4842e8d92dd77cdd7f70720ff0b696dfbb
Parents: bc6c3c7
Author: Joe McDonnell <jo...@cloudera.com>
Authored: Wed Dec 20 10:29:10 2017 -0800
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Sat Apr 14 00:16:26 2018 +0000

----------------------------------------------------------------------
 bin/load-data.py                                | 395 +++++++++++++------
 testdata/bin/generate-schema-statements.py      | 140 +++++--
 testdata/bin/load_nested.py                     |  33 +-
 .../functional/functional_schema_template.sql   |  10 +-
 4 files changed, 398 insertions(+), 180 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/d481cd48/bin/load-data.py
----------------------------------------------------------------------
diff --git a/bin/load-data.py b/bin/load-data.py
index ed51487..28a504f 100755
--- a/bin/load-data.py
+++ b/bin/load-data.py
@@ -23,8 +23,10 @@
 import collections
 import getpass
 import logging
+import multiprocessing
 import os
 import re
+import shutil
 import sqlparse
 import subprocess
 import sys
@@ -32,15 +34,11 @@ import tempfile
 import time
 import traceback
 
-from itertools import product
 from optparse import OptionParser
-from Queue import Queue
 from tests.beeswax.impala_beeswax import *
-from threading import Thread
+from multiprocessing.pool import ThreadPool
 
-logging.basicConfig()
 LOG = logging.getLogger('load-data.py')
-LOG.setLevel(logging.DEBUG)
 
 parser = OptionParser()
 parser.add_option("-e", "--exploration_strategy", dest="exploration_strategy",
@@ -80,6 +78,8 @@ parser.add_option("--use_kerberos", action="store_true", default=False,
                   help="Load data on a kerberized cluster.")
 parser.add_option("--principal", default=None, dest="principal",
                   help="Kerberos service principal, required if --use_kerberos is set")
+parser.add_option("--num_processes", default=multiprocessing.cpu_count(),
+                  dest="num_processes", help="Number of parallel processes to use.")
 
 options, args = parser.parse_args()
 
@@ -111,21 +111,6 @@ if options.use_kerberos:
 HIVE_ARGS = '-n %s -u "jdbc:hive2://%s/default;%s" --verbose=true'\
     % (getpass.getuser(), options.hive_hs2_hostport, hive_auth)
 
-# When HiveServer2 is configured to use "local" mode (i.e., MR jobs are run
-# in-process rather than on YARN), Hadoop's LocalDistributedCacheManager has a
-# race, wherein it tires to localize jars into
-# /tmp/hadoop-$USER/mapred/local/<millis>. Two simultaneous Hive queries
-# against HS2 can conflict here. Weirdly LocalJobRunner handles a similar issue
-# (with the staging directory) by appending a random number. To over come this,
-# in the case that HS2 is on the local machine (which we conflate with also
-# running MR jobs locally), we move the temporary directory into a unique
-# directory via configuration. This block can be removed when
-# https://issues.apache.org/jira/browse/MAPREDUCE-6441 is resolved.
-# A similar workaround is used in tests/common/impala_test_suite.py.
-if options.hive_hs2_hostport.startswith("localhost:"):
-  HIVE_ARGS += ' --hiveconf "mapreduce.cluster.local.dir=%s"' % (tempfile.mkdtemp(
-    prefix="impala-data-load-"))
-
 HADOOP_CMD = os.path.join(os.environ['HADOOP_HOME'], 'bin/hadoop')
 
 def available_workloads(workload_dir):
@@ -135,70 +120,112 @@ def available_workloads(workload_dir):
 def validate_workloads(all_workloads, workloads):
   for workload in workloads:
     if workload not in all_workloads:
-      print 'Workload \'%s\' not found in workload directory' % workload
-      print 'Available workloads: ' + ', '.join(all_workloads)
+      LOG.error('Workload \'%s\' not found in workload directory' % workload)
+      LOG.error('Available workloads: ' + ', '.join(all_workloads))
       sys.exit(1)
 
-def exec_cmd(cmd, error_msg, exit_on_error=True):
-  ret_val = -1
-  try:
+def exec_cmd(cmd, error_msg=None, exit_on_error=True, out_file=None):
+  """Run the given command in the shell returning whether the command
+     succeeded. If 'error_msg' is set, log the error message on failure.
+     If 'exit_on_error' is True, exit the program on failure.
+     If 'out_file' is specified, log all output to that file."""
+  success = True
+  if out_file:
+    with open(out_file, 'w') as f:
+      ret_val = subprocess.call(cmd, shell=True, stderr=f, stdout=f)
+  else:
     ret_val = subprocess.call(cmd, shell=True)
-  except Exception as e:
-    error_msg = "%s: %s" % (error_msg, str(e))
-  finally:
-    if ret_val != 0:
-      print error_msg
-      if exit_on_error: sys.exit(ret_val)
-  return ret_val
-
-def exec_hive_query_from_file(file_name):
-  if not os.path.exists(file_name): return
-  hive_cmd = "%s %s -f %s" % (HIVE_CMD, HIVE_ARGS, file_name)
-  print 'Executing Hive Command: %s' % hive_cmd
-  exec_cmd(hive_cmd,  'Error executing file from Hive: ' + file_name)
+  if ret_val != 0:
+    if error_msg: LOG.info(error_msg)
+    if exit_on_error: sys.exit(ret_val)
+    success = False
+  return success
+
+def exec_hive_query_from_file_beeline(file_name):
+  if not os.path.exists(file_name):
+    LOG.info("Error: File {0} not found".format(file_name))
+    return False
+
+  LOG.info("Beginning execution of hive SQL: {0}".format(file_name))
+
+  # When HiveServer2 is configured to use "local" mode (i.e., MR jobs are run
+  # in-process rather than on YARN), Hadoop's LocalDistributedCacheManager has a
+  # race, wherein it tries to localize jars into
+  # /tmp/hadoop-$USER/mapred/local/<millis>. Two simultaneous Hive queries
+  # against HS2 can conflict here. Weirdly LocalJobRunner handles a similar issue
+  # (with the staging directory) by appending a random number. To overcome this,
+  # in the case that HS2 is on the local machine (which we conflate with also
+  # running MR jobs locally), we move the temporary directory into a unique
+  # directory via configuration. This block can be removed when
+  # https://issues.apache.org/jira/browse/MAPREDUCE-6441 is resolved.
+  hive_args = HIVE_ARGS
+  unique_dir = None
+  if options.hive_hs2_hostport.startswith("localhost:"):
+    unique_dir = tempfile.mkdtemp(prefix="hive-data-load-")
+    hive_args += ' --hiveconf "mapreduce.cluster.local.dir=%s"' % unique_dir
+
+  output_file = file_name + ".log"
+  hive_cmd = "{0} {1} -f {2}".format(HIVE_CMD, hive_args, file_name)
+  is_success = exec_cmd(hive_cmd, exit_on_error=False, out_file=output_file)
+  if unique_dir: shutil.rmtree(unique_dir)
+
+  if is_success:
+    LOG.info("Finished execution of hive SQL: {0}".format(file_name))
+  else:
+    LOG.info("Error executing hive SQL: {0} See: {1}".format(file_name, \
+             output_file))
+
+  return is_success
 
 def exec_hbase_query_from_file(file_name):
   if not os.path.exists(file_name): return
   hbase_cmd = "hbase shell %s" % file_name
-  print 'Executing HBase Command: %s' % hbase_cmd
-  exec_cmd(hbase_cmd, 'Error executing hbase create commands')
+  LOG.info('Executing HBase Command: %s' % hbase_cmd)
+  exec_cmd(hbase_cmd, error_msg='Error executing hbase create commands')
 
 # KERBEROS TODO: fails when kerberized and impalad principal isn't "impala"
 def exec_impala_query_from_file(file_name):
   """Execute each query in an Impala query file individually"""
+  if not os.path.exists(file_name):
+    LOG.info("Error: File {0} not found".format(file_name))
+    return False
+
+  LOG.info("Beginning execution of impala SQL: {0}".format(file_name))
   is_success = True
   impala_client = ImpalaBeeswaxClient(options.impalad, use_kerberos=options.use_kerberos)
-  try:
-    impala_client.connect()
-    with open(file_name, 'r+') as query_file:
-      queries = sqlparse.split(query_file.read())
-    for query in queries:
-      query = sqlparse.format(query.rstrip(';'), strip_comments=True)
-      print '(%s):\n%s\n' % (file_name, query.strip())
-      if query.strip() != "":
-        result = impala_client.execute(query)
-  except Exception as e:
-    print "Data Loading from Impala failed with error: %s" % str(e)
-    traceback.print_exc()
-    is_success = False
-  finally:
-    impala_client.close_connection()
-  return is_success
+  output_file = file_name + ".log"
+  with open(output_file, 'w') as out_file:
+    try:
+      impala_client.connect()
+      with open(file_name, 'r+') as query_file:
+        queries = sqlparse.split(query_file.read())
+        for query in queries:
+          query = sqlparse.format(query.rstrip(';'), strip_comments=True)
+          if query.strip() != "":
+            result = impala_client.execute(query)
+            out_file.write("{0}\n{1}\n".format(query, result))
+    except Exception as e:
+      out_file.write("ERROR: {0}\n".format(query))
+      traceback.print_exc(file=out_file)
+      is_success = False
 
-def exec_bash_script(file_name):
-  bash_cmd = "bash %s" % file_name
-  print 'Executing Bash Command: ' + bash_cmd
-  exec_cmd(bash_cmd, 'Error bash script: ' + file_name)
+  if is_success:
+    LOG.info("Finished execution of impala SQL: {0}".format(file_name))
+  else:
+    LOG.info("Error executing impala SQL: {0} See: {1}".format(file_name, \
+             output_file))
+
+  return is_success
 
 def run_dataset_preload(dataset):
   """Execute a preload script if present in dataset directory. E.g. to generate data
   before loading"""
   dataset_preload_script = os.path.join(DATASET_DIR, dataset, "preload")
   if os.path.exists(dataset_preload_script):
-    print("Running preload script for " + dataset)
+    LOG.info("Running preload script for " + dataset)
     if options.scale_factor > 1:
       dataset_preload_script += " " + str(options.scale_factor)
-    exec_cmd(dataset_preload_script, "Error executing preload script for " + dataset,
+    exec_cmd(dataset_preload_script, error_msg="Error executing preload script for " + dataset,
         exit_on_error=True)
 
 def generate_schema_statements(workload):
@@ -215,29 +242,29 @@ def generate_schema_statements(workload):
   if options.hdfs_namenode is not None:
     generate_cmd += " --hdfs_namenode=%s" % options.hdfs_namenode
   generate_cmd += " --backend=%s" % options.impalad
-  print 'Executing Generate Schema Command: ' + generate_cmd
+  LOG.info('Executing Generate Schema Command: ' + generate_cmd)
   schema_cmd = os.path.join(TESTDATA_BIN_DIR, generate_cmd)
   error_msg = 'Error generating schema statements for workload: ' + workload
-  exec_cmd(schema_cmd, error_msg)
+  exec_cmd(schema_cmd, error_msg=error_msg)
 
 def get_dataset_for_workload(workload):
   dimension_file_name = os.path.join(WORKLOAD_DIR, workload,
                                      '%s_dimensions.csv' % workload)
   if not os.path.isfile(dimension_file_name):
-    print 'Dimension file not found: ' + dimension_file_name
+    LOG.error('Dimension file not found: ' + dimension_file_name)
     sys.exit(1)
   with open(dimension_file_name, 'rb') as input_file:
     match = re.search('dataset:\s*([\w\-\.]+)', input_file.read())
     if match:
       return match.group(1)
     else:
-      print 'Dimension file does not contain dataset for workload \'%s\'' % (workload)
+      LOG.error('Dimension file does not contain dataset for workload \'%s\'' % (workload))
       sys.exit(1)
 
 def copy_avro_schemas_to_hdfs(schemas_dir):
   """Recursively copies all of schemas_dir to the test warehouse."""
   if not os.path.exists(schemas_dir):
-    print 'Avro schema dir (%s) does not exist. Skipping copy to HDFS.' % schemas_dir
+    LOG.info('Avro schema dir (%s) does not exist. Skipping copy to HDFS.' % schemas_dir)
     return
 
   exec_hadoop_fs_cmd("-mkdir -p " + options.hive_warehouse_dir)
@@ -245,41 +272,36 @@ def copy_avro_schemas_to_hdfs(schemas_dir):
 
 def exec_hadoop_fs_cmd(args, exit_on_error=True):
   cmd = "%s fs %s" % (HADOOP_CMD, args)
-  print "Executing Hadoop command: " + cmd
-  exec_cmd(cmd, "Error executing Hadoop command, exiting",
+  LOG.info("Executing Hadoop command: " + cmd)
+  exec_cmd(cmd, error_msg="Error executing Hadoop command, exiting",
       exit_on_error=exit_on_error)
 
-def exec_impala_query_from_file_parallel(query_files):
-  # Get the name of the query file that loads the base tables, if it exists.
-  # TODO: Find a better way to detect the file that loads the base tables.
-  create_base_table_file = next((q for q in query_files if 'text' in q), None)
-  if create_base_table_file:
-    is_success = exec_impala_query_from_file(create_base_table_file)
-    query_files.remove(create_base_table_file)
-    # If loading the base tables failed, exit with a non zero error code.
-    if not is_success: sys.exit(1)
-  if not query_files: return
-  threads = []
-  result_queue = Queue()
-  for query_file in query_files:
-    thread = Thread(target=lambda x: result_queue.put(exec_impala_query_from_file(x)),
-        args=[query_file])
-    thread.daemon = True
-    threads.append(thread)
-    thread.start()
-  # Keep looping until the number of results retrieved is the same as the number of
-  # threads spawned, or until a data loading query fails. result_queue.get() will
-  # block until a result is available in the queue.
-  num_fetched_results = 0
-  while num_fetched_results < len(threads):
-    success = result_queue.get()
-    num_fetched_results += 1
-    if not success: sys.exit(1)
-  # There is a small window where a thread may still be alive even if all the threads have
-  # finished putting their results in the queue.
-  for thread in threads: thread.join()
-
-if __name__ == "__main__":
+def exec_query_files_parallel(thread_pool, query_files, execution_type):
+  """Executes the query files provided using the execution engine specified
+     in parallel using the given thread pool. Aborts immediately if any execution
+     encounters an error."""
+  assert(execution_type == 'impala' or execution_type == 'hive')
+  if len(query_files) == 0: return
+  if execution_type == 'impala':
+    execution_function = exec_impala_query_from_file
+  elif execution_type == 'hive':
+    execution_function = exec_hive_query_from_file_beeline
+
+  for result in thread_pool.imap_unordered(execution_function, query_files):
+    if not result:
+      thread_pool.terminate()
+      sys.exit(1)
+
+def impala_exec_query_files_parallel(thread_pool, query_files):
+  exec_query_files_parallel(thread_pool, query_files, 'impala')
+
+def hive_exec_query_files_parallel(thread_pool, query_files):
+  exec_query_files_parallel(thread_pool, query_files, 'hive')
+
+def main():
+  logging.basicConfig(format='%(asctime)s %(message)s', datefmt='%H:%M:%S')
+  LOG.setLevel(logging.DEBUG)
+
   # Having the actual command line at the top of each data-load-* log can help
   # when debugging dataload issues.
   #
@@ -288,62 +310,185 @@ if __name__ == "__main__":
   all_workloads = available_workloads(WORKLOAD_DIR)
   workloads = []
   if options.workloads is None:
-    print "At least one workload name must be specified."
+    LOG.error("At least one workload name must be specified.")
     parser.print_help()
     sys.exit(1)
   elif options.workloads == 'all':
-    print 'Loading data for all workloads.'
+    LOG.info('Loading data for all workloads.')
     workloads = all_workloads
   else:
     workloads = options.workloads.split(",")
     validate_workloads(all_workloads, workloads)
 
-  print 'Starting data load for the following workloads: ' + ', '.join(workloads)
+  LOG.info('Starting data load for the following workloads: ' + ', '.join(workloads))
+  LOG.info('Running with {0} threads'.format(options.num_processes))
 
+  # Note: The processes are in whatever the caller's directory is, so all paths
+  #       passed to the pool need to be absolute paths. This will allow the pool
+  #       to be used for different workloads (and thus different directories)
+  #       simultaneously.
+  thread_pool = ThreadPool(processes=options.num_processes)
   loading_time_map = collections.defaultdict(float)
   for workload in workloads:
     start_time = time.time()
     dataset = get_dataset_for_workload(workload)
     run_dataset_preload(dataset)
+    # This script is tightly coupled with testdata/bin/generate-schema-statements.py
+    # Specifically, this script is expecting the following:
+    # 1. generate-schema-statements.py generates files and puts them in the
+    #    directory ${IMPALA_DATA_LOADING_SQL_DIR}/${workload}
+    #    (e.g. ${IMPALA_HOME}/logs/data_loading/sql/tpch)
+    # 2. generate-schema-statements.py populates the subdirectory
+    #    avro_schemas/${workload} with JSON files specifying the Avro schema for the
+    #    tables being loaded.
+    # 3. generate-schema-statements.py uses a particular naming scheme to distinguish
+    #    between SQL files of different load phases.
+    #
+    #    Using the following variables:
+    #    workload_exploration = ${workload}-${exploration_strategy} and
+    #    file_format_suffix = ${file_format}-${codec}-${compression_type}
+    #
+    #    A. Impala table creation scripts run in Impala to create tables, partitions,
+    #       and views. There is one for each file format. They take the form:
+    #       create-${workload_exploration}-impala-generated-${file_format_suffix}.sql
+    #
+    #    B. Hive creation/load scripts run in Hive to load data into tables and create
+    #       tables or views that Impala does not support. There is one for each
+    #       file format. They take the form:
+    #       load-${workload_exploration}-hive-generated-${file_format_suffix}.sql
+    #
+    #    C. HBase creation script runs through the hbase commandline to create
+    #       HBase tables. (Only generated if loading HBase table.) It takes the form:
+    #       load-${workload_exploration}-hbase-generated.create
+    #
+    #    D. HBase postload script runs through the hbase commandline to flush
+    #       HBase tables. (Only generated if loading HBase table.) It takes the form:
+    #       post-load-${workload_exploration}-hbase-generated.sql
+    #
+    #    E. Impala load scripts run in Impala to load data. Only Parquet and Kudu
+    #       are loaded through Impala. There is one for each of those formats loaded.
+    #       They take the form:
+    #       load-${workload_exploration}-impala-generated-${file_format_suffix}.sql
+    #
+    #    F. Invalidation script runs through Impala to invalidate/refresh metadata
+    #       for tables. It takes the form:
+    #       invalidate-${workload_exploration}-impala-generated.sql
     generate_schema_statements(workload)
+
+    # Determine the directory from #1
     sql_dir = os.path.join(SQL_OUTPUT_DIR, dataset)
     assert os.path.isdir(sql_dir),\
       ("Could not find the generated SQL files for loading dataset '%s'.\
         \nExpected to find the SQL files in: %s" % (dataset, sql_dir))
-    os.chdir(os.path.join(SQL_OUTPUT_DIR, dataset))
-    copy_avro_schemas_to_hdfs(AVRO_SCHEMA_DIR)
-    dataset_dir_contents = os.listdir(os.getcwd())
-    load_file_substr = "%s-%s" % (workload, options.exploration_strategy)
-    # Data loading with Impala is done in parallel, each file format has a separate query
-    # file.
-    create_filename = 'create-%s-impala-generated' % load_file_substr
-    load_filename = 'load-%s-impala-generated' % load_file_substr
-    impala_create_files = [f for f in dataset_dir_contents if create_filename in f]
-    impala_load_files = [f for f in dataset_dir_contents if load_filename in f]
+
+    # Copy the avro schemas (see #2) into HDFS
+    avro_schemas_path = os.path.join(sql_dir, AVRO_SCHEMA_DIR)
+    copy_avro_schemas_to_hdfs(avro_schemas_path)
+
+    # List all of the files in the sql directory to sort out the various types of
+    # files (see #3).
+    dataset_dir_contents = [os.path.join(sql_dir, f) for f in os.listdir(sql_dir)]
+    workload_exploration = "%s-%s" % (workload, options.exploration_strategy)
+
+    # Remove the AVRO_SCHEMA_DIR from the list of files
+    if os.path.exists(avro_schemas_path):
+      dataset_dir_contents.remove(avro_schemas_path)
+
+    # Match for Impala create files (3.A)
+    impala_create_match = 'create-%s-impala-generated' % workload_exploration
+    # Match for Hive create/load files (3.B)
+    hive_load_match = 'load-%s-hive-generated' % workload_exploration
+    # Match for HBase creation script (3.C)
+    hbase_create_match = 'load-%s-hbase-generated.create' % workload_exploration
+    # Match for HBase post-load script (3.D)
+    hbase_postload_match = 'post-load-%s-hbase-generated.sql' % workload_exploration
+    # Match for Impala load scripts (3.E)
+    impala_load_match = 'load-%s-impala-generated' % workload_exploration
+    # Match for Impala invalidate script (3.F)
+    invalidate_match = 'invalidate-%s-impala-generated' % workload_exploration
+
+    impala_create_files = []
+    hive_load_text_files = []
+    hive_load_nontext_files = []
+    hbase_create_files = []
+    hbase_postload_files = []
+    impala_load_files = []
+    invalidate_files = []
+    for filename in dataset_dir_contents:
+      if impala_create_match in filename:
+        impala_create_files.append(filename)
+      elif hive_load_match in filename:
+        if 'text-none-none' in filename:
+          hive_load_text_files.append(filename)
+        else:
+          hive_load_nontext_files.append(filename)
+      elif hbase_create_match in filename:
+        hbase_create_files.append(filename)
+      elif hbase_postload_match in filename:
+        hbase_postload_files.append(filename)
+      elif impala_load_match in filename:
+        impala_load_files.append(filename)
+      elif invalidate_match in filename:
+        invalidate_files.append(filename)
+      else:
+        assert False, "Unexpected input file {0}".format(filename)
+
+    # Simple helper function to dump a header followed by the filenames
+    def log_file_list(header, file_list):
+      if (len(file_list) == 0): return
+      LOG.debug(header)
+      map(LOG.debug, map(os.path.basename, file_list))
+      LOG.debug("\n")
+
+    log_file_list("Impala Create Files:", impala_create_files)
+    log_file_list("Hive Load Text Files:", hive_load_text_files)
+    log_file_list("Hive Load Non-Text Files:", hive_load_nontext_files)
+    log_file_list("HBase Create Files:", hbase_create_files)
+    log_file_list("HBase Post-Load Files:", hbase_postload_files)
+    log_file_list("Impala Load Files:", impala_load_files)
+    log_file_list("Impala Invalidate Files:", invalidate_files)
 
     # Execute the data loading scripts.
     # Creating tables in Impala has no dependencies, so we execute them first.
     # HBase table inserts are done via hive, so the hbase tables need to be created before
-    # running the hive script. Some of the Impala inserts depend on hive tables,
+    # running the hive scripts. Some of the Impala inserts depend on hive tables,
     # so they're done at the end. Finally, the Hbase Tables that have been filled with data
     # need to be flushed.
-    exec_impala_query_from_file_parallel(impala_create_files)
-    exec_hbase_query_from_file('load-%s-hbase-generated.create' % load_file_substr)
-    exec_hive_query_from_file('load-%s-hive-generated.sql' % load_file_substr)
-    exec_hbase_query_from_file('post-load-%s-hbase-generated.sql' % load_file_substr)
+
+    impala_exec_query_files_parallel(thread_pool, impala_create_files)
+
+    # There should be at most one hbase creation script
+    assert(len(hbase_create_files) <= 1)
+    for hbase_create in hbase_create_files:
+      exec_hbase_query_from_file(hbase_create)
+
+    # If this is loading text tables plus multiple other formats, the text tables
+    # need to be loaded first
+    assert(len(hive_load_text_files) <= 1)
+    hive_exec_query_files_parallel(thread_pool, hive_load_text_files)
+    hive_exec_query_files_parallel(thread_pool, hive_load_nontext_files)
+
+    assert(len(hbase_postload_files) <= 1)
+    for hbase_postload in hbase_postload_files:
+      exec_hbase_query_from_file(hbase_postload)
 
     # Invalidate so that Impala sees the loads done by Hive before loading Parquet/Kudu
     # Note: This only invalidates tables for this workload.
-    invalidate_sql_file = 'invalidate-{0}-impala-generated.sql'.format(load_file_substr)
-    if impala_load_files: exec_impala_query_from_file(invalidate_sql_file)
-    exec_impala_query_from_file_parallel(impala_load_files)
+    assert(len(invalidate_files) <= 1)
+    if impala_load_files:
+      impala_exec_query_files_parallel(thread_pool, invalidate_files)
+      impala_exec_query_files_parallel(thread_pool, impala_load_files)
     # Final invalidate for this workload
-    exec_impala_query_from_file(invalidate_sql_file)
+    impala_exec_query_files_parallel(thread_pool, invalidate_files)
     loading_time_map[workload] = time.time() - start_time
 
   total_time = 0.0
+  thread_pool.close()
+  thread_pool.join()
   for workload, load_time in loading_time_map.iteritems():
     total_time += load_time
-    print 'Data loading for workload \'%s\' completed in: %.2fs'\
-        % (workload, load_time)
-  print 'Total load time: %.2fs\n' % total_time
+    LOG.info('Data loading for workload \'%s\' completed in: %.2fs'\
+        % (workload, load_time))
+  LOG.info('Total load time: %.2fs\n' % total_time)
+
+if __name__ == "__main__": main()

http://git-wip-us.apache.org/repos/asf/impala/blob/d481cd48/testdata/bin/generate-schema-statements.py
----------------------------------------------------------------------
diff --git a/testdata/bin/generate-schema-statements.py b/testdata/bin/generate-schema-statements.py
index 3f730e6..e039c48 100755
--- a/testdata/bin/generate-schema-statements.py
+++ b/testdata/bin/generate-schema-statements.py
@@ -16,30 +16,84 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-
-# This script generates the "CREATE TABLE", "INSERT", and "LOAD" statements for loading
-# test data and writes them to create-*-generated.sql and
-# load-*-generated.sql. These files are then executed by hive or impala, depending
-# on their contents. Additionally, for hbase, the file is of the form
-# create-*hbase*-generated.create.
 #
-# The statements that are generated are based on an input test vector
-# (read from a file) that describes the coverage desired. For example, currently
-# we want to run benchmarks with different data sets, across different file types, and
-# with different compression algorithms set. To improve data loading performance this
-# script will generate an INSERT INTO statement to generate the data if the file does
-# not already exist in HDFS. If the file does already exist in HDFS then we simply issue a
-# LOAD statement which is much faster.
+# This script generates statements to create and populate
+# tables in a variety of formats. The tables and formats are
+# defined through a combination of files:
+# 1. Workload format specifics specify for each workload
+#    which formats are part of core, exhaustive, etc.
+#    This operates via the normal test dimensions.
+#    (see tests/common/test_dimension.py and
+#     testdata/workloads/*/*.csv)
+# 2. Workload table availability constraints specify which
+#    tables exist for which formats.
+#    (see testdata/datasets/*/schema_constraints.csv)
+# The arguments to this script specify the workload and
+# exploration strategy and can optionally restrict it
+# further to individual tables.
+#
+# This script generates several SQL scripts to be
+# executed by bin/load-data.py. The two scripts are tightly
+# coupled and any change in files generated must be
+# reflected in bin/load-data.py. Currently, this script
+# generates three things:
+# 1. It creates the directory (destroying the existing
+#    directory if necessary)
+#    ${IMPALA_DATA_LOADING_SQL_DIR}/${workload}
+# 2. It creates and populates a subdirectory
+#    avro_schemas/${workload} with JSON files specifying
+#    the Avro schema for each table.
+# 3. It generates SQL files with the following naming schema:
+#
+#    Using the following variables:
+#    workload_exploration = ${workload}-${exploration_strategy} and
+#    file_format_suffix = ${file_format}-${codec}-${compression_type}
+#
+#    A. Impala table creation scripts run in Impala to create tables, partitions,
+#       and views. There is one for each file format. They take the form:
+#       create-${workload_exploration}-impala-generated-${file_format_suffix}.sql
+#
+#    B. Hive creation/load scripts run in Hive to load data into tables and create
+#       tables or views that Impala does not support. There is one for each
+#       file format. They take the form:
+#       load-${workload_exploration}-hive-generated-${file_format_suffix}.sql
+#
+#    C. HBase creation script runs through the hbase commandline to create
+#       HBase tables. (Only generated if loading HBase table.) It takes the form:
+#       load-${workload_exploration}-hbase-generated.create
+#
+#    D. HBase postload script runs through the hbase commandline to flush
+#       HBase tables. (Only generated if loading HBase table.) It takes the form:
+#       post-load-${workload_exploration}-hbase-generated.sql
 #
-# The input test vectors are generated via the generate_test_vectors.py so
-# ensure that script has been run (or the test vector files already exist) before
-# running this script.
+#    E. Impala load scripts run in Impala to load data. Only Parquet and Kudu
+#       are loaded through Impala. There is one for each of those formats loaded.
+#       They take the form:
+#       load-${workload_exploration}-impala-generated-${file_format_suffix}.sql
+#
+#    F. Invalidation script runs through Impala to invalidate/refresh metadata
+#       for tables. It takes the form:
+#       invalidate-${workload_exploration}-impala-generated.sql
+#
+# In summary, table "CREATE" statements are mostly done by Impala. Any "CREATE"
+# statements that Impala does not support are done through Hive. Loading data
+# into tables mostly runs in Hive except for Parquet and Kudu tables.
+# Loading proceeds in two parts: First, data is loaded into text tables.
+# Second, almost all other formats are populated by inserts from the text
+# table. Since data loaded in Hive may not be visible in Impala, all tables
+# need to have metadata refreshed or invalidated before access in Impala.
+# This means that loading Parquet or Kudu requires invalidating source
+# tables. It also means that invalidate needs to happen at the end of dataload.
+#
+# For tables requiring customized actions to create schemas or place data,
+# this script allows the table specification to include commands that
+# this script executes as part of generating the SQL for the table. If the command
+# generates output, that output is used for that section. This is useful
+# for custom tables that rely on loading specific files into HDFS or
+# for tables where specifying the schema is tedious (e.g. wide tables).
+# This should be used sparingly, because these commands are executed
+# serially.
 #
-# Note: This statement generation is assuming the following data loading workflow:
-# 1) Load all the data in the specified source table
-# 2) Create tables for the new file formats and compression types
-# 3) Run INSERT OVERWRITE TABLE SELECT * from the source table into the new tables
-#    or LOAD directly if the file already exists in HDFS.
 import collections
 import csv
 import glob
@@ -171,7 +225,7 @@ KNOWN_EXPLORATION_STRATEGIES = ['core', 'pairwise', 'exhaustive', 'lzo']
 def build_create_statement(table_template, table_name, db_name, db_suffix,
                            file_format, compression, hdfs_location,
                            force_reload):
-  create_stmt = 'CREATE DATABASE IF NOT EXISTS %s%s;\n' % (db_name, db_suffix)
+  create_stmt = ''
   if (force_reload):
     create_stmt += 'DROP TABLE IF EXISTS %s%s.%s;\n' % (db_name, db_suffix, table_name)
   if compression == 'lzo':
@@ -453,13 +507,13 @@ class Statements(object):
 
   def write_to_file(self, filename):
     # If there is no content to write, skip
-    if self.__is_empty(): return
+    if not self: return
     output = self.create + self.load_base + self.load
     with open(filename, 'w') as f:
       f.write('\n\n'.join(output))
 
-  def __is_empty(self):
-    return not (self.create or self.load or self.load_base)
+  def __nonzero__(self):
+    return bool(self.create or self.load or self.load_base)
 
 def eval_section(section_str):
   """section_str should be the contents of a section (i.e. a string). If section_str
@@ -481,7 +535,6 @@ def generate_statements(output_name, test_vectors, sections,
   # TODO: This method has become very unwieldy. It has to be re-factored sooner than
   # later.
   # Parquet statements to be executed separately by Impala
-  hive_output = Statements()
   hbase_output = Statements()
   hbase_post_load = Statements()
   impala_invalidate = Statements()
@@ -492,16 +545,18 @@ def generate_statements(output_name, test_vectors, sections,
   existing_tables = get_hdfs_subdirs_with_data(options.hive_warehouse_dir)
   for row in test_vectors:
     impala_create = Statements()
+    hive_output = Statements()
     impala_load = Statements()
     file_format, data_set, codec, compression_type =\
         [row.file_format, row.dataset, row.compression_codec, row.compression_type]
     table_format = '%s/%s/%s' % (file_format, codec, compression_type)
+    db_suffix = row.db_suffix()
+    db_name = '{0}{1}'.format(data_set, options.scale_factor)
+    db = '{0}{1}'.format(db_name, db_suffix)
+    create_db_stmt = 'CREATE DATABASE IF NOT EXISTS {0};\n'.format(db)
+    impala_create.create.append(create_db_stmt)
     for section in sections:
       table_name = section['BASE_TABLE_NAME'].strip()
-      db_suffix = row.db_suffix()
-      db_name = '{0}{1}'.format(data_set, options.scale_factor)
-      db = '{0}{1}'.format(db_name, db_suffix)
-
 
       if table_names and (table_name.lower() not in table_names):
         print 'Skipping table: %s.%s, table is not in specified table list' % (db, table_name)
@@ -640,8 +695,13 @@ def generate_statements(output_name, test_vectors, sections,
             column_families))
         hbase_post_load.load.append("flush '%s_hbase.%s'\n" % (db_name, table_name))
 
-      # Need to emit an "invalidate metadata" for each individual table
-      invalidate_table_stmt = "INVALIDATE METADATA {0}.{1};\n".format(db, table_name)
+      # Need to make sure that tables created and/or data loaded in Hive is seen
+      # in Impala. We only need to do a full invalidate if the table was created in Hive
+      # and Impala doesn't know about it. Otherwise, do a refresh.
+      if output == hive_output:
+        invalidate_table_stmt = "INVALIDATE METADATA {0}.{1};\n".format(db, table_name)
+      else:
+        invalidate_table_stmt = "REFRESH {0}.{1};\n".format(db, table_name)
       impala_invalidate.create.append(invalidate_table_stmt)
 
       # The ALTER statement in hive does not accept fully qualified table names so
@@ -701,16 +761,18 @@ def generate_statements(output_name, test_vectors, sections,
 
     impala_create.write_to_file("create-%s-impala-generated-%s-%s-%s.sql" %
         (output_name, file_format, codec, compression_type))
+    hive_output.write_to_file("load-%s-hive-generated-%s-%s-%s.sql" %
+        (output_name, file_format, codec, compression_type))
     impala_load.write_to_file("load-%s-impala-generated-%s-%s-%s.sql" %
         (output_name, file_format, codec, compression_type))
 
-
-  hive_output.write_to_file('load-' + output_name + '-hive-generated.sql')
-  hbase_output.create.append("exit")
-  hbase_output.write_to_file('load-' + output_name + '-hbase-generated.create')
-  hbase_post_load.load.append("exit")
-  hbase_post_load.write_to_file('post-load-' + output_name + '-hbase-generated.sql')
-  impala_invalidate.write_to_file('invalidate-' + output_name + '-impala-generated.sql')
+  if hbase_output:
+    hbase_output.create.append("exit")
+    hbase_output.write_to_file('load-' + output_name + '-hbase-generated.create')
+  if hbase_post_load:
+    hbase_post_load.load.append("exit")
+    hbase_post_load.write_to_file('post-load-' + output_name + '-hbase-generated.sql')
+  impala_invalidate.write_to_file("invalidate-" + output_name + "-impala-generated.sql")
 
 def parse_schema_template_file(file_name):
   VALID_SECTION_NAMES = ['DATASET', 'BASE_TABLE_NAME', 'COLUMNS', 'PARTITION_COLUMNS',

http://git-wip-us.apache.org/repos/asf/impala/blob/d481cd48/testdata/bin/load_nested.py
----------------------------------------------------------------------
diff --git a/testdata/bin/load_nested.py b/testdata/bin/load_nested.py
index 146c0ff..d391fdb 100755
--- a/testdata/bin/load_nested.py
+++ b/testdata/bin/load_nested.py
@@ -263,32 +263,43 @@ def load():
         TBLPROPERTIES('parquet.compression'='SNAPPY')
         AS SELECT * FROM tmp_customer;
 
-        DROP TABLE tmp_orders_string;
-        DROP TABLE tmp_customer_string;
-        DROP TABLE tmp_customer;
-
         CREATE TABLE region
         STORED AS PARQUET
         TBLPROPERTIES('parquet.compression'='SNAPPY')
         AS SELECT * FROM tmp_region;
 
-        DROP TABLE tmp_region_string;
-        DROP TABLE tmp_region;
-
         CREATE TABLE supplier
         STORED AS PARQUET
         TBLPROPERTIES('parquet.compression'='SNAPPY')
-        AS SELECT * FROM tmp_supplier;
+        AS SELECT * FROM tmp_supplier;""".split(";"):
+      if not stmt.strip():
+        continue
+      LOG.info("Executing: {0}".format(stmt))
+      hive.execute(stmt)
+
+  with cluster.impala.cursor(db_name=target_db) as impala:
+    # Drop the temporary tables. These temporary tables were created
+    # in Impala, so they exist in Impala's metadata. This drop is executed by
+    # Impala so that the metadata is automatically updated.
+    for stmt in """
+        DROP TABLE tmp_orders_string;
+        DROP TABLE tmp_customer_string;
+        DROP TABLE tmp_customer;
+
+        DROP TABLE tmp_region_string;
+        DROP TABLE tmp_region;
 
         DROP TABLE tmp_supplier;
         DROP TABLE tmp_supplier_string;""".split(";"):
       if not stmt.strip():
         continue
       LOG.info("Executing: {0}".format(stmt))
-      hive.execute(stmt)
+      impala.execute(stmt)
 
-  with cluster.impala.cursor(db_name=target_db) as impala:
-    impala.invalidate_metadata()
+    impala.invalidate_metadata(table_name="customer")
+    impala.invalidate_metadata(table_name="part")
+    impala.invalidate_metadata(table_name="region")
+    impala.invalidate_metadata(table_name="supplier")
     impala.compute_stats()
 
   LOG.info("Done loading nested TPCH data")

http://git-wip-us.apache.org/repos/asf/impala/blob/d481cd48/testdata/datasets/functional/functional_schema_template.sql
----------------------------------------------------------------------
diff --git a/testdata/datasets/functional/functional_schema_template.sql b/testdata/datasets/functional/functional_schema_template.sql
index a7a5eac..be666ee 100644
--- a/testdata/datasets/functional/functional_schema_template.sql
+++ b/testdata/datasets/functional/functional_schema_template.sql
@@ -242,16 +242,16 @@ functional
 ---- BASE_TABLE_NAME
 alltypesinsert
 ---- CREATE
-CREATE TABLE IF NOT EXISTS {db_name}{db_suffix}.{table_name} LIKE {db_name}.alltypes
-STORED AS {file_format};
+CREATE TABLE IF NOT EXISTS {db_name}{db_suffix}.{table_name}
+LIKE {db_name}{db_suffix}.alltypes STORED AS {file_format};
 ====
 ---- DATASET
 functional
 ---- BASE_TABLE_NAME
 alltypesnopart_insert
 ---- CREATE
-CREATE TABLE IF NOT EXISTS {db_name}{db_suffix}.{table_name} like {db_name}.alltypesnopart
-STORED AS {file_format};
+CREATE TABLE IF NOT EXISTS {db_name}{db_suffix}.{table_name}
+LIKE {db_name}{db_suffix}.alltypesnopart STORED AS {file_format};
 ====
 ---- DATASET
 functional
@@ -2009,7 +2009,7 @@ functional
 ---- BASE_TABLE_NAME
 avro_unicode_nulls
 ---- CREATE_HIVE
-create external table if not exists {db_name}{db_suffix}.{table_name} like {db_name}.liketbl stored as avro LOCATION '/test-warehouse/avro_null_char';
+create external table if not exists {db_name}{db_suffix}.{table_name} like {db_name}{db_suffix}.liketbl stored as avro LOCATION '/test-warehouse/avro_null_char';
 ---- LOAD
 `hdfs dfs -mkdir -p /test-warehouse/avro_null_char && \
 hdfs dfs -put -f ${IMPALA_HOME}/testdata/avro_null_char/000000_0 /test-warehouse/avro_null_char/


[3/7] impala git commit: IMPALA-6463: [DOCS] Removed query options were deleted from docs

Posted by jo...@apache.org.
IMPALA-6463: [DOCS] Removed query options were deleted from docs

Docs and references for the following query options were removed:
- DEFAULT_ORDER_BY_LIMIT
- ABORT_ON_DEFAULT_LIMIT_EXCEEDED
- V_CPU_CORES (previously removed)
- RESERVATION_REQUEST_TIMEOUT (previously removed)
- RM_INITIAL_MEM
- SCAN_NODE_CODEGEN_THRESHOLD
- MAX_IO_BUFFERS
- DISABLE_CACHED_READS

Change-Id: I71be3f872468cb22583f82c2238bf72dc82cb750
Cherry-picks: not for 2.x.
Reviewed-on: http://gerrit.cloudera.org:8080/10055
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/b4228dfd
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/b4228dfd
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/b4228dfd

Branch: refs/heads/master
Commit: b4228dfd14975059b2aac68792e63007b5c4e1e2
Parents: d481cd4
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Apr 12 16:31:32 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Sat Apr 14 00:28:46 2018 +0000

----------------------------------------------------------------------
 docs/impala.ditamap                             |  6 --
 docs/impala_keydefs.ditamap                     |  6 --
 .../impala_abort_on_default_limit_exceeded.xml  | 41 ---------
 docs/topics/impala_default_order_by_limit.xml   | 55 ------------
 docs/topics/impala_disable_cached_reads.xml     | 54 ------------
 docs/topics/impala_max_io_buffers.xml           | 49 -----------
 docs/topics/impala_order_by.xml                 |  9 --
 docs/topics/impala_rm_initial_mem.xml           | 47 ----------
 .../impala_scan_node_codegen_threshold.xml      | 93 --------------------
 9 files changed, 360 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 89d9553..e6c9c09 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -170,7 +170,6 @@ under the License.
       </topicref>
       <topicref href="topics/impala_set.xml">
         <topicref href="topics/impala_query_options.xml">
-          <topicref href="topics/impala_abort_on_default_limit_exceeded.xml"/>
           <topicref href="topics/impala_abort_on_error.xml"/>
           <topicref href="topics/impala_allow_unsupported_formats.xml"/>
           <topicref href="topics/impala_appx_count_distinct.xml"/>
@@ -180,9 +179,7 @@ under the License.
           <topicref href="topics/impala_debug_action.xml"/>
           <topicref rev="2.9.0" href="topics/impala_decimal_v2.xml"/>
           <topicref rev="2.9.0 IMPALA-5381" href="topics/impala_default_join_distribution_mode.xml"/>
-          <topicref href="topics/impala_default_order_by_limit.xml"/>
           <topicref rev="2.10.0 IMPALA-3200" href="topics/impala_default_spillable_buffer_size.xml"/>
-          <topicref audience="hidden" href="topics/impala_disable_cached_reads.xml"/>
           <topicref href="topics/impala_disable_codegen.xml"/>
           <topicref audience="hidden" href="topics/impala_disable_outermost_topn.xml"/>
           <topicref rev="2.5.0" href="topics/impala_disable_row_runtime_filtering.xml"/>
@@ -195,7 +192,6 @@ under the License.
           <topicref href="topics/impala_live_progress.xml"/>
           <topicref href="topics/impala_live_summary.xml"/>
           <topicref href="topics/impala_max_errors.xml"/>
-          <topicref href="topics/impala_max_io_buffers.xml"/>
           <topicref rev="2.10.0 IMPALA-3200" href="topics/impala_max_row_size.xml"/>
           <topicref rev="2.5.0" href="topics/impala_max_num_runtime_filters.xml"/>
           <topicref href="topics/impala_max_scan_range_length.xml"/>
@@ -214,14 +210,12 @@ under the License.
           <topicref href="topics/impala_query_timeout_s.xml"/>
           <topicref href="topics/impala_request_pool.xml"/>
           <topicref rev="2.7.0" href="topics/impala_replica_preference.xml"/>
-          <topicref audience="hidden" href="topics/impala_rm_initial_mem.xml"/>
           <topicref rev="2.5.0" href="topics/impala_runtime_bloom_filter_size.xml"/>
           <topicref rev="2.6.0" href="topics/impala_runtime_filter_max_size.xml"/>
           <topicref rev="2.6.0" href="topics/impala_runtime_filter_min_size.xml"/>
           <topicref rev="2.5.0" href="topics/impala_runtime_filter_mode.xml"/>
           <topicref rev="2.5.0" href="topics/impala_runtime_filter_wait_time_ms.xml"/>
           <topicref rev="2.6.0" href="topics/impala_s3_skip_insert_staging.xml"/>
-          <topicref rev="2.5.0" href="topics/impala_scan_node_codegen_threshold.xml"/>
           <topicref rev="2.5.0" href="topics/impala_schedule_random_replica.xml"/>
           <topicref rev="2.8.0 IMPALA-3671" href="topics/impala_scratch_limit.xml"/>
           <!-- This option is for internal use only and might go away without ever being documented. -->

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/impala_keydefs.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap
index be64501..0a2ac82 100644
--- a/docs/impala_keydefs.ditamap
+++ b/docs/impala_keydefs.ditamap
@@ -10768,7 +10768,6 @@ under the License.
   <keydef href="topics/impala_hints.xml" keys="hints"/>
   <keydef href="topics/impala_set.xml" keys="set"/>
   <keydef href="topics/impala_query_options.xml" keys="query_options"/>
-  <keydef href="topics/impala_abort_on_default_limit_exceeded.xml" keys="abort_on_default_limit_exceeded"/>
   <keydef href="topics/impala_abort_on_error.xml" keys="abort_on_error"/>
   <keydef href="topics/impala_allow_unsupported_formats.xml" keys="allow_unsupported_formats"/>
   <keydef href="topics/impala_appx_count_distinct.xml" keys="appx_count_distinct"/>
@@ -10777,9 +10776,7 @@ under the License.
   <keydef href="topics/impala_compression_codec.xml" keys="compression_codec"/>
   <keydef href="topics/impala_debug_action.xml" keys="debug_action"/>
   <keydef href="topics/impala_default_join_distribution_mode.xml" keys="default_join_distribution_mode"/>
-  <keydef href="topics/impala_default_order_by_limit.xml" keys="default_order_by_limit"/>
   <keydef rev="2.10.0 IMPALA-3200" href="topics/impala_default_spillable_buffer_size.xml" keys="default_spillable_buffer_size"/>
-  <keydef href="topics/impala_disable_cached_reads.xml" keys="disable_cached_reads"/>
   <keydef href="topics/impala_disable_codegen.xml" keys="disable_codegen"/>
   <keydef href="topics/impala_disable_outermost_topn.xml" keys="disable_outermost_topn"/>
   <keydef href="topics/impala_disable_row_runtime_filtering.xml" keys="disable_row_runtime_filtering"/>
@@ -10792,7 +10789,6 @@ under the License.
   <keydef href="topics/impala_live_progress.xml" keys="live_progress"/>
   <keydef href="topics/impala_live_summary.xml" keys="live_summary"/>
   <keydef href="topics/impala_max_errors.xml" keys="max_errors"/>
-  <keydef href="topics/impala_max_io_buffers.xml" keys="max_io_buffers"/>
   <keydef rev="2.10.0 IMPALA-3200" href="topics/impala_max_row_size.xml" keys="max_row_size"/>
   <keydef href="topics/impala_max_scan_range_length.xml" keys="max_scan_range_length"/>
   <keydef href="topics/impala_max_num_runtime_filters.xml" keys="max_num_runtime_filters"/>
@@ -10812,14 +10808,12 @@ under the License.
   <keydef href="topics/impala_request_pool.xml" keys="request_pool"/>
   <keydef href="topics/impala_schedule_random_replica.xml" keys="schedule_random_replica"/>
   <keydef href="topics/impala_replica_preference.xml" keys="replica_preference"/>
-  <keydef href="topics/impala_rm_initial_mem.xml" keys="rm_initial_mem"/>
   <keydef href="topics/impala_runtime_bloom_filter_size.xml" keys="runtime_bloom_filter_size"/>
   <keydef href="topics/impala_runtime_filter_max_size.xml" keys="runtime_filter_max_size"/>
   <keydef href="topics/impala_runtime_filter_min_size.xml" keys="runtime_filter_min_size"/>
   <keydef href="topics/impala_runtime_filter_mode.xml" keys="runtime_filter_mode"/>
   <keydef href="topics/impala_runtime_filter_wait_time_ms.xml" keys="runtime_filter_wait_time_ms"/>
   <keydef href="topics/impala_s3_skip_insert_staging.xml" keys="s3_skip_insert_staging"/>
-  <keydef href="topics/impala_scan_node_codegen_threshold.xml" keys="scan_node_codegen_threshold"/>
   <keydef href="topics/impala_scratch_limit.xml" keys="scratch_limit"/>
   <keydef href="topics/impala_support_start_over.xml" keys="support_start_over"/>
   <keydef href="topics/impala_sync_ddl.xml" keys="sync_ddl"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_abort_on_default_limit_exceeded.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_abort_on_default_limit_exceeded.xml b/docs/topics/impala_abort_on_default_limit_exceeded.xml
deleted file mode 100644
index ec6973b..0000000
--- a/docs/topics/impala_abort_on_default_limit_exceeded.xml
+++ /dev/null
@@ -1,41 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept rev="obwl" id="abort_on_default_limit_exceeded">
-
-  <title>ABORT_ON_DEFAULT_LIMIT_EXCEEDED Query Option</title>
-  <titlealts audience="PDF"><navtitle>ABORT_ON_DEFAULT_LIMIT_EXCEEDED</navtitle></titlealts>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Impala Query Options"/>
-      <data name="Category" value="Developers"/>
-      <data name="Category" value="Data Analysts"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p conref="../shared/impala_common.xml#common/obwl_query_options"/>
-
-    <p conref="../shared/impala_common.xml#common/type_boolean"/>
-    <p conref="../shared/impala_common.xml#common/default_false_0"/>
-  </conbody>
-</concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_default_order_by_limit.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_default_order_by_limit.xml b/docs/topics/impala_default_order_by_limit.xml
deleted file mode 100644
index 5c72126..0000000
--- a/docs/topics/impala_default_order_by_limit.xml
+++ /dev/null
@@ -1,55 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept rev="obwl" id="default_order_by_limit">
-
-  <title>DEFAULT_ORDER_BY_LIMIT Query Option</title>
-  <titlealts audience="PDF"><navtitle>DEFAULT_ORDER_BY_LIMIT</navtitle></titlealts>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Impala Query Options"/>
-      <data name="Category" value="Developers"/>
-      <data name="Category" value="Data Analysts"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p conref="../shared/impala_common.xml#common/obwl_query_options"/>
-
-    <p rev="1.4.0">
-      Prior to Impala 1.4.0, Impala queries that use the <codeph><xref href="impala_order_by.xml#order_by">ORDER
-      BY</xref></codeph> clause must also include a
-      <codeph><xref href="impala_limit.xml#limit">LIMIT</xref></codeph> clause, to avoid accidentally producing
-      huge result sets that must be sorted. Sorting a huge result set is a memory-intensive operation. In Impala
-      1.4.0 and higher, Impala uses a temporary disk work area to perform the sort if that operation would
-      otherwise exceed the Impala memory limit on a particular host.
-    </p>
-
-    <p>
-      <b>Type: numeric</b>
-    </p>
-
-    <p>
-      <b>Default:</b> -1 (no default limit)
-    </p>
-  </conbody>
-</concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_disable_cached_reads.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_disable_cached_reads.xml b/docs/topics/impala_disable_cached_reads.xml
deleted file mode 100644
index 20391db..0000000
--- a/docs/topics/impala_disable_cached_reads.xml
+++ /dev/null
@@ -1,54 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept id="disable_cached_reads" rev="1.4.0">
-
-  <title>DISABLE_CACHED_READS Query Option</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Impala Query Options"/>
-      <data name="Category" value="HDFS"/>
-      <data name="Category" value="HDFS Caching"/>
-      <data name="Category" value="Querying"/>
-      <data name="Category" value="Performance"/>
-      <data name="Category" value="Developers"/>
-      <data name="Category" value="Data Analysts"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p>
-      <indexterm audience="hidden">DISABLE_CACHED_READS query option</indexterm>
-      Prevents Impala from reading data files that are <q>pinned</q> in memory
-      through the HDFS caching feature. Primarily a debugging option for
-      cases where processing of HDFS cached data is concentrated on a single
-      host, leading to excessive CPU usage on that host.
-    </p>
-
-    <p conref="../shared/impala_common.xml#common/type_boolean"/>
-
-    <p conref="../shared/impala_common.xml#common/default_false"/>
-
-    <p conref="../shared/impala_common.xml#common/added_in_140"/>
-
-  </conbody>
-</concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_max_io_buffers.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_max_io_buffers.xml b/docs/topics/impala_max_io_buffers.xml
deleted file mode 100644
index 747b4d9..0000000
--- a/docs/topics/impala_max_io_buffers.xml
+++ /dev/null
@@ -1,49 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept id="max_io_buffers">
-
-  <title>MAX_IO_BUFFERS Query Option</title>
-  <titlealts audience="PDF"><navtitle>MAX_IO_BUFFERS</navtitle></titlealts>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Impala Query Options"/>
-      <data name="Category" value="Deprecated Features"/>
-      <data name="Category" value="Developers"/>
-      <data name="Category" value="Data Analysts"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p>
-      Deprecated query option. Currently has no effect.
-    </p>
-
-    <p>
-      <b>Type:</b> numeric
-    </p>
-
-    <p>
-      <b>Default:</b> 0
-    </p>
-  </conbody>
-</concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_order_by.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_order_by.xml b/docs/topics/impala_order_by.xml
index 1c5ac98..a72f74e 100644
--- a/docs/topics/impala_order_by.xml
+++ b/docs/topics/impala_order_by.xml
@@ -265,15 +265,6 @@ SELECT page_title AS "Page 3 of search results", page_url FROM search_content
         </p>
       </li>
 
-      <li>
-        <p>
-          The query options
-          <xref href="impala_default_order_by_limit.xml#default_order_by_limit">DEFAULT_ORDER_BY_LIMIT</xref> and
-          <xref href="impala_abort_on_default_limit_exceeded.xml#abort_on_default_limit_exceeded">ABORT_ON_DEFAULT_LIMIT_EXCEEDED</xref>,
-          which formerly controlled the behavior of <codeph>ORDER BY</codeph> queries with no limit specified, are
-          now ignored.
-        </p>
-      </li>
     </ul>
 
     <p rev="obwl" conref="../shared/impala_common.xml#common/null_sorting_change"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_rm_initial_mem.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_rm_initial_mem.xml b/docs/topics/impala_rm_initial_mem.xml
deleted file mode 100644
index fd9f819..0000000
--- a/docs/topics/impala_rm_initial_mem.xml
+++ /dev/null
@@ -1,47 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept id="rm_initial_mem" rev="2.5.0">
-
-  <title>RM_INITIAL_MEM Query Option</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Impala Query Options"/>
-      <data name="Category" value="Developers"/>
-      <data name="Category" value="Data Analysts"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p rev="2.5.0">
-      <indexterm audience="hidden">RM_INITIAL_MEM query option</indexterm>
-    </p>
-
-    <p>
-      <b>Type:</b>
-    </p>
-
-    <p>
-      <b>Default:</b>
-    </p>
-  </conbody>
-</concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/b4228dfd/docs/topics/impala_scan_node_codegen_threshold.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_scan_node_codegen_threshold.xml b/docs/topics/impala_scan_node_codegen_threshold.xml
deleted file mode 100644
index 40d1bc6..0000000
--- a/docs/topics/impala_scan_node_codegen_threshold.xml
+++ /dev/null
@@ -1,93 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
-<concept id="scan_node_codegen_threshold" rev="2.5.0 IMPALA-1755">
-
-  <title>SCAN_NODE_CODEGEN_THRESHOLD Query Option (<keyword keyref="impala25"/> or higher only)</title>
-  <titlealts audience="PDF"><navtitle>SCAN_NODE_CODEGEN_THRESHOLD</navtitle></titlealts>
-  <prolog>
-    <metadata>
-      <data name="Category" value="Impala"/>
-      <data name="Category" value="Impala Query Options"/>
-      <data name="Category" value="Performance"/>
-      <data name="Category" value="Developers"/>
-      <data name="Category" value="Data Analysts"/>
-    </metadata>
-  </prolog>
-
-  <conbody>
-
-    <p rev="2.5.0 IMPALA-1755">
-      <indexterm audience="hidden">SCAN_NODE_CODEGEN_THRESHOLD query option</indexterm>
-      The <codeph>SCAN_NODE_CODEGEN_THRESHOLD</codeph> query option
-      adjusts the aggressiveness of the code generation optimization process
-      when performing I/O read operations. It can help to work around performance problems
-      for queries where the table is small and the <codeph>WHERE</codeph> clause is complicated.
-    </p>
-
-    <p conref="../shared/impala_common.xml#common/type_integer"/>
-
-    <p>
-      <b>Default:</b> 1800000 (1.8 million)
-    </p>
-
-    <p conref="../shared/impala_common.xml#common/added_in_250"/>
-
-    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
-
-    <p>
-      This query option is intended mainly for the case where a query with a very complicated
-      <codeph>WHERE</codeph> clause, such as an <codeph>IN</codeph> operator with thousands
-      of entries, is run against a small table, especially a small table using Parquet format.
-      The code generation phase can become the dominant factor in the query response time,
-      making the query take several seconds even though there is relatively little work to do.
-      In this case, increase the value of this option to a much larger amount, anything up to
-      the maximum for a 32-bit integer.
-    </p>
-
-    <p>
-      Because this option only affects the code generation phase for the portion of the
-      query that performs I/O (the <term>scan nodes</term> within the query plan), it
-      lets you continue to keep code generation enabled for other queries, and other parts
-      of the same query, that can benefit from it. In contrast, the
-      <codeph>IMPALA_DISABLE_CODEGEN</codeph> query option turns off code generation entirely.
-    </p>
-
-    <p>
-      Because of the way the work for queries is divided internally, this option might not
-      affect code generation for all kinds of queries. If a plan fragment contains a scan
-      node and some other kind of plan node, code generation still occurs regardless of
-      this option setting.
-    </p>
-
-    <p>
-      To use this option effectively, you should be familiar with reading query profile output
-      to determine the proportion of time spent in the code generation phase, and whether
-      code generation is enabled or not for specific plan fragments.
-    </p>
-
-<!--
-    <p conref="../shared/impala_common.xml#common/related_info"/>
-    <p>
-    </p>
--->
-
-  </conbody>
-</concept>


[5/7] impala git commit: IMPALA-6120: Add thread timers for reporting codegen time

Posted by jo...@apache.org.
IMPALA-6120: Add thread timers for reporting codegen time

Add thread timers for accurate reporting of codegen time.
Also cleaned up a few places where elapsed time was being counted twice.

Sample Profile:

Query: SELECT count(*) FROM tpch_parquet.lineitem
WHERE l_partkey in (1,6,11,16,21,26,31,36,41);

CodeGen:(Total: 37.948ms, non-child: 37.948ms, % non-child: 100.00%)
   - CodegenInvoluntaryContextSwitches: 0 (0)
   - CodegenTotalWallClockTime: 37.942ms
     - CodegenSysTime: 0.000ns
     - CodegenUserTime: 36.938ms
   - CodegenVoluntaryContextSwitches: 0 (0)
   - CompileTime: 2.065ms
   - IrGenerationTime: 392.351us
   - LoadTime: 0.000ns
   - ModuleBitcodeSize: 2.26 MB (2373148)
   - NumFunctions: 22 (22)
   - NumInstructions: 381 (381)
   - OptimizationTime: 21.416ms
   - PeakMemoryUsage: 190.50 KB (195072)
   - PrepareTime: 13.496ms

Sample Profile with a 2-second sleep added to "OptimizationTime":

CodeGen:(Total: 2s037ms, non-child: 2s037ms, % non-child: 100.00%)
   - CodegenInvoluntaryContextSwitches: 0 (0)
   - CodegenTotalWallClockTime: 2s037ms
     - CodegenSysTime: 0.000ns
     - CodegenUserTime: 37.672ms
   - CodegenVoluntaryContextSwitches: 1 (1)
   - CompileTime: 2.032ms
   - IrGenerationTime: 386.948us
   - LoadTime: 0.000ns
   - ModuleBitcodeSize: 2.26 MB (2373148)
   - NumFunctions: 22 (22)
   - NumInstructions: 381 (381)
   - OptimizationTime: 2s023ms
   - PeakMemoryUsage: 190.50 KB (195072)
   - PrepareTime: 11.598ms
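
As an aside, the mechanism behind the CodegenUserTime/CodegenSysTime split can
be illustrated with a short RAII sketch that snapshots per-thread CPU usage on
scope entry and exit. This is illustrative only: it assumes Linux's
RUSAGE_THREAD and is not Impala's actual ThreadCounters /
SCOPED_THREAD_COUNTER_MEASUREMENT implementation.

// Illustrative sketch (assumes Linux): report wall, user, and system time
// for the calling thread when the scope ends.
#include <sys/resource.h>
#include <chrono>
#include <cstdio>

class ScopedThreadTimer {
 public:
  ScopedThreadTimer() : wall_start_(std::chrono::steady_clock::now()) {
    Snapshot(&user_start_, &sys_start_);
  }
  ~ScopedThreadTimer() {
    timeval user_end, sys_end;
    Snapshot(&user_end, &sys_end);
    long long wall_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - wall_start_).count();
    std::printf("wall=%lldms user=%ldms sys=%ldms\n", wall_ms,
                DeltaMs(user_start_, user_end), DeltaMs(sys_start_, sys_end));
  }

 private:
  static void Snapshot(timeval* user, timeval* sys) {
    rusage usage;
    getrusage(RUSAGE_THREAD, &usage);  // CPU time of the calling thread only
    *user = usage.ru_utime;
    *sys = usage.ru_stime;
  }
  static long DeltaMs(const timeval& a, const timeval& b) {
    return (b.tv_sec - a.tv_sec) * 1000L + (b.tv_usec - a.tv_usec) / 1000L;
  }
  timeval user_start_, sys_start_;
  std::chrono::steady_clock::time_point wall_start_;
};

int main() {
  ScopedThreadTimer timer;  // reports the three timings when it goes out of scope
  volatile double x = 0;
  for (int i = 0; i < 50000000; ++i) x += i * 0.5;  // burn some user CPU
  return 0;
}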

Change-Id: I24d5a46b8870bc959b89045432d2e86af72b30e5
Reviewed-on: http://gerrit.cloudera.org:8080/9960
Reviewed-by: Bikramjeet Vig <bi...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/fef527d6
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/fef527d6
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/fef527d6

Branch: refs/heads/master
Commit: fef527d60b1dc33cc083d44f0d697086dab4f361
Parents: 51cf5b2
Author: Bikramjeet Vig <bi...@cloudera.com>
Authored: Mon Apr 9 14:03:52 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Sat Apr 14 02:21:31 2018 +0000

----------------------------------------------------------------------
 be/src/codegen/llvm-codegen.cc              | 21 +++++++++------------
 be/src/codegen/llvm-codegen.h               | 18 ++++++++++--------
 be/src/exec/hdfs-avro-scanner.cc            |  2 --
 be/src/exec/hdfs-parquet-scanner.cc         |  1 -
 be/src/exec/hdfs-scanner.cc                 |  2 --
 be/src/exec/partitioned-aggregation-node.cc |  5 -----
 be/src/exec/select-node.cc                  |  1 -
 be/src/exec/text-converter.cc               |  1 -
 be/src/runtime/fragment-instance-state.cc   | 16 +++++++++++-----
 be/src/runtime/tuple.cc                     |  1 -
 be/src/util/tuple-row-compare.cc            |  1 -
 11 files changed, 30 insertions(+), 39 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/codegen/llvm-codegen.cc
----------------------------------------------------------------------
diff --git a/be/src/codegen/llvm-codegen.cc b/be/src/codegen/llvm-codegen.cc
index c8fd8eb..5d5ed15 100644
--- a/be/src/codegen/llvm-codegen.cc
+++ b/be/src/codegen/llvm-codegen.cc
@@ -201,11 +201,12 @@ LlvmCodeGen::LlvmCodeGen(RuntimeState* state, ObjectPool* pool,
   load_module_timer_ = ADD_TIMER(profile_, "LoadTime");
   prepare_module_timer_ = ADD_TIMER(profile_, "PrepareTime");
   module_bitcode_size_ = ADD_COUNTER(profile_, "ModuleBitcodeSize", TUnit::BYTES);
-  codegen_timer_ = ADD_TIMER(profile_, "CodegenTime");
+  ir_generation_timer_ = ADD_TIMER(profile_, "IrGenerationTime");
   optimization_timer_ = ADD_TIMER(profile_, "OptimizationTime");
   compile_timer_ = ADD_TIMER(profile_, "CompileTime");
   num_functions_ = ADD_COUNTER(profile_, "NumFunctions", TUnit::UNIT);
   num_instructions_ = ADD_COUNTER(profile_, "NumInstructions", TUnit::UNIT);
+  llvm_thread_counters_ = ADD_THREAD_COUNTERS(profile_, "Codegen");
 }
 
 Status LlvmCodeGen::CreateFromFile(RuntimeState* state, ObjectPool* pool,
@@ -213,6 +214,7 @@ Status LlvmCodeGen::CreateFromFile(RuntimeState* state, ObjectPool* pool,
     scoped_ptr<LlvmCodeGen>* codegen) {
   codegen->reset(new LlvmCodeGen(state, pool, parent_mem_tracker, id));
   SCOPED_TIMER((*codegen)->profile_->total_time_counter());
+  SCOPED_THREAD_COUNTER_MEASUREMENT((*codegen)->llvm_thread_counters());
 
   unique_ptr<llvm::Module> loaded_module;
   Status status = (*codegen)->LoadModuleFromFile(file, &loaded_module);
@@ -229,6 +231,8 @@ Status LlvmCodeGen::CreateFromMemory(RuntimeState* state, ObjectPool* pool,
     MemTracker* parent_mem_tracker, const string& id, scoped_ptr<LlvmCodeGen>* codegen) {
   codegen->reset(new LlvmCodeGen(state, pool, parent_mem_tracker, id));
   SCOPED_TIMER((*codegen)->profile_->total_time_counter());
+  SCOPED_TIMER((*codegen)->prepare_module_timer_);
+  SCOPED_THREAD_COUNTER_MEASUREMENT((*codegen)->llvm_thread_counters());
 
   // Select the appropriate IR version. We cannot use LLVM IR with SSE4.2 instructions on
   // a machine without SSE4.2 support.
@@ -282,7 +286,6 @@ Status LlvmCodeGen::LoadModuleFromFile(
 Status LlvmCodeGen::LoadModuleFromMemory(unique_ptr<llvm::MemoryBuffer> module_ir_buf,
     string module_name, unique_ptr<llvm::Module>* module) {
   DCHECK(!module_name.empty());
-  SCOPED_TIMER(prepare_module_timer_);
   COUNTER_ADD(module_bitcode_size_, module_ir_buf->getMemBufferRef().getBufferSize());
   llvm::Expected<unique_ptr<llvm::Module>> tmp_module =
       getOwningLazyBitcodeModule(move(module_ir_buf), context());
@@ -305,7 +308,6 @@ Status LlvmCodeGen::LoadModuleFromMemory(unique_ptr<llvm::MemoryBuffer> module_i
 
 // TODO: Create separate counters/timers (file size, load time) for each module linked
 Status LlvmCodeGen::LinkModuleFromLocalFs(const string& file) {
-  SCOPED_TIMER(profile_->total_time_counter());
   unique_ptr<llvm::Module> new_module;
   RETURN_IF_ERROR(LoadModuleFromFile(file, &new_module));
 
@@ -366,6 +368,7 @@ Status LlvmCodeGen::CreateImpalaCodegen(RuntimeState* state,
   // Parse module for cross compiled functions and types
   SCOPED_TIMER(codegen->profile_->total_time_counter());
   SCOPED_TIMER(codegen->prepare_module_timer_);
+  SCOPED_THREAD_COUNTER_MEASUREMENT(codegen->llvm_thread_counters_);
 
   // Get type for StringValue
   codegen->string_value_type_ = codegen->GetStructType<StringValue>();
@@ -621,7 +624,7 @@ void LlvmCodeGen::CreateIfElseBlocks(llvm::Function* fn, const string& if_name,
   *else_block = llvm::BasicBlock::Create(context(), else_name, fn, insert_before);
 }
 
-Status LlvmCodeGen::MaterializeFunctionHelper(llvm::Function* fn) {
+Status LlvmCodeGen::MaterializeFunction(llvm::Function* fn) {
   DCHECK(!is_compiled_);
   if (fn->isIntrinsic() || !fn->isMaterializable()) return Status::OK();
 
@@ -642,18 +645,12 @@ Status LlvmCodeGen::MaterializeFunctionHelper(llvm::Function* fn) {
     for (const string& callee : *callees) {
       llvm::Function* callee_fn = module_->getFunction(callee);
       DCHECK(callee_fn != nullptr);
-      RETURN_IF_ERROR(MaterializeFunctionHelper(callee_fn));
+      RETURN_IF_ERROR(MaterializeFunction(callee_fn));
     }
   }
   return Status::OK();
 }
 
-Status LlvmCodeGen::MaterializeFunction(llvm::Function* fn) {
-  SCOPED_TIMER(profile_->total_time_counter());
-  SCOPED_TIMER(prepare_module_timer_);
-  return MaterializeFunctionHelper(fn);
-}
-
 llvm::Function* LlvmCodeGen::GetFunction(const string& symbol, bool clone) {
   llvm::Function* fn = module_->getFunction(symbol.c_str());
   if (fn == NULL) {
@@ -1038,7 +1035,6 @@ Status LlvmCodeGen::MaterializeModule() {
 
 // It's okay to call this function even if the module has been materialized.
 Status LlvmCodeGen::FinalizeLazyMaterialization() {
-  SCOPED_TIMER(prepare_module_timer_);
   for (llvm::Function& fn : module_->functions()) {
     if (fn.isMaterializable()) {
       DCHECK(!module_->isMaterialized());
@@ -1078,6 +1074,7 @@ Status LlvmCodeGen::FinalizeModule() {
 
   if (is_corrupt_) return Status("Module is corrupt.");
   SCOPED_TIMER(profile_->total_time_counter());
+  SCOPED_THREAD_COUNTER_MEASUREMENT(llvm_thread_counters_);
 
   // Clean up handcrafted functions that have not been finalized. Clean up is done by
   // deleting the function from the module. Any reference to deleted functions in the

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/codegen/llvm-codegen.h
----------------------------------------------------------------------
diff --git a/be/src/codegen/llvm-codegen.h b/be/src/codegen/llvm-codegen.h
index 783269b..53569ca 100644
--- a/be/src/codegen/llvm-codegen.h
+++ b/be/src/codegen/llvm-codegen.h
@@ -169,7 +169,8 @@ class LlvmCodeGen {
   void Close();
 
   RuntimeProfile* runtime_profile() { return profile_; }
-  RuntimeProfile::Counter* codegen_timer() { return codegen_timer_; }
+  RuntimeProfile::Counter* ir_generation_timer() { return ir_generation_timer_; }
+  RuntimeProfile::ThreadCounters* llvm_thread_counters() { return llvm_thread_counters_; }
 
   /// Turns on/off optimization passes
   void EnableOptimizations(bool enable);
@@ -688,10 +689,6 @@ class LlvmCodeGen {
   /// This function parses the bitcode of 'fn' to populate basic blocks, instructions
   /// and other data structures attached to the function object. Return error status
   /// for any error.
-  Status MaterializeFunctionHelper(llvm::Function* fn);
-
-  /// Entry point for materializing function 'fn'. Invokes MaterializeFunctionHelper()
-  /// to do the actual work. Return error status for any error.
   Status MaterializeFunction(llvm::Function* fn);
 
   /// Materialize the module owned by this codegen object. This will materialize all
@@ -754,11 +751,12 @@ class LlvmCodeGen {
   /// Time spent reading the .ir file from the file system.
   RuntimeProfile::Counter* load_module_timer_;
 
-  /// Time spent constructing the in-memory module from the ir.
+  /// Time spent creating the initial module with the cross-compiled Impala IR.
   RuntimeProfile::Counter* prepare_module_timer_;
 
-  /// Time spent doing codegen (adding IR to the module)
-  RuntimeProfile::Counter* codegen_timer_;
+  /// Time spent by ExecNodes while adding IR to the module. Updated by
+  /// FragmentInstanceState during its 'CODEGEN_START' state.
+  RuntimeProfile::Counter* ir_generation_timer_;
 
   /// Time spent optimizing the module.
   RuntimeProfile::Counter* optimization_timer_;
@@ -774,6 +772,10 @@ class LlvmCodeGen {
   RuntimeProfile::Counter* num_functions_;
   RuntimeProfile::Counter* num_instructions_;
 
+  /// Aggregated llvm thread counters. Also includes the phase represented by
+  /// 'ir_generation_timer_' and hence is also updated by FragmentInstanceState.
+  RuntimeProfile::ThreadCounters* llvm_thread_counters_;
+
   /// whether or not optimizations are enabled
   bool optimizations_enabled_;
 

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/exec/hdfs-avro-scanner.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/hdfs-avro-scanner.cc b/be/src/exec/hdfs-avro-scanner.cc
index fe1bed4..e74b589 100644
--- a/be/src/exec/hdfs-avro-scanner.cc
+++ b/be/src/exec/hdfs-avro-scanner.cc
@@ -1062,8 +1062,6 @@ Status HdfsAvroScanner::CodegenReadScalar(const AvroSchemaElement& element,
 Status HdfsAvroScanner::CodegenDecodeAvroData(const HdfsScanNodeBase* node,
     LlvmCodeGen* codegen, const vector<ScalarExpr*>& conjuncts,
     llvm::Function** decode_avro_data_fn) {
-  SCOPED_TIMER(codegen->codegen_timer());
-
   llvm::Function* materialize_tuple_fn;
   RETURN_IF_ERROR(CodegenMaterializeTuple(node, codegen, &materialize_tuple_fn));
   DCHECK(materialize_tuple_fn != nullptr);

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/exec/hdfs-parquet-scanner.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/hdfs-parquet-scanner.cc b/be/src/exec/hdfs-parquet-scanner.cc
index ae22149..73dd29b 100644
--- a/be/src/exec/hdfs-parquet-scanner.cc
+++ b/be/src/exec/hdfs-parquet-scanner.cc
@@ -1019,7 +1019,6 @@ Status HdfsParquetScanner::Codegen(HdfsScanNodeBase* node,
   *process_scratch_batch_fn = nullptr;
   LlvmCodeGen* codegen = node->runtime_state()->codegen();
   DCHECK(codegen != nullptr);
-  SCOPED_TIMER(codegen->codegen_timer());
 
   llvm::Function* fn = codegen->GetFunction(IRFunction::PROCESS_SCRATCH_BATCH, true);
   DCHECK(fn != nullptr);

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/exec/hdfs-scanner.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/hdfs-scanner.cc b/be/src/exec/hdfs-scanner.cc
index a4aee4d..d191d3f 100644
--- a/be/src/exec/hdfs-scanner.cc
+++ b/be/src/exec/hdfs-scanner.cc
@@ -323,7 +323,6 @@ Status HdfsScanner::CodegenWriteCompleteTuple(const HdfsScanNodeBase* node,
     LlvmCodeGen* codegen, const vector<ScalarExpr*>& conjuncts,
     llvm::Function** write_complete_tuple_fn) {
   *write_complete_tuple_fn = NULL;
-  SCOPED_TIMER(codegen->codegen_timer());
   RuntimeState* state = node->runtime_state();
 
   // Cast away const-ness.  The codegen only sets the cached typed llvm struct.
@@ -531,7 +530,6 @@ Status HdfsScanner::CodegenWriteAlignedTuples(const HdfsScanNodeBase* node,
     LlvmCodeGen* codegen, llvm::Function* write_complete_tuple_fn,
     llvm::Function** write_aligned_tuples_fn) {
   *write_aligned_tuples_fn = NULL;
-  SCOPED_TIMER(codegen->codegen_timer());
   DCHECK(write_complete_tuple_fn != NULL);
 
   llvm::Function* write_tuples_fn =

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/exec/partitioned-aggregation-node.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/partitioned-aggregation-node.cc b/be/src/exec/partitioned-aggregation-node.cc
index c6c6189..d7b8c0a 100644
--- a/be/src/exec/partitioned-aggregation-node.cc
+++ b/be/src/exec/partitioned-aggregation-node.cc
@@ -1721,8 +1721,6 @@ Status PartitionedAggregationNode::CodegenCallUda(LlvmCodeGen* codegen,
 //
 Status PartitionedAggregationNode::CodegenUpdateTuple(
     LlvmCodeGen* codegen, llvm::Function** fn) {
-  SCOPED_TIMER(codegen->codegen_timer());
-
   for (const SlotDescriptor* slot_desc : intermediate_tuple_desc_->slots()) {
     if (slot_desc->type().type == TYPE_CHAR) {
       return Status::Expected("PartitionedAggregationNode::CodegenUpdateTuple(): cannot "
@@ -1811,8 +1809,6 @@ Status PartitionedAggregationNode::CodegenUpdateTuple(
 
 Status PartitionedAggregationNode::CodegenProcessBatch(LlvmCodeGen* codegen,
     TPrefetchMode::type prefetch_mode) {
-  SCOPED_TIMER(codegen->codegen_timer());
-
   llvm::Function* update_tuple_fn;
   RETURN_IF_ERROR(CodegenUpdateTuple(codegen, &update_tuple_fn));
 
@@ -1884,7 +1880,6 @@ Status PartitionedAggregationNode::CodegenProcessBatch(LlvmCodeGen* codegen,
 Status PartitionedAggregationNode::CodegenProcessBatchStreaming(
     LlvmCodeGen* codegen, TPrefetchMode::type prefetch_mode) {
   DCHECK(is_streaming_preagg_);
-  SCOPED_TIMER(codegen->codegen_timer());
 
   IRFunction::Type ir_fn = IRFunction::PART_AGG_NODE_PROCESS_BATCH_STREAMING;
   llvm::Function* process_batch_streaming_fn = codegen->GetFunction(ir_fn, true);

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/exec/select-node.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/select-node.cc b/be/src/exec/select-node.cc
index 0f0683b..df57db0 100644
--- a/be/src/exec/select-node.cc
+++ b/be/src/exec/select-node.cc
@@ -48,7 +48,6 @@ void SelectNode::Codegen(RuntimeState* state) {
   DCHECK(state->ShouldCodegen());
   ExecNode::Codegen(state);
   if (IsNodeCodegenDisabled()) return;
-  SCOPED_TIMER(state->codegen()->codegen_timer());
   Status codegen_status = CodegenCopyRows(state);
   runtime_profile()->AddCodegenMsg(codegen_status.ok(), codegen_status);
 }

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/exec/text-converter.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/text-converter.cc b/be/src/exec/text-converter.cc
index 9e919a3..783e384 100644
--- a/be/src/exec/text-converter.cc
+++ b/be/src/exec/text-converter.cc
@@ -112,7 +112,6 @@ Status TextConverter::CodegenWriteSlot(LlvmCodeGen* codegen,
     return Status("TextConverter::CodegenWriteSlot(): Char isn't supported for"
         " CodegenWriteSlot");
   }
-  SCOPED_TIMER(codegen->codegen_timer());
 
   // Codegen is_null_string
   bool is_default_null = (len == 2 && null_col_val[0] == '\\' && null_col_val[1] == 'N');

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/runtime/fragment-instance-state.cc
----------------------------------------------------------------------
diff --git a/be/src/runtime/fragment-instance-state.cc b/be/src/runtime/fragment-instance-state.cc
index 7322519..1a0d452 100644
--- a/be/src/runtime/fragment-instance-state.cc
+++ b/be/src/runtime/fragment-instance-state.cc
@@ -244,11 +244,17 @@ Status FragmentInstanceState::Open() {
   if (runtime_state_->ShouldCodegen()) {
     UpdateState(StateEvent::CODEGEN_START);
     RETURN_IF_ERROR(runtime_state_->CreateCodegen());
-    exec_tree_->Codegen(runtime_state_);
-    // It shouldn't be fatal to fail codegen. However, until IMPALA-4233 is fixed,
-    // ScalarFnCall has no fall back to interpretation when codegen fails so propagates
-    // the error status for now.
-    RETURN_IF_ERROR(runtime_state_->CodegenScalarFns());
+    {
+      SCOPED_TIMER(runtime_state_->codegen()->ir_generation_timer());
+      SCOPED_TIMER(runtime_state_->codegen()->runtime_profile()->total_time_counter());
+      SCOPED_THREAD_COUNTER_MEASUREMENT(
+          runtime_state_->codegen()->llvm_thread_counters());
+      exec_tree_->Codegen(runtime_state_);
+      // It shouldn't be fatal to fail codegen. However, until IMPALA-4233 is fixed,
+      // ScalarFnCall has no fall back to interpretation when codegen fails so propagates
+      // the error status for now.
+      RETURN_IF_ERROR(runtime_state_->CodegenScalarFns());
+    }
 
     LlvmCodeGen* codegen = runtime_state_->codegen();
     DCHECK(codegen != nullptr);

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/runtime/tuple.cc
----------------------------------------------------------------------
diff --git a/be/src/runtime/tuple.cc b/be/src/runtime/tuple.cc
index 627c7a4..0061419 100644
--- a/be/src/runtime/tuple.cc
+++ b/be/src/runtime/tuple.cc
@@ -311,7 +311,6 @@ Status Tuple::CodegenMaterializeExprs(LlvmCodeGen* codegen, bool collect_string_
   if (collect_string_vals) {
     return Status("CodegenMaterializeExprs() collect_string_vals == true NYI");
   }
-  SCOPED_TIMER(codegen->codegen_timer());
   llvm::LLVMContext& context = codegen->context();
 
   // Codegen each compute function from slot_materialize_exprs

http://git-wip-us.apache.org/repos/asf/impala/blob/fef527d6/be/src/util/tuple-row-compare.cc
----------------------------------------------------------------------
diff --git a/be/src/util/tuple-row-compare.cc b/be/src/util/tuple-row-compare.cc
index 5620dae..f05a88e 100644
--- a/be/src/util/tuple-row-compare.cc
+++ b/be/src/util/tuple-row-compare.cc
@@ -203,7 +203,6 @@ Status TupleRowComparator::Codegen(RuntimeState* state) {
 //   ret i32 0
 // }
 Status TupleRowComparator::CodegenCompare(LlvmCodeGen* codegen, llvm::Function** fn) {
-  SCOPED_TIMER(codegen->codegen_timer());
   llvm::LLVMContext& context = codegen->context();
   const vector<ScalarExpr*>& ordering_exprs = ordering_exprs_;
   llvm::Function* key_fns[ordering_exprs.size()];


[4/7] impala git commit: IMPALA-6850: Print actual error message on Sentry error

Posted by jo...@apache.org.
IMPALA-6850: Print actual error message on Sentry error

The patch puts the output of Sentry to
$IMPALA_CLUSTER_LOGS_DIR/sentry/sentry.out to follow the
same convention as other service output logs.

Testing:
- Injected a failure into the run-sentry-service.sh script to verify
  that the error message was captured
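
For illustration, the redirect-and-background pattern the script change uses
("> ${LOGDIR}/sentry.out 2>&1 &") looks like this as a minimal POSIX C++
sketch; the paths and the echoed command here are hypothetical, not the
project's actual layout:

// Sketch: launch a child process with stdout/stderr redirected to a
// per-service log file, then leave it running in the background.
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
  mkdir("logs", 0755);  // like: mkdir -p "${LOGDIR}" || true
  pid_t pid = fork();
  if (pid == 0) {  // child: becomes the service process
    int fd = open("logs/sentry.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { std::perror("open"); _exit(1); }
    dup2(fd, STDOUT_FILENO);  // > logs/sentry.out
    dup2(fd, STDERR_FILENO);  // 2>&1
    close(fd);
    execlp("echo", "echo", "service output goes to the log file", (char*)nullptr);
    _exit(127);  // only reached if exec fails
  }
  return pid > 0 ? 0 : 1;  // parent returns; child keeps running, like '&'
}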

Change-Id: I76627bb5b986a548ec6e4f12b555bd6fc8c4dab8
Reviewed-on: http://gerrit.cloudera.org:8080/10064
Reviewed-by: Vuk Ercegovac <ve...@cloudera.com>
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/51cf5b27
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/51cf5b27
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/51cf5b27

Branch: refs/heads/master
Commit: 51cf5b27fc3a5f5c7965d1bf88aebe2a6132b538
Parents: b4228df
Author: Fredy Wijaya <fw...@cloudera.com>
Authored: Fri Apr 13 13:46:56 2018 -0500
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Sat Apr 14 01:41:38 2018 +0000

----------------------------------------------------------------------
 testdata/bin/run-all.sh            | 8 ++++----
 testdata/bin/run-sentry-service.sh | 5 ++++-
 2 files changed, 8 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/51cf5b27/testdata/bin/run-all.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/run-all.sh b/testdata/bin/run-all.sh
index f722b89..6820e5d 100755
--- a/testdata/bin/run-all.sh
+++ b/testdata/bin/run-all.sh
@@ -57,8 +57,8 @@ if [[ ${DEFAULT_FS} == "hdfs://localhost:20500" ]]; then
       tee ${IMPALA_CLUSTER_LOGS_DIR}/run-hive-server.log
 
   echo " --> Starting the Sentry Policy Server"
-  $IMPALA_HOME/testdata/bin/run-sentry-service.sh > \
-      ${IMPALA_CLUSTER_LOGS_DIR}/run-sentry-service.log 2>&1
+  $IMPALA_HOME/testdata/bin/run-sentry-service.sh 2>&1 | \
+      tee ${IMPALA_CLUSTER_LOGS_DIR}/run-sentry-service.log
 
 elif [[ ${DEFAULT_FS} == "${LOCAL_FS}" ]]; then
   # When the local file system is used as default, we only start the Hive metastore.
@@ -80,6 +80,6 @@ else
       tee ${IMPALA_CLUSTER_LOGS_DIR}/run-hive-server.log
 
   echo " --> Starting the Sentry Policy Server"
-  $IMPALA_HOME/testdata/bin/run-sentry-service.sh > \
-      ${IMPALA_CLUSTER_LOGS_DIR}/run-sentry-service.log 2>&1
+  $IMPALA_HOME/testdata/bin/run-sentry-service.sh 2>&1 | \
+      tee ${IMPALA_CLUSTER_LOGS_DIR}/run-sentry-service.log
 fi

http://git-wip-us.apache.org/repos/asf/impala/blob/51cf5b27/testdata/bin/run-sentry-service.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/run-sentry-service.sh b/testdata/bin/run-sentry-service.sh
index cb6de28..755c382 100755
--- a/testdata/bin/run-sentry-service.sh
+++ b/testdata/bin/run-sentry-service.sh
@@ -23,6 +23,9 @@ trap 'echo Error in $0 at line $LINENO: $(cd "'$PWD'" && awk "NR == $LINENO" $0)
 . ${IMPALA_HOME}/bin/set-classpath.sh
 
 SENTRY_SERVICE_CONFIG=${SENTRY_CONF_DIR}/sentry-site.xml
+LOGDIR="${IMPALA_CLUSTER_LOGS_DIR}"/sentry
+
+mkdir -p "${LOGDIR}" || true
 
 # First kill any running instances of the service.
 $IMPALA_HOME/testdata/bin/kill-sentry-service.sh
@@ -30,7 +33,7 @@ $IMPALA_HOME/testdata/bin/kill-sentry-service.sh
 # Sentry picks up JARs from the HADOOP_CLASSPATH and not the CLASSPATH.
 export HADOOP_CLASSPATH=${POSTGRES_JDBC_DRIVER}
 # Start the service.
-${SENTRY_HOME}/bin/sentry --command service -c ${SENTRY_SERVICE_CONFIG} &
+${SENTRY_HOME}/bin/sentry --command service -c ${SENTRY_SERVICE_CONFIG} > "${LOGDIR}"/sentry.out 2>&1 &
 
 # Wait for the service to come online
 "$JAVA" -cp $CLASSPATH org.apache.impala.testutil.SentryServicePinger \


[6/7] impala git commit: IMPALA-6483: [DOCS] Document the new EXEC_TIME_LIMIT_S query option

Posted by jo...@apache.org.
IMPALA-6483: [DOCS] Document the new EXEC_TIME_LIMIT_S query option

Change-Id: I7a83aa42e6fffc7cb71112936129a0f1917ec2fd
Reviewed-on: http://gerrit.cloudera.org:8080/10043
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/e53bf279
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/e53bf279
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/e53bf279

Branch: refs/heads/master
Commit: e53bf279b07bd66c4371185eaec1f9a33261bf30
Parents: fef527d
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Wed Apr 11 20:20:00 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Sat Apr 14 16:32:55 2018 +0000

----------------------------------------------------------------------
 docs/impala.ditamap                      |  1 +
 docs/impala_keydefs.ditamap              |  1 +
 docs/shared/impala_common.xml            |  8 +++
 docs/topics/impala_exec_time_limit_s.xml | 93 +++++++++++++++++++++++++++
 4 files changed, 103 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/e53bf279/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index e6c9c09..08b69ca 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -186,6 +186,7 @@ under the License.
           <topicref rev="2.5.0" href="topics/impala_disable_streaming_preaggregations.xml"/>
           <topicref href="topics/impala_disable_unsafe_spills.xml"/>
           <topicref href="topics/impala_exec_single_node_rows_threshold.xml"/>
+          <topicref href="topics/impala_exec_time_limit_s.xml"/>
           <topicref href="topics/impala_explain_level.xml"/>
           <topicref href="topics/impala_hbase_cache_blocks.xml"/>
           <topicref href="topics/impala_hbase_caching.xml"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/e53bf279/docs/impala_keydefs.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap
index 0a2ac82..fb60e14 100644
--- a/docs/impala_keydefs.ditamap
+++ b/docs/impala_keydefs.ditamap
@@ -10783,6 +10783,7 @@ under the License.
   <keydef href="topics/impala_disable_streaming_preaggregations.xml" keys="disable_streaming_preaggregations"/>
   <keydef href="topics/impala_disable_unsafe_spills.xml" keys="disable_unsafe_spills"/>
   <keydef href="topics/impala_exec_single_node_rows_threshold.xml" keys="exec_single_node_rows_threshold"/>
+  <keydef href="topics/impala_exec_time_limit_s.xml" keys="exec_time_limit_s"/>
   <keydef href="topics/impala_explain_level.xml" keys="explain_level"/>
   <keydef href="topics/impala_hbase_cache_blocks.xml" keys="hbase_cache_blocks"/>
   <keydef href="topics/impala_hbase_caching.xml" keys="hbase_caching"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/e53bf279/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index 192e766..12d7d4e 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -2901,6 +2901,14 @@ flight_num:           INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301
         <b>Internal details:</b> Represented in memory as a byte array with the minimum size needed to represent
         each value.
       </p>
+      <p rev="3.0" id="added_in_30">
+        <b>Added in:</b>
+        <keyword keyref="impala30_full"/>
+      </p>
+      <p rev="2.12.0" id="added_in_212">
+        <b>Added in:</b>
+        <keyword keyref="impala212_full"/>
+      </p>
 
       <p rev="2.11.0" id="added_in_2110">
         <b>Added in:</b> <keyword keyref="impala2_11_0"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/e53bf279/docs/topics/impala_exec_time_limit_s.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_exec_time_limit_s.xml b/docs/topics/impala_exec_time_limit_s.xml
new file mode 100644
index 0000000..a0320b8
--- /dev/null
+++ b/docs/topics/impala_exec_time_limit_s.xml
@@ -0,0 +1,93 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.12.0" id="exec_time_limit_s">
+
+  <title>EXEC_TIME_LIMIT_S Query Option (<keyword keyref="impala212_full"/> or higher only)</title>
+
+  <titlealts audience="PDF">
+
+    <navtitle>EXEC_TIME_LIMIT_S</navtitle>
+
+  </titlealts>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Querying"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p rev="2.12.0">
+      The <codeph>EXEC_TIME_LIMIT_S</codeph> query option sets a time limit on query execution.
+      If a query is still executing when the time limit expires, it is automatically canceled. The
+      option is intended to prevent runaway queries that execute for much longer than expected.
+    </p>
+
+    <p>
+      For example, an Impala administrator could set a default value of
+      <codeph>EXEC_TIME_LIMIT_S=3600</codeph> for a resource pool to automatically kill queries
+      that execute for longer than one hour (see
+      <xref href="impala_admission.xml#admission_control"/> for information about default query
+      options). Then, if a user accidentally runs a large query that executes for more than one
+      hour, it will be automatically killed after the time limit expires to free up resources.
+      Users can override the default value per query or per session if they do not want the
+      default <codeph>EXEC_TIME_LIMIT_S</codeph> value to apply to a specific query or a
+      session.
+    </p>
+
+    <note>
+      <p>
+        The time limit only starts once the query is executing. Time spent planning the query,
+        scheduling the query, or in admission control is not counted towards the execution time
+        limit. <codeph>SELECT</codeph> statements are eligible for automatic cancellation until
+        the client has fetched all result rows. DML queries are eligible for automatic
+        cancellation until the DML statement has finished.
+      </p>
+    </note>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>SET EXEC_TIME_LIMIT_S=<varname>seconds</varname>;</codeblock>
+
+    <p>
+      <b>Type:</b> numeric
+    </p>
+
+    <p>
+      <b>Default:</b> 0 (no time limit)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/added_in_212"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_timeouts.xml#timeouts"/>
+    </p>
+
+  </conbody>
+
+</concept>
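
To make the cancellation semantics concrete, here is a small self-contained
C++ sketch of an execution time limit. It is conceptual only (hypothetical
names, not Impala's coordinator code): the clock starts when execution
starts, and work that is still running when the limit expires is cancelled.

// build: g++ -std=c++11 -pthread time_limit.cc
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> cancelled{false};

void RunQuery() {
  // Simulated execution: check the cancellation flag between "row batches".
  for (int batch = 0; batch < 100 && !cancelled.load(); ++batch) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  std::cout << (cancelled ? "query cancelled by time limit\n"
                          : "query finished\n");
}

int main() {
  const auto exec_time_limit = std::chrono::seconds(2);  // like EXEC_TIME_LIMIT_S=2

  // The clock starts when execution starts, not during planning or admission.
  std::thread query(RunQuery);
  std::thread watchdog([exec_time_limit] {
    std::this_thread::sleep_for(exec_time_limit);
    cancelled.store(true);  // has no effect if the query already finished
  });

  query.join();
  watchdog.join();
  return 0;
}

A real engine checks such a flag cooperatively, for example between row
batches, which is why cancellation takes effect shortly after the limit
rather than at the exact instant it expires.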