Posted to commits@impala.apache.org by mi...@apache.org on 2018/05/01 15:45:03 UTC

[2/4] impala git commit: IMPALA-6070: Further improvements to test-with-docker.

IMPALA-6070: Further improvements to test-with-docker.

This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.

Bug fixes:

* Embarrassingly, I was still skipping thrift-server-test in the backend
  tests. This was a mistake in handling feedback from my last review.

* I made the timeline a bit taller so that it clips less.

Adding workloads:

* I added the RAT licensing check.

* I added exhaustive runs. This led me to model the suites a little
  more in Python, with a class representing each suite and the data
  that describes it (see the sketch after this list). It's not
  perfect and still coupled with the entrypoint.sh shell script, but
  it feels workable. As part of adding exhaustive tests, I had to
  re-work the timeout handling, since different suites now
  meaningfully have different timeouts.

Speed ups:

* To speed up test runs, I added a mechanism to split py.test suites
  into multiple shards with a py.test argument (sketched after this
  list). This involved a little work in conftest.py, and exposing
  $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.

  Furthermore, I moved a bit more logic about managing the
  list of suites into Python.

* I now do the full build with "-notests" and build the backend tests
  only in the suite that needs them. This speeds up "docker commit"
  significantly by removing about 20GB from the container. I had to
  declare that expr-codegen-test depends on expr-codegen-test-ir, a
  dependency that was previously missing.

* I sped up copying the Kudu data: previously I did both a move and a
  copy; now I do a move followed by a move (sketched after this
  list). One of the moves is cross-filesystem and therefore slow, but
  this halves the amount of copying.
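
  The shard selection itself is hash-bucketing of test ids. A minimal
  sketch, assuming a plain list of test ids (the real hook in
  tests/conftest.py below operates on py.test's collected items and
  also reports the deselected ones):

    def select_shard(test_ids, shard_spec):
        # shard_spec is "N/M" (1-indexed); N == M wraps around to
        # bucket 0, so every value from 1/M to M/M names a bucket.
        this_shard, num_shards = map(int, shard_spec.split("/"))
        if this_shard == num_shards:
            this_shard = 0
        return [t for t in test_ids if hash(t) % num_shards == this_shard]

    # Every test lands in exactly one of the two shards:
    print(select_shard(["a", "b", "c", "d"], "1/2"))
    print(select_shard(["a", "b", "c", "d"], "2/2"))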
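
  In Python terms, the move-followed-by-move looks roughly like this
  (a sketch, assuming shutil.move falls back to a copy when rename(2)
  fails across layers; the committed version is shell in
  docker/entrypoint.sh below):

    import os
    import shutil

    def refresh_wal_dir(path):
        # Move the WAL directory aside; on aufs/overlayfs this is the
        # slow step, since the data is copied up into the top layer.
        orig = path + "-orig"
        shutil.move(path, orig)
        # Recreate the directory and move the entries back. The fresh
        # directory lives entirely in one layer, so Kudu's later
        # rename(2) calls on it succeed.
        os.mkdir(path)
        for entry in os.listdir(orig):
            shutil.move(os.path.join(orig, entry), path)
        os.rmdir(orig)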

Memory usage:

* I tweaked the memlimit_gb settings to have a higher default. I've
  been tuning this empirically to get the tests to run well on
  c4.8xlarge and m4.10xlarge instances.

The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with
a new "memory_usage" function that uses a ps/sort/awk trick, sketched at the
end of this section) and by observing peak container total_rss, I found
several JVMs that didn't have Xmx set. I added Xms/Xmx settings in a few
places:

 * The impalads other than the first do very little JVM work, so
   setting an Xmx keeps them small, even in the parallel tests.
 * Datanodes do real work, but they were essentially never garbage
   collecting, because the JVM default lets them use up to 1/4 of
   the machine's memory. (I observed this based on RSS at the
   end of the run; nothing fancier.) Adding Xms/Xmx settings
   helped.
 * Similarly, I piped the settings through to HBase.

A few daemons still run without resource limitations, but they don't
seem to be a problem.
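
For reference, the "memory_usage" trick is just a summary over ps
output; a rough Python equivalent (the committed version is a
ps/sed/sort/awk pipeline in docker/entrypoint.sh below):

  import subprocess

  def memory_usage(top_n=20):
      # Print the top-N RSS consumers in MB, plus a total. Long-lived
      # daemons rarely return memory to the OS, so end-of-run RSS is a
      # decent proxy for their peak usage.
      out = subprocess.check_output(["ps", "-axho", "rss,args"])
      rows = []
      for line in out.decode("utf-8", "replace").splitlines():
          rss_kb, _, args = line.strip().partition(" ")
          rows.append((int(rss_kb) / 1024.0, args))
      rows.sort(reverse=True)
      for mb, args in rows[:top_n]:
          print("%8.1f  %s" % (mb, args))
      print("%8.1f  -- total --" % sum(mb for mb, _ in rows))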

Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/10248
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/d733ea68
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/d733ea68
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/d733ea68

Branch: refs/heads/2.x
Commit: d733ea68ca144798ff67054faf577dfdae0f201e
Parents: 1dc6dc3
Author: Philip Zeyliger <ph...@cloudera.com>
Authored: Fri Apr 6 10:16:39 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Tue May 1 02:46:38 2018 +0000

----------------------------------------------------------------------
 be/src/exprs/CMakeLists.txt                     |   1 +
 bin/run-all-tests.sh                            |   5 +-
 docker/entrypoint.sh                            | 173 +++++++----
 docker/monitor.py                               |  29 +-
 docker/test-with-docker.py                      | 296 ++++++++++++++-----
 docker/timeline.html.template                   |   9 +-
 testdata/bin/run-hbase.sh                       |   1 +
 .../common/etc/init.d/hdfs-common               |   7 +
 tests/conftest.py                               |  35 +++
 tests/run-tests.py                              |  19 +-
 10 files changed, 426 insertions(+), 149 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/be/src/exprs/CMakeLists.txt
----------------------------------------------------------------------
diff --git a/be/src/exprs/CMakeLists.txt b/be/src/exprs/CMakeLists.txt
index cff391c..755c166 100644
--- a/be/src/exprs/CMakeLists.txt
+++ b/be/src/exprs/CMakeLists.txt
@@ -72,5 +72,6 @@ ADD_BE_TEST(expr-codegen-test)
 # expr-codegen-test includes test IR functions
 COMPILE_TO_IR(expr-codegen-test.cc)
 add_dependencies(expr-codegen-test-ir gen-deps)
+add_dependencies(expr-codegen-test expr-codegen-test-ir)
 
 ADD_UDF_TEST(aggregate-functions-test)

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/bin/run-all-tests.sh
----------------------------------------------------------------------
diff --git a/bin/run-all-tests.sh b/bin/run-all-tests.sh
index 4743a4f..7702134 100755
--- a/bin/run-all-tests.sh
+++ b/bin/run-all-tests.sh
@@ -57,6 +57,8 @@ fi
 : ${TEST_START_CLUSTER_ARGS:=}
 # Extra args to pass to run-tests.py
 : ${RUN_TESTS_ARGS:=}
+# Extra args to pass to run-custom-cluster-tests.sh
+: ${RUN_CUSTOM_CLUSTER_TESTS_ARGS:=}
 if [[ "${TARGET_FILESYSTEM}" == "local" ]]; then
   # TODO: Remove abort_on_config_error flag from here and create-load-data.sh once
   # checkConfiguration() accepts the local filesystem (see IMPALA-1850).
@@ -223,7 +225,8 @@ do
     # Run the custom-cluster tests after all other tests, since they will restart the
     # cluster repeatedly and lose state.
     # TODO: Consider moving in to run-tests.py.
-    if ! "${IMPALA_HOME}/tests/run-custom-cluster-tests.sh" ${COMMON_PYTEST_ARGS}; then
+    if ! "${IMPALA_HOME}/tests/run-custom-cluster-tests.sh" ${COMMON_PYTEST_ARGS} \
+        ${RUN_CUSTOM_CLUSTER_TESTS_ARGS}; then
       TEST_RET_CODE=1
     fi
     export IMPALA_MAX_LOG_FILES="${IMPALA_MAX_LOG_FILES_SAVE}"

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/docker/entrypoint.sh
----------------------------------------------------------------------
diff --git a/docker/entrypoint.sh b/docker/entrypoint.sh
index d371d25..00b2ee6 100755
--- a/docker/entrypoint.sh
+++ b/docker/entrypoint.sh
@@ -90,12 +90,6 @@ function impala_environment() {
 function boot_container() {
   pushd /home/impdev/Impala
 
-  # Required for metastore
-  sudo service postgresql start
-
-  # Required for starting HBase
-  sudo service ssh start
-
   # Make log directories. This is typically done in buildall.sh.
   mkdir -p logs/be_tests logs/fe_tests/coverage logs/ee_tests logs/custom_cluster_tests
 
@@ -112,17 +106,37 @@ function boot_container() {
   echo Hosts file:
   cat /etc/hosts
 
-  # Make a copy of Kudu's WALs to avoid isue with Docker filesystems (aufs and
+  popd
+}
+
+function start_minicluster {
+  # The subshell here avoids the verbose output from set -x.
+  (echo ">>> Starting PostgreSQL and SSH") 2> /dev/null
+  pushd /home/impdev/Impala
+
+  # Required for metastore
+  sudo service postgresql start
+
+  # Required for starting HBase
+  sudo service ssh start
+
+  (echo ">>> Copying Kudu Data") 2> /dev/null
+  # Move around Kudu's WALs to avoid issue with Docker filesystems (aufs and
   # overlayfs) that don't support os.rename(2) on directories, which Kudu
   # requires. We make a fresh copy of the data, in which case rename(2) works
   # presumably because there's only one layer involved. See
   # https://issues.apache.org/jira/browse/KUDU-1419.
-  cd /home/impdev/Impala/testdata
+  set -x
+  pushd /home/impdev/Impala/testdata
   for x in cluster/cdh*/node-*/var/lib/kudu/*/wal; do
+    echo $x
+    # This mv takes time, as it's actually copying into the latest layer.
     mv $x $x-orig
-    cp -r $x-orig $x
-    rm -r $x-orig
+    mkdir $x
+    mv $x-orig/* $x
+    rmdir $x-orig
   done
+  popd
 
   # Wait for postgresql to really start; if it doesn't, Hive Metastore will fail to start.
   for i in {1..120}; do
@@ -135,6 +149,9 @@ function boot_container() {
   done
   sudo -u postgres psql -c "select 1"
 
+  (echo ">>> Starting mini cluster") 2> /dev/null
+  testdata/bin/run-all.sh
+
   popd
 }
 
@@ -164,8 +181,13 @@ function build_impdev() {
 
   # Builds Impala and loads test data.
   # Note that IMPALA-6494 prevents us from using shared library builds,
-  # which are smaller and thereby speed things up.
-  ./buildall.sh -noclean -format -testdata -skiptests
+  # which are smaller and thereby speed things up. We use "-notests"
+  # to avoid building backend tests, which are sizable, and
+  # can be built when executing those tests.
+  ./buildall.sh -noclean -format -testdata -notests
+
+  # Dump current memory usage to logs, before shutting things down.
+  memory_usage
 
   # Shut down things cleanly.
   testdata/bin/kill-all.sh
@@ -176,9 +198,34 @@ function build_impdev() {
   # Clean up things we don't need to reduce image size
   find be -name '*.o' -execdir rm '{}' + # ~1.6GB
 
+  # Clean up dangling symlinks. These (typically "cluster/cdh*-node-*")
+  # may point to something inside a container that no longer exists
+  # and can confuse Jenkins.
+  find /logs -xtype l -execdir rm '{}' ';'
+
   popd
 }
 
+# Prints top 20 RSS consumers (and other, total), in megabytes. Common culprits
+# are Java processes without Xmx set. Since most things don't reclaim memory,
+# this is a decent proxy for peak memory usage by long-lived processes.
+function memory_usage() {
+  (
+  echo "Top 20 memory consumers (RSS in MBs)"
+  sudo ps -axho rss,args | \
+    sed -e 's/^ *//' | \
+    sed -e 's, ,\t,' | \
+    sort -nr | \
+    awk -F'\t' '
+    FNR < 20 { print $1/1024.0, $2; total += $1/1024.0 }
+    FNR >= 20 { other+= $1/1024.0; total += $1/1024.0 }
+    END {
+      if (other) { print other, "-- other --" };
+      print total, "-- total --"
+    }'
+  ) >& /logs/memory_usage.txt
+}
+
 # Runs a suite passed in as the first argument. Tightly
 # coupled with Impala's run-all-tests and the suite names.
 # from test-with-docker.py.
@@ -189,6 +236,8 @@ function test_suite() {
 
   # These test suites are for testing.
   if [[ $1 == NOOP ]]; then
+    # Sleep busily for 10 seconds.
+    bash -c 'while [[ $SECONDS -lt 10 ]]; do :; done'
     return 0
   fi
   if [[ $1 == NOOP_FAIL ]]; then
@@ -208,31 +257,9 @@ function test_suite() {
   boot_container
   impala_environment
 
-  # By default, the JVM will use 1/4 of your OS memory for its heap size. For a
-  # long-running test, this will delay GC inside of impalad's leading to
-  # unnecessarily large process RSS footprints. We cap the heap size at
-  # a more reasonable size.  Note that "test_insert_large_string" fails
-  # at 2g and 3g, so the suite that includes it (EE_TEST_PARALLEL) gets
-  # additional memory.
-  #
-  # Similarly, bin/start-impala-cluster typically configures the memlimit
-  # to be 80% of the machine memory, divided by the number of daemons.
-  # If multiple containers are to be run simultaneously, this is scaled
-  # down in test-with-docker.py (and further configurable with --impalad-mem-limit-bytes)
-  # and passed in via $IMPALAD_MEM_LIMIT_BYTES to the container. There is a
-  # relationship between the number of parallel tests that can be run by py.test and this
-  # limit.
-  JVM_HEAP_GB=2
-  if [[ $1 = EE_TEST_PARALLEL ]]; then
-    JVM_HEAP_GB=4
-  fi
-  export TEST_START_CLUSTER_ARGS="--jvm_args=-Xmx${JVM_HEAP_GB}g \
-    --impalad_args=--mem_limit=$IMPALAD_MEM_LIMIT_BYTES"
-
   # BE tests don't require the minicluster, so we can run them directly.
   if [[ $1 = BE_TEST ]]; then
-    # IMPALA-6494: thrift-server-test fails in Ubuntu16.04 for the moment; skip it.
-    export SKIP_BE_TEST_PATTERN='thrift-server-test*'
+    make -j$(nproc) --load-average=$(nproc) be-test be-benchmarks
     if ! bin/run-backend-tests.sh; then
       echo "Tests $1 failed!"
       return 1
@@ -242,37 +269,73 @@ function test_suite() {
     fi
   fi
 
+  if [[ $1 == RAT_CHECK ]]; then
+    # Runs Apache RAT (a license checker)
+    git archive --prefix=rat/ -o rat-impala.zip HEAD
+    wget --quiet https://archive.apache.org/dist/creadur/apache-rat-0.12/apache-rat-0.12-bin.tar.gz
+    tar xzf apache-rat-0.12-bin.tar.gz
+    java -jar apache-rat-0.12/apache-rat-0.12.jar -x rat-impala.zip > logs/rat.xml
+    bin/check-rat-report.py bin/rat_exclude_files.txt logs/rat.xml
+    return $?
+  fi
+
   # Start the minicluster
-  testdata/bin/run-all.sh
+  start_minicluster
 
-  export MAX_PYTEST_FAILURES=0
-  # Choose which suite to run; this is how run-all.sh chooses between them.
-  export FE_TEST=false
-  export BE_TEST=false
-  export EE_TEST=false
-  export JDBC_TEST=false
-  export CLUSTER_TEST=false
-
-  eval "export ${1}=true"
-
-  if [[ ${1} = "EE_TEST_SERIAL" ]]; then
-    # We bucket the stress tests with the parallel tests.
-    export RUN_TESTS_ARGS="--skip-parallel --skip-stress"
-    export EE_TEST=true
-  elif [[ ${1} = "EE_TEST_PARALLEL" ]]; then
-    export RUN_TESTS_ARGS="--skip-serial"
-    export EE_TEST=true
+  # By default, the JVM will use 1/4 of your OS memory for its heap size. For a
+  # long-running test, this will delay GC inside of impalad's leading to
+  # unnecessarily large process RSS footprints. To combat this, we
+  # set a small initial heap size, and then cap it at a more reasonable
+  # size. The small initial heap sizes help for daemons that do little
+  # in the way of JVM work (e.g., the 2nd and 3rd impalad's).
+  # Note that "test_insert_large_string" fails at 2g and 3g, so the suite that
+  # includes it (EE_TEST_PARALLEL) gets additional memory.
+
+  # Note that we avoid using TEST_START_CLUSTER_ARGS="--jvm-args=..."
+  # because it gets flattened along the way if we need to provide
+  # more than one Java argument. We use JAVA_TOOL_OPTIONS instead.
+  JVM_HEAP_MAX_GB=2
+  if [[ $1 = EE_TEST_PARALLEL ]]; then
+    JVM_HEAP_MAX_GB=4
+  elif [[ $1 = EE_TEST_PARALLEL_EXHAUSTIVE ]]; then
+    JVM_HEAP_MAX_GB=8
   fi
+  JAVA_TOOL_OPTIONS="-Xms512M -Xmx${JVM_HEAP_MAX_GB}G"
+
+  # Similarly, bin/start-impala-cluster typically configures the memlimit
+  # to be 80% of the machine memory, divided by the number of daemons.
+  # If multiple containers are to be run simultaneously, this is scaled
+  # down in test-with-docker.py (and further configurable with --impalad-mem-limit-bytes)
+  # and passed in via $IMPALAD_MEM_LIMIT_BYTES to the container. There is a
+  # relationship between the number of parallel tests that can be run by py.test and this
+  # limit.
+  export TEST_START_CLUSTER_ARGS="--impalad_args=--mem_limit=$IMPALAD_MEM_LIMIT_BYTES"
+
+  export MAX_PYTEST_FAILURES=0
+
+  # Asserting that these are all set (to either true or false, as strings).
+  # This is how run-all.sh chooses between them.
+  [[ $FE_TEST && $BE_TEST && $EE_TEST && $JDBC_TEST && $CLUSTER_TEST ]]
 
   ret=0
 
+  if [[ ${EE_TEST} = true ]]; then
+    # test_insert_parquet.py depends on this binary
+    make -j$(nproc) --load-average=$(nproc) parquet-reader
+  fi
+
   # Run tests.
-  if ! time -p bin/run-all-tests.sh; then
+  (echo ">>> $1: Starting run-all-test") 2> /dev/null
+  if ! time -p bash -x bin/run-all-tests.sh; then
     ret=1
     echo "Tests $1 failed!"
   else
     echo "Tests $1 succeeded!"
   fi
+
+  # Save memory usage after tests have run but before shutting down the cluster.
+  memory_usage || true
+
   # Oddly, I've observed bash fail to exit (and wind down the container),
   # leading to test-with-docker.py hitting a timeout. Killing the minicluster
   # daemons fixes this.
@@ -312,6 +375,8 @@ function main() {
   shift
 
   echo ">>> ${CMD} $@ (begin)"
+  # Dump environment, for debugging
+  env | grep -vE "AWS_(SECRET_)?ACCESS_KEY"
   set -x
   if "${CMD}" "$@"; then
     ret=0

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/docker/monitor.py
----------------------------------------------------------------------
diff --git a/docker/monitor.py b/docker/monitor.py
index 64cde0c..58e2a83 100644
--- a/docker/monitor.py
+++ b/docker/monitor.py
@@ -232,28 +232,37 @@ class Timeline(object):
         if self.interesting_re.search(line)]
     return [(container.name,) + split_timestamp(line) for line in interesting_lines]
 
-  @staticmethod
-  def parse_metrics(f):
+  def parse_metrics(self, f):
     """Parses timestamped metric lines.
 
     Given metrics lines like:
 
     2017-10-25 10:08:30.961510 87d5562a5fe0ea075ebb2efb0300d10d23bfa474645bb464d222976ed872df2a cpu user 33 system 15
 
-    Returns an iterable of (ts, container, user_cpu, system_cpu)
+    Returns an iterable of (ts, container, user_cpu, system_cpu). It also updates
+    container.peak_total_rss and container.total_user_cpu and container.total_system_cpu.
     """
     prev_by_container = {}
+    peak_rss_by_container = {}
     for line in f:
       ts, rest = split_timestamp(line.rstrip())
+      total_rss = None
       try:
         container, metric_type, rest2 = rest.split(" ", 2)
-        if metric_type != "cpu":
-          continue
-        _, user_cpu_s, _, system_cpu_s = rest2.split(" ", 3)
+        if metric_type == "cpu":
+          _, user_cpu_s, _, system_cpu_s = rest2.split(" ", 3)
+        elif metric_type == "memory":
+          memory_metrics = rest2.split(" ")
+          total_rss = int(memory_metrics[memory_metrics.index("total_rss") + 1 ])
       except:
         logging.warning("Skipping metric line: %s", line)
         continue
 
+      if total_rss is not None:
+        peak_rss_by_container[container] = max(peak_rss_by_container.get(container, 0),
+            total_rss)
+        continue
+
       prev_ts, prev_user, prev_system = prev_by_container.get(
           container, (None, None, None))
       user_cpu = int(user_cpu_s)
@@ -267,6 +276,14 @@ class Timeline(object):
               (system_cpu - prev_system)/dt/USER_HZ
       prev_by_container[container] = ts, user_cpu, system_cpu
 
+    # Now update container totals
+    for c in self.containers:
+      if c.id in prev_by_container:
+        _, u, s = prev_by_container[c.id]
+        c.total_user_cpu, c.total_system_cpu = u / USER_HZ, s / USER_HZ
+      if c.id in peak_rss_by_container:
+        c.peak_total_rss = peak_rss_by_container[c.id]
+
   def create(self, output):
     # Read logfiles
     timelines = []

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/docker/test-with-docker.py
----------------------------------------------------------------------
diff --git a/docker/test-with-docker.py b/docker/test-with-docker.py
index 59640b4..c3f4427 100755
--- a/docker/test-with-docker.py
+++ b/docker/test-with-docker.py
@@ -95,14 +95,10 @@ as part of the build, in logs/docker/*/timeline.html.
 # Suggested speed improvement TODOs:
 #   - Speed up testdata generation
 #   - Skip generating test data for variants not being run
-#   - Make container image smaller; perhaps make BE test binaries
-#     smaller
-#   - Split up cluster tests into two groups
+#   - Make container image smaller
 #   - Analyze .xml junit files to find slow tests; eradicate
 #     or move to different suite.
-#   - Avoid building BE tests, and build them during execution,
-#     saving on container space as well as baseline build
-#     time.
+#   - Run BE tests earlier (during data load)
 
 # We do not use Impala's python environment here, nor do we depend on
 # non-standard python libraries to avoid needing extra build steps before
@@ -125,11 +121,11 @@ if __name__ == '__main__' and __package__ is None:
 
 base = os.path.dirname(os.path.abspath(__file__))
 
+LOG_FORMAT="%(asctime)s %(threadName)s: %(message)s"
 
-def main():
 
-  logging.basicConfig(level=logging.INFO,
-                      format='%(asctime)s %(threadName)s: %(message)s')
+def main():
+  logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
 
   default_parallel_test_concurrency, default_suite_concurrency, default_memlimit_gb = \
       _compute_defaults()
@@ -160,20 +156,25 @@ def main():
   parser.add_argument(
       '--build-image', metavar='IMAGE',
       help='Skip building, and run tests on pre-existing image.')
-  parser.add_argument(
+
+  suite_group = parser.add_mutually_exclusive_group()
+  suite_group.add_argument(
       '--suite', metavar='VARIANT', action='append',
-      help="Run specific test suites; can be specified multiple times. \
-          If not specified, all tests are run. Choices: " + ",".join(ALL_SUITES))
+      help="""
+        Run specific test suites; can be specified multiple times.
+        Test-with-docker may shard some suites to improve parallelism.
+        If not specified, default tests are run.
+        Default: %s, All Choices: %s
+        """ % (",".join([ s.name for s in DEFAULT_SUITES]),
+          ",".join([ s.name for s in ALL_SUITES ])))
+  suite_group.add_argument('--all-suites', action='store_true', default=False,
+      help="If set, run all available suites.")
   parser.add_argument(
       '--name', metavar='NAME',
       help="Use a specific name for the test run. The name is used " +
       "as a prefix for the container and image names, and " +
       "as part of the log directory naming. Defaults to include a timestamp.",
       default=datetime.datetime.now().strftime("i-%Y%m%d-%H%M%S"))
-  parser.add_argument('--timeout', metavar='MINUTES',
-                      help="Timeout for test suites, in minutes.",
-                      type=int,
-                      default=60*2)
   parser.add_argument('--ccache-dir', metavar='DIR',
                       help="CCache directory to use",
                       default=os.path.expanduser("~/.ccache"))
@@ -181,22 +182,28 @@ def main():
   args = parser.parse_args()
 
   if not args.suite:
-    args.suite = ALL_SUITES
+    if args.all_suites:
+      # Ignore "NOOP" tasks, as they are just for testing.
+      args.suite = [ s.name for s in ALL_SUITES if not s.name.startswith("NOOP") ]
+    else:
+      args.suite = [ s.name for s in DEFAULT_SUITES ]
   t = TestWithDocker(
-      build_image=args.build_image, suites=args.suite,
-      name=args.name, timeout=args.timeout, cleanup_containers=args.cleanup_containers,
+      build_image=args.build_image, suite_names=args.suite,
+      name=args.name, cleanup_containers=args.cleanup_containers,
       cleanup_image=args.cleanup_image, ccache_dir=args.ccache_dir, test_mode=args.test,
       parallel_test_concurrency=args.parallel_test_concurrency,
       suite_concurrency=args.suite_concurrency,
       impalad_mem_limit_bytes=args.impalad_mem_limit_bytes)
 
-  logging.getLogger('').addHandler(
-      logging.FileHandler(os.path.join(_make_dir_if_not_exist(t.log_dir), "log.txt")))
+  fh = logging.FileHandler(os.path.join(_make_dir_if_not_exist(t.log_dir), "log.txt"))
+  fh.setFormatter(logging.Formatter(LOG_FORMAT))
+  logging.getLogger('').addHandler(fh)
 
   logging.info("Arguments: %s", args)
 
   ret = t.run()
   t.create_timeline()
+  t.log_summary()
 
   if not ret:
     sys.exit(1)
@@ -229,10 +236,11 @@ def _compute_defaults():
   logging.info("CPUs: %s Memory (GB): %s", cpus, total_memory_gb)
 
   parallel_test_concurrency = min(cpus, 8)
-  memlimit_gb = 7
+  memlimit_gb = 8
 
   if total_memory_gb >= 95:
     suite_concurrency = 4
+    memlimit_gb = 11
     parallel_test_concurrency = min(cpus, 12)
   elif total_memory_gb >= 65:
     suite_concurrency = 3
@@ -244,19 +252,100 @@ def _compute_defaults():
 
   return parallel_test_concurrency, suite_concurrency, memlimit_gb * 1024 * 1024 * 1024
 
+class Suite(object):
+  """Encapsulates a test suite.
+
+  A test suite is a named thing that the user can select to run,
+  and it runs in its own container, in parallel with other suites.
+  The actual running happens from entrypoint.sh and is controlled
+  mostly by environment variables. When complexity is easier
+  to handle in Python (with its richer data types), we prefer
+  it here.
+  """
+  def __init__(self, name, **envs):
+    """Create suite with given name and environment."""
+    self.name = name
+    self.envs = dict(
+        FE_TEST="false",
+        BE_TEST="false",
+        EE_TEST="false",
+        JDBC_TEST="false",
+        CLUSTER_TEST="false")
+    # If set, this suite is sharded past a certain suite concurrency threshold.
+    self.shard_at_concurrency = None
+    # Variable to which to append --shard_tests
+    self.sharding_variable = None
+    self.envs[name] = "true"
+    self.envs.update(envs)
+    self.timeout_minutes = 120
+
+  def copy(self, name, **envs):
+    """Duplicates current suite allowing for environment updates."""
+    v = dict()
+    v.update(self.envs)
+    v.update(envs)
+    ret = Suite(name, **v)
+    ret.shard_at_concurrency = self.shard_at_concurrency
+    ret.sharding_variable = self.sharding_variable
+    ret.timeout_minutes = self.timeout_minutes
+    return ret
+
+  def exhaustive(self):
+    """Returns an "exhaustive" copy of the suite."""
+    r = self.copy(self.name + "_EXHAUSTIVE", EXPLORATION_STRATEGY="exhaustive")
+    r.timeout_minutes = 240
+    return r
+
+  def sharded(self, shards):
+    """Returns a list of sharded copies of the list.
+
+    key is the name of the variable which needs to be appended with "--shard-tests=..."
+    """
+    # RUN_TESTS_ARGS
+    ret = []
+    for i in range(1, shards + 1):
+      s = self.copy("%s_%d_of_%d" % (self.name, i, shards))
+      s.envs[self.sharding_variable] = self.envs.get(self.sharding_variable, "") \
+          + " --shard_tests=%s/%s" % (i, shards)
+      ret.append(s)
+    return ret
 
-# The names of all the test tracks supported.  NOOP isn't included here, but is
-# handy for testing.  These are organized slowest-to-fastest, so that, when
-# parallelism of suites is limited, the total time is not impacted.
-ALL_SUITES = [
-    "EE_TEST_SERIAL",
-    "EE_TEST_PARALLEL",
-    "CLUSTER_TEST",
-    "BE_TEST",
-    "FE_TEST",
-    "JDBC_TEST",
+# Definitions of all known suites:
+ee_test_serial = Suite("EE_TEST_SERIAL", EE_TEST="true",
+    RUN_TESTS_ARGS="--skip-parallel --skip-stress")
+ee_test_serial.shard_at_concurrency = 4
+ee_test_serial.sharding_variable = "RUN_TESTS_ARGS"
+ee_test_serial_exhaustive = ee_test_serial.exhaustive()
+ee_test_parallel = Suite("EE_TEST_PARALLEL", EE_TEST="true",
+    RUN_TESTS_ARGS="--skip-serial")
+ee_test_parallel_exhaustive = ee_test_parallel.exhaustive()
+cluster_test = Suite("CLUSTER_TEST")
+cluster_test.shard_at_concurrency = 4
+cluster_test.sharding_variable = "RUN_CUSTOM_CLUSTER_TESTS_ARGS"
+cluster_test_exhaustive = cluster_test.exhaustive()
+
+# Default supported suites. These are organized slowest-to-fastest, so that,
+# when parallelism is limited, the total time is least impacted.
+DEFAULT_SUITES = [
+    ee_test_serial,
+    ee_test_parallel,
+    cluster_test,
+    Suite("BE_TEST"),
+    Suite("FE_TEST"),
+    Suite("JDBC_TEST")
 ]
 
+OTHER_SUITES = [
+    ee_test_parallel_exhaustive,
+    ee_test_serial_exhaustive,
+    cluster_test_exhaustive,
+    Suite("RAT_CHECK"),
+    # These are used for testing this script
+    Suite("NOOP"),
+    Suite("NOOP_FAIL"),
+    Suite("NOOP_SLEEP_FOREVER")
+]
+ALL_SUITES = DEFAULT_SUITES + OTHER_SUITES
 
 def _call(args, check=True):
   """Wrapper for calling a subprocess.
@@ -297,6 +386,12 @@ class Container(object):
     self.running = running
     self.start = None
     self.end = None
+    self.removed = False
+
+    # Updated by Timeline class
+    self.total_user_cpu = -1
+    self.total_system_cpu = -1
+    self.peak_total_rss = -1
 
   def runtime_seconds(self):
     if self.start and self.end:
@@ -311,15 +406,13 @@ class Container(object):
 class TestWithDocker(object):
   """Tests Impala using Docker containers for parallelism."""
 
-  def __init__(self, build_image, suites, name, timeout, cleanup_containers,
+  def __init__(self, build_image, suite_names, name, cleanup_containers,
                cleanup_image, ccache_dir, test_mode,
                suite_concurrency, parallel_test_concurrency,
                impalad_mem_limit_bytes):
     self.build_image = build_image
-    self.suites = [TestSuiteRunner(self, suite) for suite in suites]
     self.name = name
     self.containers = []
-    self.timeout_minutes = timeout
     self.git_root = _check_output(["git", "rev-parse", "--show-toplevel"]).strip()
     self.cleanup_containers = cleanup_containers
     self.cleanup_image = cleanup_image
@@ -336,6 +429,26 @@ class TestWithDocker(object):
     self.parallel_test_concurrency = parallel_test_concurrency
     self.impalad_mem_limit_bytes = impalad_mem_limit_bytes
 
+    # Map suites back into objects; we ignore case for this mapping.
+    suites = []
+    suites_by_name = {}
+    for suite in ALL_SUITES:
+      suites_by_name[suite.name.lower()] = suite
+    for suite_name in suite_names:
+      suites.append(suites_by_name[suite_name.lower()])
+
+    # If we have enough concurrency, shard some suites into two halves.
+    suites2 = []
+    for suite in suites:
+      if suite.shard_at_concurrency is not None and \
+          suite_concurrency >= suite.shard_at_concurrency:
+        suites2.extend(suite.sharded(2))
+      else:
+        suites2.append(suite)
+    suites = suites2
+
+    self.suite_runners = [TestSuiteRunner(self, suite) for suite in suites]
+
   def _create_container(self, image, name, logdir, logname, entrypoint, extras=None):
     """Returns a new container.
 
@@ -374,8 +487,10 @@ class TestWithDocker(object):
         + extras
         + [image]
         + entrypoint).strip()
-    return Container(name=name, id_=container_id,
+    ctr = Container(name=name, id_=container_id,
                      logfile=os.path.join(self.log_dir, logdir, logname))
+    logging.info("Created container %s", ctr)
+    return ctr
 
   def _run_container(self, container):
     """Runs container, and returns True if the container had a successful exit value.
@@ -413,15 +528,17 @@ class TestWithDocker(object):
   @staticmethod
   def _stop_container(container):
     """Stops container. Ignores errors (e.g., if it's already exited)."""
-    _call(["docker", "stop", container.id], check=False)
     if container.running:
+      _call(["docker", "stop", container.id], check=False)
       container.end = time.time()
       container.running = False
 
   @staticmethod
   def _rm_container(container):
     """Removes container."""
-    _call(["docker", "rm", container.id], check=False)
+    if not container.removed:
+      _call(["docker", "rm", container.id], check=False)
+      container.removed = True
 
   def _create_build_image(self):
     """Creates the "build image", with Impala compiled and data loaded."""
@@ -451,36 +568,36 @@ class TestWithDocker(object):
         self._rm_container(container)
 
   def _run_tests(self):
-    start_time = time.time()
-    timeout_seconds = self.timeout_minutes * 60
-    deadline = start_time + timeout_seconds
     pool = multiprocessing.pool.ThreadPool(processes=self.suite_concurrency)
     outstanding_suites = []
-    for suite in self.suites:
+    for suite in self.suite_runners:
       suite.task = pool.apply_async(suite.run)
       outstanding_suites.append(suite)
 
     ret = True
-    while time.time() < deadline and len(outstanding_suites) > 0:
-      for suite in list(outstanding_suites):
-        task = suite.task
-        if task.ready():
-          this_task_ret = task.get()
-          outstanding_suites.remove(suite)
-          if this_task_ret:
-            logging.info("Suite %s succeeded.", suite.name)
-          else:
-            logging.info("Suite %s failed.", suite.name)
-            ret = False
-      time.sleep(10)
-    if len(outstanding_suites) > 0:
-      for container in self.containers:
-        self._stop_container(container)
-      for suite in outstanding_suites:
-        suite.task.get()
-      raise Exception("Tasks not finished within timeout (%s minutes): %s" %
-                      (self.timeout_minutes, ",".join([
-                          suite.name for suite in outstanding_suites])))
+    try:
+      while len(outstanding_suites) > 0:
+        for suite in list(outstanding_suites):
+          if suite.timed_out():
+            msg = "Task %s not finished within timeout %s" % (suite.name,
+                suite.suite.timeout_minutes,)
+            logging.error(msg)
+            raise Exception(msg)
+          task = suite.task
+          if task.ready():
+            this_task_ret = task.get()
+            outstanding_suites.remove(suite)
+            if this_task_ret:
+              logging.info("Suite %s succeeded.", suite.name)
+            else:
+              logging.info("Suite %s failed.", suite.name)
+              ret = False
+        time.sleep(5)
+    except KeyboardInterrupt:
+      logging.info("\n\nDetected KeyboardInterrupt; shutting down!\n\n")
+      raise
+    finally:
+      pool.terminate()
     return ret
 
   def run(self):
@@ -495,17 +612,13 @@ class TestWithDocker(object):
       else:
         self.image = self.build_image
       ret = self._run_tests()
-      logging.info("Containers:")
-      for c in self.containers:
-        def to_success_string(exitcode):
-          if exitcode == 0:
-            return "SUCCESS"
-          return "FAILURE"
-        logging.info("%s %s %s %s", to_success_string(c.exitcode), c.name, c.logfile,
-                     c.runtime_seconds())
       return ret
     finally:
       self.monitor.stop()
+      if self.cleanup_containers:
+        for c in self.containers:
+          self._stop_container(c)
+          self._rm_container(c)
       if self.cleanup_image and self.image:
         _call(["docker", "rmi", self.image], check=False)
       logging.info("Memory usage: %s GB min, %s GB max",
@@ -526,6 +639,23 @@ class TestWithDocker(object):
         interesting_re=self._INTERESTING_RE)
     timeline.create(os.path.join(self.log_dir, "timeline.html"))
 
+  def log_summary(self):
+    logging.info("Containers:")
+    def to_success_string(exitcode):
+      if exitcode == 0:
+        return "SUCCESS"
+      return "FAILURE"
+
+    for c in self.containers:
+      logging.info("%s %s %s %0.1fm wall, %0.1fm user, %0.1fm system, " +
+            "%0.1fx parallelism, %0.1f GB peak RSS",
+          to_success_string(c.exitcode), c.name, c.logfile,
+          c.runtime_seconds() / 60.0,
+          c.total_user_cpu / 60.0,
+          c.total_system_cpu / 60.0,
+          (c.total_user_cpu + c.total_system_cpu) / max(c.runtime_seconds(), 0.0001),
+          c.peak_total_rss / 1024.0 / 1024.0 / 1024.0)
+
 
 class TestSuiteRunner(object):
   """Runs a single test suite."""
@@ -534,12 +664,24 @@ class TestSuiteRunner(object):
     self.test_with_docker = test_with_docker
     self.suite = suite
     self.task = None
-    self.name = self.suite.lower()
+    self.name = suite.name.lower()
+    # Set at the beginning of run and facilitates enforcing timeouts
+    # for individual suites.
+    self.deadline = None
+
+  def timed_out(self):
+    return self.deadline is not None and time.time() > self.deadline
 
   def run(self):
     """Runs given test. Returns true on success, based on exit code."""
+    self.deadline = time.time() + self.suite.timeout_minutes * 60
     test_with_docker = self.test_with_docker
     suite = self.suite
+    envs = ["-e", "NUM_CONCURRENT_TESTS=" + str(test_with_docker.parallel_test_concurrency)]
+    for k, v in sorted(suite.envs.iteritems()):
+      envs.append("-e")
+      envs.append("%s=%s" % (k, v))
+
     self.start = time.time()
 
     # io-file-mgr-test expects a real-ish file system at /tmp;
@@ -554,13 +696,11 @@ class TestSuiteRunner(object):
         name=container_name,
         extras=[
             "-v", tmpdir + ":/tmp",
-            "-u", str(os.getuid()),
-            "-e", "NUM_CONCURRENT_TESTS=" +
-            str(test_with_docker.parallel_test_concurrency),
-        ],
+            "-u", str(os.getuid())
+        ] + envs,
         logdir=self.name,
-        logname="log-test-" + self.suite + ".txt",
-        entrypoint=["/mnt/base/entrypoint.sh", "test_suite", suite])
+        logname="log-test-" + self.suite.name + ".txt",
+        entrypoint=["/mnt/base/entrypoint.sh", "test_suite", suite.name])
 
     test_with_docker.containers.append(container)
     test_with_docker.monitor.add(container)
@@ -569,7 +709,7 @@ class TestSuiteRunner(object):
     except:
       return False
     finally:
-      logging.info("Cleaning up containers for %s" % (suite,))
+      logging.info("Cleaning up containers for %s" % (suite.name,))
       test_with_docker._stop_container(container)
       if test_with_docker.cleanup_containers:
         test_with_docker._rm_container(container)

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/docker/timeline.html.template
----------------------------------------------------------------------
diff --git a/docker/timeline.html.template b/docker/timeline.html.template
index c8de821..d2960d7 100644
--- a/docker/timeline.html.template
+++ b/docker/timeline.html.template
@@ -96,9 +96,7 @@ function ts_to_date(secs) {
 }
 
 function drawChart() {
-  var container = document.getElementById('container');
-  var timelineContainer = document.createElement("div");
-  container.appendChild(timelineContainer);
+  var timelineContainer = document.getElementById('timelineContainer');
   var chart = new google.visualization.Timeline(timelineContainer);
   var dataTable = new google.visualization.DataTable();
   dataTable.addColumn({ type: 'string', id: 'Position' });
@@ -115,7 +113,7 @@ function drawChart() {
 
   for (const k of Object.keys(data.metrics)) {
     var lineChart = document.createElement("div");
-    container.appendChild(lineChart);
+    lineChartContainer.appendChild(lineChart);
 
     var dataTable = new google.visualization.DataTable();
     dataTable.addColumn({ type: 'timeofday', id: 'Time' });
@@ -139,4 +137,5 @@ function drawChart() {
   }
 }
 </script>
-<div id="container" style="height: 200px;"></div>
+<div id="timelineContainer" style="height: 400px;"></div>
+<div id="lineChartContainer" style="height: 200px;"></div>

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/testdata/bin/run-hbase.sh
----------------------------------------------------------------------
diff --git a/testdata/bin/run-hbase.sh b/testdata/bin/run-hbase.sh
index f264b65..241ccd2 100755
--- a/testdata/bin/run-hbase.sh
+++ b/testdata/bin/run-hbase.sh
@@ -36,6 +36,7 @@ cat > ${HBASE_CONF_DIR}/hbase-env.sh <<EOF
 export JAVA_HOME=${JAVA_HOME}
 export HBASE_LOG_DIR=${HBASE_LOGDIR}
 export HBASE_PID_DIR=${HBASE_LOGDIR}
+export HBASE_HEAPSIZE=1g
 EOF
 
 # Put zookeeper things in the logs/cluster/zoo directory.

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/testdata/cluster/node_templates/common/etc/init.d/hdfs-common
----------------------------------------------------------------------
diff --git a/testdata/cluster/node_templates/common/etc/init.d/hdfs-common b/testdata/cluster/node_templates/common/etc/init.d/hdfs-common
index 9a7ddd3..d53ecce 100644
--- a/testdata/cluster/node_templates/common/etc/init.d/hdfs-common
+++ b/testdata/cluster/node_templates/common/etc/init.d/hdfs-common
@@ -18,3 +18,10 @@
 export HADOOP_LOG_DIR="$LOG_DIR/hadoop-hdfs"
 export HADOOP_ROOT_LOGGER="${HADOOP_ROOT_LOGGER:-INFO,RFA}"
 export HADOOP_LOGFILE=$(basename $0).log
+
+# Force minicluster processes to have a maximum heap.
+# If unset, on large machines, the JVM default
+# is 1/4th of the RAM (or so), and processes like DataNode
+# end up never garbage collecting.
+export HADOOP_HEAPSIZE_MIN=512m
+export HADOOP_HEAPSIZE_MAX=2g

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/tests/conftest.py
----------------------------------------------------------------------
diff --git a/tests/conftest.py b/tests/conftest.py
index 144fc5c..1e6adda 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -120,6 +120,10 @@ def pytest_addoption(parser):
                    default=False, help="Run all tests with KRPC disabled. This assumes "
                    "that the test cluster has been started with --disable_krpc.")
 
+  parser.addoption("--shard_tests", default=None,
+                   help="If set to N/M (e.g., 3/5), will split the tests into "
+                   "M partitions and run the Nth partition. 1-indexed.")
+
 
 def pytest_assertrepr_compare(op, left, right):
   """
@@ -501,3 +505,34 @@ def validate_pytest_config():
     if any(pytest.config.option.impalad.startswith(loc) for loc in local_prefixes):
       logging.error("--testing_remote_cluster can not be used with a local impalad")
       pytest.exit("Invalid pytest config option: --testing_remote_cluster")
+
+
+@pytest.hookimpl(trylast=True)
+def pytest_collection_modifyitems(items, config, session):
+  """Hook to handle --shard_tests command line option.
+
+  If set, this "deselects" a subset of tests, by hashing
+  their id into buckets.
+  """
+  if not config.option.shard_tests:
+    return
+
+  num_items = len(items)
+  this_shard, num_shards = map(int, config.option.shard_tests.split("/"))
+  assert 0 <= this_shard <= num_shards
+  if this_shard == num_shards:
+    this_shard = 0
+
+  items_selected, items_deselected = [], []
+  for i in items:
+    if hash(i.nodeid) % num_shards == this_shard:
+      items_selected.append(i)
+    else:
+      items_deselected.append(i)
+  config.hook.pytest_deselected(items=items_deselected)
+
+  # We must modify the items list in place for it to take effect.
+  items[:] = items_selected
+
+  logging.info("pytest shard selection enabled %s. Of %d items, selected %d items by hash.",
+      config.option.shard_tests, num_items, len(items))

http://git-wip-us.apache.org/repos/asf/impala/blob/d733ea68/tests/run-tests.py
----------------------------------------------------------------------
diff --git a/tests/run-tests.py b/tests/run-tests.py
index 95e0d11..b1a9fdd 100755
--- a/tests/run-tests.py
+++ b/tests/run-tests.py
@@ -180,6 +180,10 @@ def build_test_args(base_name, valid_dirs=VALID_TEST_DIRS):
     #
     explicit_tests = pytest.config.getoption(FILE_OR_DIR)
     config_options = [arg for arg in commandline_args if arg not in explicit_tests]
+    # We also want to strip out any --shard_tests option and its corresponding value.
+    while "--shard_tests" in config_options:
+      i = config_options.index("--shard_tests")
+      del config_options[i:i+2]
     test_args = ignored_dirs + logging_args + config_options
 
   return test_args
@@ -237,6 +241,11 @@ if __name__ == "__main__":
     test_executor.run_tests(sys.argv[1:])
     sys.exit(0)
 
+  def run(args):
+    """Helper to print out arguments of test_executor before invoking."""
+    print "Running TestExecutor with args: %s" % (args,)
+    test_executor.run_tests(args)
+
   os.chdir(TEST_DIR)
 
   # Create the test result directory if it doesn't already exist.
@@ -248,25 +257,25 @@ if __name__ == "__main__":
   # pytest warnings/messages and displays collected tests
 
   if '--collect-only' in sys.argv:
-    test_executor.run_tests(sys.argv[1:])
+    run(sys.argv[1:])
   else:
     print_metrics('connections')
     # First run query tests that need to be executed serially
     if not skip_serial:
       base_args = ['-m', 'execute_serially']
-      test_executor.run_tests(base_args + build_test_args('serial'))
+      run(base_args + build_test_args('serial'))
       print_metrics('connections')
 
     # Run the stress tests tests
     if not skip_stress:
       base_args = ['-m', 'stress', '-n', NUM_STRESS_CLIENTS]
-      test_executor.run_tests(base_args + build_test_args('stress'))
+      run(base_args + build_test_args('stress'))
       print_metrics('connections')
 
     # Run the remaining query tests in parallel
     if not skip_parallel:
       base_args = ['-m', 'not execute_serially and not stress', '-n', NUM_CONCURRENT_TESTS]
-      test_executor.run_tests(base_args + build_test_args('parallel'))
+      run(base_args + build_test_args('parallel'))
 
     # The total number of tests executed at this point is expected to be >0
     # If it is < 0 then the script needs to exit with a non-zero
@@ -277,7 +286,7 @@ if __name__ == "__main__":
     # Finally, validate impalad/statestored metrics.
     args = build_test_args(base_name='verify-metrics', valid_dirs=['verifiers'])
     args.append('verifiers/test_verify_metrics.py')
-    test_executor.run_tests(args)
+    run(args)
 
   if test_executor.tests_failed:
     sys.exit(1)