Posted to commits@arrow.apache.org by ap...@apache.org on 2019/04/25 15:54:19 UTC
[arrow] branch master updated: ARROW-4827: [C++] Implement benchmark comparison
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new c3511db ARROW-4827: [C++] Implement benchmark comparison
c3511db is described below
commit c3511db97e981fd24367916e45fd1d1edd83bb73
Author: François Saint-Jacques <fs...@gmail.com>
AuthorDate: Thu Apr 25 17:54:09 2019 +0200
ARROW-4827: [C++] Implement benchmark comparison
This script/library allows comparing revisions/builds.
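As a rough illustration of the comparison rule (a minimal sketch, not the code added by this commit; the function names `relative_change` and `is_regression` are hypothetical), a benchmark is flagged when its value moves in the unfavourable direction by more than the threshold:

```python
DEFAULT_THRESHOLD = 0.05  # 5%, the default used in compare.py


def relative_change(baseline, contender):
    """Relative change of the contender with respect to the baseline."""
    if baseline == 0:
        # No meaningful ratio can be computed; treat as unchanged.
        return 0.0
    return float(contender - baseline) / abs(baseline)


def is_regression(baseline, contender, less_is_better,
                  threshold=DEFAULT_THRESHOLD):
    """True if the contender is worse than the baseline beyond threshold."""
    change = relative_change(baseline, contender)
    # For throughput-like metrics (more is better), a drop is the bad case.
    adjusted = change if less_is_better else -change
    return adjusted > threshold
```

With these assumptions, a timing that goes from 100 us to 110 us counts as a regression, while a timing that goes from 100 us to 104 us does not.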
Author: François Saint-Jacques <fs...@gmail.com>
Closes #4141 from fsaintjacques/ARROW-4827-benchmark-comparison and squashes the following commits:
a047ae4ed <François Saint-Jacques> Satisfy flake8
e95baf317 <François Saint-Jacques> Add comments and move stuff
ee39a1feb <François Saint-Jacques> Move cpp_runner_from_rev_or_path in CppRunner
2a953f180 <François Saint-Jacques> Missing files
d8e3c1c85 <François Saint-Jacques> Review
514e8e428 <François Saint-Jacques> Introduce RegressionSetArgs
280c93be4 <François Saint-Jacques> Update gitignore
dc031bde7 <François Saint-Jacques> Support conda toolchain
28254676c <François Saint-Jacques> Add --cmake-extras to benchmark-diff command
e6762899c <François Saint-Jacques> Typo
048ba0ede <François Saint-Jacques> Add verbose_third_party
71b10e98a <François Saint-Jacques> Disable python in benchmarks
c3719214c <François Saint-Jacques> Fix flake8 warnings
8845e3e78 <François Saint-Jacques> Remove empty __init__.py
1949f749c <François Saint-Jacques> Supports HEAD revisions
96f999748 <François Saint-Jacques> Add gitignore entry
d9692bc8f <François Saint-Jacques> Fix splitlines
90578af61 <François Saint-Jacques> Add --cmake-extras to build command
7696202ba <François Saint-Jacques> Add doc for bin attribute.
a281ae8e6 <François Saint-Jacques> Various language fixes
1b028390c <François Saint-Jacques> Rename --cxx_flags to --cxx-flags
bc111b2d3 <François Saint-Jacques> Removes copied stuff
d6733b6f4 <François Saint-Jacques> Formatting
21b2e14fc <François Saint-Jacques> Add doc and fix bugs
2a81744cf <François Saint-Jacques> Ooops.
c85661cf3 <François Saint-Jacques> Add documentation
703cf987a <François Saint-Jacques> commit
2c0d512f8 <François Saint-Jacques> Checkpoint
a38f49cd9 <François Saint-Jacques> checkpoint
a5ad76d11 <François Saint-Jacques> Fix syntax
712d2ed3c <François Saint-Jacques> initial commit
---
.gitignore | 2 +
cpp/src/arrow/compute/benchmark-util.h | 13 +
.../arrow/compute/kernels/aggregate-benchmark.cc | 4 +-
dev/archery/archery/benchmark/compare.py | 122 +++++++++
.../archery/archery/benchmark/core.py | 72 +++---
dev/archery/archery/benchmark/google.py | 162 ++++++++++++
dev/archery/archery/benchmark/runner.py | 114 +++++++++
dev/archery/archery/cli.py | 274 +++++++++++++++++++++
dev/archery/archery/lang/cpp.py | 130 ++++++++++
dev/archery/archery/utils/cmake.py | 213 ++++++++++++++++
.gitignore => dev/archery/archery/utils/codec.py | 69 ++----
dev/archery/archery/utils/command.py | 71 ++++++
dev/archery/archery/utils/git.py | 73 ++++++
.gitignore => dev/archery/archery/utils/logger.py | 45 +---
dev/archery/archery/utils/source.py | 141 +++++++++++
.gitignore => dev/archery/setup.py | 58 ++---
.gitignore => dev/archery/tests/test_benchmarks.py | 55 ++---
docs/source/developers/benchmarks.rst | 127 ++++++++++
docs/source/developers/index.rst | 1 +
python/.gitignore | 2 -
20 files changed, 1543 insertions(+), 205 deletions(-)
diff --git a/.gitignore b/.gitignore
index 6bb237a..4a03020 100644
--- a/.gitignore
+++ b/.gitignore
@@ -50,6 +50,8 @@ docs/example1.dat
docs/example3.dat
python/.eggs/
python/doc/
+# Egg metadata
+*.egg-info
.vscode
.idea/
diff --git a/cpp/src/arrow/compute/benchmark-util.h b/cpp/src/arrow/compute/benchmark-util.h
index 1678f8d..865da66 100644
--- a/cpp/src/arrow/compute/benchmark-util.h
+++ b/cpp/src/arrow/compute/benchmark-util.h
@@ -55,5 +55,18 @@ void BenchmarkSetArgs(benchmark::internal::Benchmark* bench) {
bench->Args({static_cast<ArgsType>(size), nulls});
}
+void RegressionSetArgs(benchmark::internal::Benchmark* bench) {
+ // Benchmark changed its parameter type between releases from
+ // int to int64_t. As it doesn't have version macros, we need
+ // to apply C++ template magic.
+ using ArgsType =
+ typename BenchmarkArgsType<decltype(&benchmark::internal::Benchmark::Args)>::type;
+ bench->Unit(benchmark::kMicrosecond);
+
+ // Regressions should only bench L1 data for better stability
+ for (auto nulls : std::vector<ArgsType>({0, 1, 10, 50}))
+ bench->Args({static_cast<ArgsType>(kL1Size), nulls});
+}
+
} // namespace compute
} // namespace arrow
diff --git a/cpp/src/arrow/compute/kernels/aggregate-benchmark.cc b/cpp/src/arrow/compute/kernels/aggregate-benchmark.cc
index e81f879..bbc923f 100644
--- a/cpp/src/arrow/compute/kernels/aggregate-benchmark.cc
+++ b/cpp/src/arrow/compute/kernels/aggregate-benchmark.cc
@@ -309,7 +309,7 @@ BENCHMARK_TEMPLATE(BenchSum, SumBitmapNaive<int64_t>)->Apply(BenchmarkSetArgs);
BENCHMARK_TEMPLATE(BenchSum, SumBitmapReader<int64_t>)->Apply(BenchmarkSetArgs);
BENCHMARK_TEMPLATE(BenchSum, SumBitmapVectorizeUnroll<int64_t>)->Apply(BenchmarkSetArgs);
-static void BenchSumKernel(benchmark::State& state) {
+static void RegressionSumKernel(benchmark::State& state) {
const int64_t array_size = state.range(0) / sizeof(int64_t);
const double null_percent = static_cast<double>(state.range(1)) / 100.0;
auto rand = random::RandomArrayGenerator(1923);
@@ -328,7 +328,7 @@ static void BenchSumKernel(benchmark::State& state) {
state.SetBytesProcessed(state.iterations() * array_size * sizeof(int64_t));
}
-BENCHMARK(BenchSumKernel)->Apply(BenchmarkSetArgs);
+BENCHMARK(RegressionSumKernel)->Apply(RegressionSetArgs);
} // namespace compute
} // namespace arrow
diff --git a/dev/archery/archery/benchmark/compare.py b/dev/archery/archery/benchmark/compare.py
new file mode 100644
index 0000000..bf9811f
--- /dev/null
+++ b/dev/archery/archery/benchmark/compare.py
@@ -0,0 +1,122 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+# Define a global regression threshold as 5%. This is purely subjective and
+# flawed. This does not track cumulative regression.
+DEFAULT_THRESHOLD = 0.05
+
+
+class BenchmarkComparator:
+    """ Compares two benchmarks.
+
+    Encodes the logic of comparing two benchmarks and deciding whether the
+    contender induces a regression.
+    """
+
+    def __init__(self, contender, baseline, threshold=DEFAULT_THRESHOLD,
+                 suite_name=None):
+        self.contender = contender
+        self.baseline = baseline
+        self.threshold = threshold
+        self.suite_name = suite_name
+
+    @property
+    def name(self):
+        return self.baseline.name
+
+    @property
+    def less_is_better(self):
+        return self.baseline.less_is_better
+
+    @property
+    def unit(self):
+        return self.baseline.unit
+
+    @property
+    def change(self):
+        new = self.contender.value
+        old = self.baseline.value
+
+        if old == 0 and new == 0:
+            return 0.0
+        if old == 0:
+            return 0.0
+
+        return float(new - old) / abs(old)
+
+    @property
+    def confidence(self):
+        """ Indicate if a comparison of benchmarks should be trusted. """
+        return True
+
+    @property
+    def regression(self):
+        change = self.change
+        adjusted_change = change if self.less_is_better else -change
+        return (self.confidence and adjusted_change > self.threshold)
+
+    def compare(self, comparator=None):
+        return {
+            "benchmark": self.name,
+            "change": self.change,
+            "regression": self.regression,
+            "baseline": self.baseline.value,
+            "contender": self.contender.value,
+            "unit": self.unit,
+            "less_is_better": self.less_is_better,
+        }
+
+    def __call__(self, **kwargs):
+        return self.compare(**kwargs)
+
+
+def pairwise_compare(contender, baseline):
+    dict_contender = {e.name: e for e in contender}
+    dict_baseline = {e.name: e for e in baseline}
+
+    for name in (dict_contender.keys() & dict_baseline.keys()):
+        yield name, (dict_contender[name], dict_baseline[name])
+
+
+class RunnerComparator:
+    """ Compares suites/benchmarks from runners.
+
+    It is up to the caller to ensure that the runners are compatible (both
+    from the same language implementation).
+    """
+
+    def __init__(self, contender, baseline, threshold=DEFAULT_THRESHOLD):
+        self.contender = contender
+        self.baseline = baseline
+        self.threshold = threshold
+
+    def comparisons(self, suite_filter=None, benchmark_filter=None):
+        """ Yield a BenchmarkComparator for each benchmark that is present
+        in both the contender and the baseline. """
+        contender = self.contender.suites(suite_filter, benchmark_filter)
+        baseline = self.baseline.suites(suite_filter, benchmark_filter)
+        suites = pairwise_compare(contender, baseline)
+
+        for suite_name, (suite_cont, suite_base) in suites:
+            benchmarks = pairwise_compare(
+                suite_cont.benchmarks, suite_base.benchmarks)
+
+            for bench_name, (bench_cont, bench_base) in benchmarks:
+                yield BenchmarkComparator(bench_cont, bench_base,
+                                          threshold=self.threshold,
+                                          suite_name=suite_name)
diff --git a/.gitignore b/dev/archery/archery/benchmark/core.py
similarity index 50%
copy from .gitignore
copy to dev/archery/archery/benchmark/core.py
index 6bb237a..83bc273 100644
--- a/.gitignore
+++ b/dev/archery/archery/benchmark/core.py
@@ -15,46 +15,32 @@
# specific language governing permissions and limitations
# under the License.
-apache-rat-*.jar
-arrow-src.tar
-arrow-src.tar.gz
-
-# Compiled source
-*.a
-*.dll
-*.o
-*.py[ocd]
-*.so
-*.so.*
-*.dylib
-.build_cache_dir
-dependency-reduced-pom.xml
-MANIFEST
-compile_commands.json
-build.ninja
-
-# Generated Visual Studio files
-*.vcxproj
-*.vcxproj.*
-*.sln
-*.iml
-
-# Linux perf sample data
-perf.data
-perf.data.old
-
-cpp/.idea/
-cpp/apidoc/xml/
-docs/example.gz
-docs/example1.dat
-docs/example3.dat
-python/.eggs/
-python/doc/
-
-.vscode
-.idea/
-.pytest_cache/
-pkgs
-.Rproj.user
-arrow.Rcheck/
-docker_cache
+import pandas as pa
+
+
+class Benchmark:
+    def __init__(self, name, unit, less_is_better, values, stats=None):
+        self.name = name
+        self.unit = unit
+        self.less_is_better = less_is_better
+        self.values = pa.Series(values)
+        self.statistics = self.values.describe()
+
+    @property
+    def value(self):
+        median = "50%"
+        return float(self.statistics[median])
+
+    def __repr__(self):
+        return f"Benchmark[name={self.name},value={self.value}]"
+
+
+class BenchmarkSuite:
+    def __init__(self, name, benchmarks):
+        self.name = name
+        self.benchmarks = benchmarks
+
+    def __repr__(self):
+        name = self.name
+        benchmarks = self.benchmarks
+        return f"BenchmarkSuite[name={name}, benchmarks={benchmarks}]"
diff --git a/dev/archery/archery/benchmark/google.py b/dev/archery/archery/benchmark/google.py
new file mode 100644
index 0000000..d6efb77
--- /dev/null
+++ b/dev/archery/archery/benchmark/google.py
@@ -0,0 +1,162 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from itertools import filterfalse, groupby, tee
+import json
+import subprocess
+
+from .core import Benchmark
+from ..utils.command import Command
+
+
+def partition(pred, iterable):
+ # adapted from python's examples
+ t1, t2 = tee(iterable)
+ return list(filter(pred, t1)), list(filterfalse(pred, t2))
+
+
+class GoogleBenchmarkCommand(Command):
+ """ Run a google benchmark binary.
+
+ This assumes the binary supports the standard command line options,
+ notably `--benchmark_filter`, `--benchmark_format`, etc...
+ """
+
+ def __init__(self, benchmark_bin, benchmark_filter=None):
+ self.bin = benchmark_bin
+ self.benchmark_filter = benchmark_filter
+
+ def list_benchmarks(self):
+ argv = ["--benchmark_list_tests"]
+ if self.benchmark_filter:
+ argv.append(f"--benchmark_filter={self.benchmark_filter}")
+ result = self.run(*argv, stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE)
+ return str.splitlines(result.stdout.decode("utf-8"))
+
+ def results(self):
+ argv = ["--benchmark_format=json", "--benchmark_repetitions=20"]
+
+ if self.benchmark_filter:
+ argv.append(f"--benchmark_filter={self.benchmark_filter}")
+
+ return json.loads(self.run(*argv, stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE).stdout)
+
+
+class GoogleBenchmarkObservation:
+ """ Represents one run of a single (google c++) benchmark.
+
+ Observations are found when running with `--benchmark_repetitions`. Sadly,
+ the format mixes values and aggregates, e.g.
+
+ RegressionSumKernel/32768/0 1 us 1 us 25.8077GB/s
+ RegressionSumKernel/32768/0 1 us 1 us 25.7066GB/s
+ RegressionSumKernel/32768/0 1 us 1 us 25.1481GB/s
+ RegressionSumKernel/32768/0 1 us 1 us 25.846GB/s
+ RegressionSumKernel/32768/0 1 us 1 us 25.6453GB/s
+ RegressionSumKernel/32768/0_mean 1 us 1 us 25.6307GB/s
+ RegressionSumKernel/32768/0_median 1 us 1 us 25.7066GB/s
+ RegressionSumKernel/32768/0_stddev 0 us 0 us 288.046MB/s
+
+ As of benchmark v1.4.1 (2019-04-24), the only way to differentiate an
+ actual run from the aggregates is to match on the benchmark name. The
+ aggregates will be appended with `_$agg_name`.
+
+ This class encapsulates the logic to separate runs from aggregates. This
+ is hopefully avoided in benchmark's master version with a separate json
+ attribute.
+ """
+
+ def __init__(self, name, real_time, cpu_time, time_unit, size=None,
+ bytes_per_second=None, **kwargs):
+ self._name = name
+ self.real_time = real_time
+ self.cpu_time = cpu_time
+ self.time_unit = time_unit
+ self.size = size
+ self.bytes_per_second = bytes_per_second
+
+ @property
+ def is_agg(self):
+ """ Indicate if the observation is a run or an aggregate. """
+ suffixes = ["_mean", "_median", "_stddev"]
+ return any(map(lambda x: self._name.endswith(x), suffixes))
+
+ @property
+ def is_realtime(self):
+ """ Indicate if the preferred value is realtime instead of cputime. """
+ return self.name.find("/realtime") != -1
+
+ @property
+ def name(self):
+ name = self._name
+ return name.rsplit("_", maxsplit=1)[0] if self.is_agg else name
+
+ @property
+ def time(self):
+ return self.real_time if self.is_realtime else self.cpu_time
+
+ @property
+ def value(self):
+ """ Return the benchmark value."""
+ return self.bytes_per_second if self.size else self.time
+
+ @property
+ def unit(self):
+ return "bytes_per_second" if self.size else self.time_unit
+
+ def __repr__(self):
+ return f"{self.value}"
+
+
+class GoogleBenchmark(Benchmark):
+ """ A set of GoogleBenchmarkObservations. """
+
+ def __init__(self, name, runs):
+ """ Initialize a GoogleBenchmark.
+
+ Parameters
+ ----------
+ name: str
+ Name of the benchmark
+ runs: list(GoogleBenchmarkObservation)
+ Repetitions of GoogleBenchmarkObservation run.
+
+ """
+ self.name = name
+ # exclude google benchmark aggregate artifacts
+ _, runs = partition(lambda b: b.is_agg, runs)
+ self.runs = sorted(runs, key=lambda b: b.value)
+ unit = self.runs[0].unit
+ # If `size` is found in the json dict, then the benchmark is reported
+ # in bytes per second
+ less_is_better = self.runs[0].size is None
+ values = [b.value for b in self.runs]
+ super().__init__(name, unit, less_is_better, values)
+
+ def __repr__(self):
+ return f"GoogleBenchmark[name={self.name},runs={self.runs}]"
+
+ @classmethod
+ def from_json(cls, payload):
+ def group_key(x):
+ return x.name
+
+ benchmarks = map(lambda x: GoogleBenchmarkObservation(**x), payload)
+ groups = groupby(sorted(benchmarks, key=group_key), group_key)
+ return [cls(k, list(bs)) for k, bs in groups]
diff --git a/dev/archery/archery/benchmark/runner.py b/dev/archery/archery/benchmark/runner.py
new file mode 100644
index 0000000..7dc56bd
--- /dev/null
+++ b/dev/archery/archery/benchmark/runner.py
@@ -0,0 +1,114 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import glob
+import os
+import re
+
+from .core import BenchmarkSuite
+from .google import GoogleBenchmarkCommand, GoogleBenchmark
+from ..lang.cpp import CppCMakeDefinition
+from ..utils.cmake import CMakeBuild
+from ..utils.logger import logger
+
+
+def regex_filter(re_expr):
+ if re_expr is None:
+ return lambda s: True
+ re_comp = re.compile(re_expr)
+ return lambda s: re_comp.search(s)
+
+
+class BenchmarkRunner:
+ def suites(self, suite_filter=None, benchmark_filter=None):
+ raise NotImplementedError("BenchmarkRunner must implement suites")
+
+
+class CppBenchmarkRunner(BenchmarkRunner):
+ def __init__(self, build):
+ """ Initialize a CppBenchmarkRunner. """
+ self.build = build
+
+ @property
+ def suites_binaries(self):
+ """ Returns a dict mapping suite names to benchmark binaries. """
+ # Ensure build is up-to-date to run benchmarks
+ self.build()
+ # Not the best method, but works for now
+ glob_expr = os.path.join(self.build.binaries_dir, "*-benchmark")
+ return {os.path.basename(b): b for b in glob.glob(glob_expr)}
+
+ def suite(self, name, suite_bin, benchmark_filter):
+ """ Returns the resulting benchmarks for a given suite. """
+ suite_cmd = GoogleBenchmarkCommand(suite_bin, benchmark_filter)
+
+ # Ensure there will be data
+ benchmark_names = suite_cmd.list_benchmarks()
+ if not benchmark_names:
+ return None
+
+ results = suite_cmd.results()
+ benchmarks = GoogleBenchmark.from_json(results.get("benchmarks"))
+ return BenchmarkSuite(name, benchmarks)
+
+ def suites(self, suite_filter=None, benchmark_filter=None):
+ """ Returns all suites for a runner. """
+ suite_matcher = regex_filter(suite_filter)
+
+ suite_and_binaries = self.suites_binaries
+ for suite_name in suite_and_binaries:
+ if not suite_matcher(suite_name):
+ logger.debug(f"Ignoring suite {suite_name}")
+ continue
+
+ suite_bin = suite_and_binaries[suite_name]
+ suite = self.suite(suite_name, suite_bin,
+ benchmark_filter=benchmark_filter)
+
+ # Filter may exclude all benchmarks
+ if not suite:
+ logger.debug(f"Suite {suite_name} executed but no results")
+ continue
+
+ yield suite
+
+ @staticmethod
+ def from_rev_or_path(src, root, rev_or_path, cmake_conf):
+ """ Returns a CppBenchmarkRunner from a path or a git revision.
+
+ First, it checks if `rev_or_path` points to a valid CMake build
+ directory. If so, it creates a CppBenchmarkRunner with this existing
+ CMakeBuild.
+
+ Otherwise, it assumes `rev_or_path` is a revision, clones/checks out
+ the given revision and creates a fresh CMakeBuild.
+ """
+ build = None
+ if CMakeBuild.is_build_dir(rev_or_path):
+ build = CMakeBuild.from_path(rev_or_path)
+ else:
+ root_rev = os.path.join(root, rev_or_path)
+ os.mkdir(root_rev)
+
+ clone_dir = os.path.join(root_rev, "arrow")
+ # Possibly checkout the sources at given revision, no need to
+ # perform cleanup on cloned repository as root_rev is reclaimed.
+ src_rev, _ = src.at_revision(rev_or_path, clone_dir)
+ cmake_def = CppCMakeDefinition(src_rev.cpp, cmake_conf)
+ build = cmake_def.build(os.path.join(root_rev, "build"))
+
+ return CppBenchmarkRunner(build)
diff --git a/dev/archery/archery/cli.py b/dev/archery/archery/cli.py
new file mode 100644
index 0000000..4fa8896
--- /dev/null
+++ b/dev/archery/archery/cli.py
@@ -0,0 +1,274 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import click
+from contextlib import contextmanager
+import json
+import logging
+import sys
+from tempfile import mkdtemp, TemporaryDirectory
+
+from .benchmark.compare import RunnerComparator, DEFAULT_THRESHOLD
+from .benchmark.runner import CppBenchmarkRunner
+from .lang.cpp import CppCMakeDefinition, CppConfiguration
+from .utils.codec import JsonEncoder
+from .utils.logger import logger, ctx as log_ctx
+from .utils.source import ArrowSources
+
+# Set default logging to INFO in command line.
+logging.basicConfig(level=logging.INFO)
+
+
+@click.group()
+@click.option("--debug", type=bool, is_flag=True, default=False,
+ help="Increase logging with debugging output.")
+@click.option("-q", "--quiet", type=bool, is_flag=True, default=False,
+ help="Silence executed commands.")
+@click.pass_context
+def archery(ctx, debug, quiet):
+ """ Apache Arrow developer utilities.
+
+ See sub-commands help with `archery <cmd> --help`.
+
+ """
+ # Ensure ctx.obj exists
+ ctx.ensure_object(dict)
+
+ log_ctx.quiet = quiet
+ if debug:
+ logger.setLevel(logging.DEBUG)
+
+
+def validate_arrow_sources(ctx, param, src):
+ """ Ensure a directory contains Arrow cpp sources. """
+ if isinstance(src, str):
+ if not ArrowSources.valid(src):
+ raise click.BadParameter(f"No Arrow C++ sources found in {src}.")
+ src = ArrowSources(src)
+ return src
+
+
+build_dir_type = click.Path(dir_okay=True, file_okay=False, resolve_path=True)
+# Supported build types
+build_type = click.Choice(["debug", "relwithdebinfo", "release"],
+ case_sensitive=False)
+# Supported warn levels
+warn_level_type = click.Choice(["everything", "checkin", "production"],
+ case_sensitive=False)
+
+
+@archery.command(short_help="Initialize an Arrow C++ build")
+@click.option("--src", metavar="<arrow_src>", default=ArrowSources.find(),
+ callback=validate_arrow_sources,
+ help="Specify Arrow source directory")
+# toolchain
+@click.option("--cc", metavar="<compiler>", help="C compiler.")
+@click.option("--cxx", metavar="<compiler>", help="C++ compiler.")
+@click.option("--cxx-flags", help="C++ compiler flags.")
+@click.option("--build-type", default="release", type=build_type,
+ help="CMake's CMAKE_BUILD_TYPE")
+@click.option("--warn-level", default="production", type=warn_level_type,
+ help="Controls compiler warnings -W(no-)error.")
+# components
+@click.option("--with-tests", default=True, type=bool,
+ help="Build with tests.")
+@click.option("--with-benchmarks", default=False, type=bool,
+ help="Build with benchmarks.")
+@click.option("--with-python", default=True, type=bool,
+ help="Build with python extension.")
+@click.option("--with-parquet", default=False, type=bool,
+ help="Build with parquet file support.")
+@click.option("--with-gandiva", default=False, type=bool,
+ help="Build with Gandiva expression compiler support.")
+@click.option("--with-plasma", default=False, type=bool,
+ help="Build with Plasma object store support.")
+@click.option("--with-flight", default=False, type=bool,
+ help="Build with Flight rpc support.")
+@click.option("--cmake-extras", type=str, multiple=True,
+ help="Extra flags/options to pass to cmake invocation. "
+ "Can be stacked")
+# misc
+@click.option("-f", "--force", type=bool, is_flag=True, default=False,
+ help="Delete existing build directory if found.")
+@click.option("--targets", type=str, multiple=True,
+ help="Generator targets to run. Can be stacked.")
+@click.argument("build_dir", type=build_dir_type)
+@click.pass_context
+def build(ctx, src, build_dir, force, targets, **kwargs):
+ """ Initialize a C++ build directory.
+
+ The build command creates a directory initialized with Arrow's cpp source
+ cmake and configuration. It can also optionally invoke the generator to
+ test the build (and can be used in scripts).
+
+ Note that archery will carry the caller environment. It will also not touch
+ an existing directory; one must use the `--force` option to remove the
+ existing directory.
+
+ Examples:
+
+ \b
+ # Initialize build with clang7 and avx2 support in directory `clang7-build`
+ \b
+ archery build --cc=clang-7 --cxx=clang++-7 --cxx-flags=-mavx2 clang7-build
+
+ \b
+ # Build and run tests
+ archery build --targets=all --targets=test build
+ """
+ # Arrow's cpp cmake configuration
+ conf = CppConfiguration(**kwargs)
+ # This is a closure around cmake invocation, e.g. calling `def.build()`
+ # yields a directory ready to be run with the generator
+ cmake_def = CppCMakeDefinition(src.cpp, conf)
+ # Create build directory
+ build = cmake_def.build(build_dir, force=force)
+
+ for target in targets:
+ build.run(target)
+
+
+@contextmanager
+def tmpdir(preserve, prefix="arrow-bench-"):
+ if preserve:
+ yield mkdtemp(prefix=prefix)
+ else:
+ with TemporaryDirectory(prefix=prefix) as tmp:
+ yield tmp
+
+
+# Running all benchmarks would be prohibitive. Benchmarks that need to be
+# monitored for regressions should be named with this prefix.
+DEFAULT_BENCHMARK_FILTER = "^Regression"
+
+
+@archery.group()
+@click.pass_context
+def benchmark(ctx):
+ """ Arrow benchmarking.
+
+ Use the diff sub-command to benchmark revisions, and/or build directories.
+ """
+ pass
+
+
+@benchmark.command(name="diff", short_help="Run the C++ benchmark suite")
+@click.option("--src", metavar="<arrow_src>", show_default=True,
+ default=ArrowSources.find(),
+ callback=validate_arrow_sources,
+ help="Specify Arrow source directory")
+@click.option("--suite-filter", metavar="<regex>", show_default=True,
+ type=str, default=None, help="Regex filtering benchmark suites.")
+@click.option("--benchmark-filter", metavar="<regex>", show_default=True,
+ type=str, default=DEFAULT_BENCHMARK_FILTER,
+ help="Regex filtering benchmarks.")
+@click.option("--preserve", type=bool, default=False, show_default=True,
+ is_flag=True, help="Preserve workspace for investigation.")
+@click.option("--threshold", type=float, default=DEFAULT_THRESHOLD,
+ show_default=True,
+ help="Regression failure threshold as a fraction (0.05 is 5%).")
+@click.option("--cmake-extras", type=str, multiple=True,
+ help="Extra flags/options to pass to cmake invocation. "
+ "Can be stacked")
+@click.argument("contender", metavar="[<contender>",
+ default=ArrowSources.WORKSPACE, required=False)
+@click.argument("baseline", metavar="[<baseline>]]", default="master",
+ required=False)
+@click.pass_context
+def benchmark_diff(ctx, src, preserve, suite_filter, benchmark_filter,
+ threshold, cmake_extras, contender, baseline):
+ """ Compare (diff) benchmark runs.
+
+ This command acts like git-diff but for benchmark results.
+
+ The caller can optionally specify both the contender and the baseline. If
+ unspecified, the contender will default to the current workspace (like git)
+ and the baseline will default to master.
+
+ Each target (contender or baseline) can either be a git revision
+ (commit, tag, special values like HEAD) or a cmake build directory. This
+ allows comparing git commits, and/or different compilers and/or compiler
+ flags.
+
+ When a commit is referenced, a local clone of the arrow sources (specified
+ via --src) is performed and the proper branch is created. This is done in
+ a temporary directory which can be left intact with the `--preserve` flag.
+
+ The special token "WORKSPACE" is reserved to specify the current git
+ workspace. This implies that no clone will be performed.
+
+ Examples:
+
+ \b
+ # Compare workspace (contender) with master (baseline)
+ \b
+ archery benchmark diff
+
+ \b
+ # Compare master (contender) with latest version (baseline)
+ \b
+ export LAST=$(git tag -l "apache-arrow-[0-9]*" | sort -rV | head -1)
+ \b
+ archery benchmark diff master "$LAST"
+
+ \b
+ # Compare g++7 (contender) with clang++-7 (baseline) builds
+ \b
+ archery build --with-benchmarks=true \\
+ --cxx-flags=-ftree-vectorize \\
+ --cc=gcc-7 --cxx=g++-7 gcc7-build
+ \b
+ archery build --with-benchmarks=true \\
+ --cxx-flags=-flax-vector-conversions \\
+ --cc=clang-7 --cxx=clang++-7 clang7-build
+ \b
+ archery benchmark diff gcc7-build clang7-build
+
+ \b
+ # Compare default targets but scoped to the suites matching
+ # `^arrow-compute-aggregate` and benchmarks matching `(Sum|Mean)Kernel`.
+ \b
+ archery benchmark diff --suite-filter="^arrow-compute-aggregate" \\
+ --benchmark-filter="(Sum|Mean)Kernel"
+ """
+ with tmpdir(preserve) as root:
+ logger.debug(f"Comparing {contender} (contender) with "
+ f"{baseline} (baseline)")
+
+ conf = CppConfiguration(
+ build_type="release", with_tests=True, with_benchmarks=True,
+ with_python=False, cmake_extras=cmake_extras)
+
+ runner_cont = CppBenchmarkRunner.from_rev_or_path(
+ src, root, contender, conf)
+ runner_base = CppBenchmarkRunner.from_rev_or_path(
+ src, root, baseline, conf)
+
+ runner_comp = RunnerComparator(runner_cont, runner_base, threshold)
+ comparisons = runner_comp.comparisons(suite_filter, benchmark_filter)
+
+ regressions = 0
+ for comparator in comparisons:
+ regressions += comparator.regression
+ print(json.dumps(comparator, cls=JsonEncoder))
+
+ sys.exit(regressions)
+
+
+if __name__ == "__main__":
+ archery(obj={})
diff --git a/dev/archery/archery/lang/cpp.py b/dev/archery/archery/lang/cpp.py
new file mode 100644
index 0000000..84b6346
--- /dev/null
+++ b/dev/archery/archery/lang/cpp.py
@@ -0,0 +1,130 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+
+from ..utils.cmake import CMakeDefinition
+
+
+def truthifier(value):
+ return "ON" if value else "OFF"
+
+
+def or_else(value, default):
+ return value if value else default
+
+
+class CppConfiguration:
+ def __init__(self,
+ # toolchain
+ cc=None, cxx=None, cxx_flags=None,
+ build_type=None, warn_level=None,
+ install_prefix=None, use_conda=None,
+ # components
+ with_tests=True, with_benchmarks=False, with_python=True,
+ with_parquet=False, with_gandiva=False, with_plasma=False,
+ with_flight=False, cmake_extras=None):
+ self.cc = cc
+ self.cxx = cxx
+ self.cxx_flags = cxx_flags
+
+ self.build_type = build_type
+ self.warn_level = warn_level
+ self._install_prefix = install_prefix
+ self._use_conda = use_conda
+
+ self.with_tests = with_tests
+ self.with_benchmarks = with_benchmarks
+ self.with_python = with_python
+ self.with_parquet = with_parquet
+ self.with_gandiva = with_gandiva
+ self.with_plasma = with_plasma
+ self.with_flight = with_flight
+ self.cmake_extras = cmake_extras
+
+ def _gen_defs(self):
+ if self.cxx_flags:
+ yield ("ARROW_CXXFLAGS", self.cxx_flags)
+
+ yield ("CMAKE_BUILD_TYPE", or_else(self.build_type, "debug"))
+ yield ("BUILD_WARNING_LEVEL", or_else(self.warn_level, "production"))
+
+ # if not ctx.quiet:
+ # yield ("ARROW_VERBOSE_THIRDPARTY_BUILD", "ON")
+
+ maybe_prefix = self.install_prefix
+ if maybe_prefix:
+ yield ("CMAKE_INSTALL_PREFIX", maybe_prefix)
+
+ yield ("ARROW_BUILD_TESTS", truthifier(self.with_tests))
+ yield ("ARROW_BUILD_BENCHMARKS", truthifier(self.with_benchmarks))
+
+ yield ("ARROW_PYTHON", truthifier(self.with_python))
+ yield ("ARROW_PARQUET", truthifier(self.with_parquet))
+ yield ("ARROW_GANDIVA", truthifier(self.with_gandiva))
+ yield ("ARROW_PLASMA", truthifier(self.with_plasma))
+ yield ("ARROW_FLIGHT", truthifier(self.with_flight))
+
+ # Detect custom conda toolchain
+ if self.use_conda:
+ for d, v in [('CMAKE_AR', 'AR'), ('CMAKE_RANLIB', 'RANLIB')]:
+ v = os.environ.get(v)
+ if v:
+ yield (d, v)
+
+ @property
+ def install_prefix(self):
+ if self._install_prefix:
+ return self._install_prefix
+
+ if self.use_conda:
+ return os.environ.get("CONDA_PREFIX")
+
+ return None
+
+ @property
+ def use_conda(self):
+ # If the user didn't specify a preference, guess via environment
+ if self._use_conda is None:
+ return os.environ.get("CONDA_PREFIX") is not None
+
+ return self._use_conda
+
+ @property
+ def definitions(self):
+ extras = list(self.cmake_extras) if self.cmake_extras else []
+ return [f"-D{d[0]}={d[1]}" for d in self._gen_defs()] + extras
+
+ @property
+ def environment(self):
+ env = os.environ.copy()
+
+ if self.cc:
+ env["CC"] = self.cc
+
+ if self.cxx:
+ env["CXX"] = self.cxx
+
+ return env
+
+
+class CppCMakeDefinition(CMakeDefinition):
+ def __init__(self, source, conf, **kwargs):
+ self.configuration = conf
+ super().__init__(source, **kwargs,
+ definitions=conf.definitions, env=conf.environment,
+ build_type=conf.build_type)
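The way `CppConfiguration.definitions` turns options into cmake arguments can be illustrated in isolation. The following is a minimal stand-alone sketch, not the code from the patch: `cmake_flags` is a hypothetical helper covering only three of the options, and `-DSOME_OPTION=ON` is a made-up extra argument.

```python
def truthifier(value):
    # CMake boolean options are conventionally spelled ON/OFF.
    return "ON" if value else "OFF"


def cmake_flags(build_type=None, with_tests=True, with_benchmarks=False,
                cmake_extras=None):
    """Hypothetical helper: return -D flags followed by extra arguments."""
    defs = [
        # An unset build type falls back to "debug", as in _gen_defs.
        ("CMAKE_BUILD_TYPE", build_type if build_type else "debug"),
        ("ARROW_BUILD_TESTS", truthifier(with_tests)),
        ("ARROW_BUILD_BENCHMARKS", truthifier(with_benchmarks)),
    ]
    extras = list(cmake_extras) if cmake_extras else []
    return ["-D{}={}".format(name, value) for name, value in defs] + extras


flags = cmake_flags(build_type="release", with_benchmarks=True,
                    cmake_extras=["-DSOME_OPTION=ON"])
print(flags)
```

Extras are appended after the generated definitions, so a user-supplied `-D` for the same variable appears later on the command line.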
diff --git a/dev/archery/archery/utils/cmake.py b/dev/archery/archery/utils/cmake.py
new file mode 100644
index 0000000..38aedab
--- /dev/null
+++ b/dev/archery/archery/utils/cmake.py
@@ -0,0 +1,213 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+import re
+from shutil import rmtree, which
+
+from .command import Command
+
+
+class CMake(Command):
+ def __init__(self, cmake_bin=None):
+ self.bin = cmake_bin if cmake_bin else os.environ.get("CMAKE", "cmake")
+
+ @staticmethod
+ def default_generator():
+ """ Infer default generator.
+
+ Gives precedence to ninja if there exists an executable named `ninja`
+ in the search path.
+ """
+ found_ninja = which("ninja")
+ return "Ninja" if found_ninja else "Make"
+
+
+cmake = CMake()
+
+
+class CMakeDefinition:
+ """ CMakeDefinition captures the cmake invocation arguments.
+
+ It allows creating build directories with the same definition, e.g.
+ ```
+ build_1 = cmake_def.build("/tmp/build-1")
+ build_2 = cmake_def.build("/tmp/build-2")
+
+ ...
+ build_1.all()
+ build_2.all()
+ ```
+ """
+
+ def __init__(self, source, build_type="release", generator=None,
+ definitions=None, env=None):
+ """ Initialize a CMakeDefinition
+
+ Parameters
+ ----------
+ source : str
+ Source directory where the top-level CMakeLists.txt is
+ located. This is usually the root of the project.
+ generator : str, optional
+ definitions: list(str), optional
+ env : dict(str,str), optional
+ Environment to use when invoking cmake. This can be required to
+ work around cmake deficiencies, e.g. CC and CXX.
+ """
+ self.source = os.path.abspath(source)
+ self.build_type = build_type
+ self.generator = generator if generator else cmake.default_generator()
+ self.definitions = definitions if definitions else []
+ self.env = env
+
+ @property
+ def arguments(self):
+ """ Return the arguments to cmake invocation. """
+ arguments = [
+ f"-G{self.generator}",
+ ] + self.definitions + [
+ self.source
+ ]
+ return arguments
+
+ def build(self, build_dir, force=False, **kwargs):
+ """ Invoke cmake into a build directory.
+
+ Parameters
+ ----------
+ build_dir : str
+ Directory in which the CMake build will be instantiated.
+ force : bool
+ If the build folder exists, delete it first. Otherwise, if it's
+ present, an error is raised.
+ """
+ if os.path.exists(build_dir):
+ # Extra safety to ensure we're deleting a build folder.
+ if not CMakeBuild.is_build_dir(build_dir):
+ raise FileExistsError(f"{build_dir} is not a cmake build")
+ if not force:
+ raise FileExistsError(f"{build_dir} exists, use force=True")
+ rmtree(build_dir)
+
+ os.mkdir(build_dir)
+
+ cmake(*self.arguments, cwd=build_dir, env=self.env)
+ return CMakeBuild(build_dir, self.generator.lower(), self.build_type,
+ definition=self, **kwargs)
+
+ def __repr__(self):
+ return f"CMakeDefinition[source={self.source}]"
+
+
+CMAKE_BUILD_TYPE_RE = re.compile("CMAKE_BUILD_TYPE:STRING=([a-zA-Z]+)")
+
+
+class CMakeBuild(Command):
+ """ CMakeBuild represents a build directory initialized by cmake.
+
+ The build instance can be used to build/test/install. It spares the
+ user from needing to know which generator is used.
+ """
+
+ def __init__(self, build_dir, generator, build_type, definition=None):
+ """ Initialize a CMakeBuild.
+
+ The caller must ensure that cmake was invoked in the build directory.
+
+ Parameters
+ ----------
+ definition : CMakeDefinition
+ The definition to build from.
+ build_dir : str
+ The build directory to setup into.
+ """
+ assert CMakeBuild.is_build_dir(build_dir)
+ self.build_dir = os.path.abspath(build_dir)
+ self.bin = generator
+ self.build_type = build_type
+ self.definition = definition
+
+ @property
+ def binaries_dir(self):
+ return os.path.join(self.build_dir, self.build_type)
+
+ def run(self, *argv, verbose=False, **kwargs):
+ extra = []
+ if verbose:
+ extra.append("-v" if self.bin.endswith("ninja") else "VERBOSE=1")
+ # Commands must be run under the build directory
+ super().run(*extra, *argv, **kwargs, cwd=self.build_dir)
+ return self
+
+ def all(self):
+ return self.run("all")
+
+ def clean(self):
+ return self.run("clean")
+
+ def install(self):
+ return self.run("install")
+
+ def test(self):
+ return self.run("test")
+
+ @staticmethod
+ def is_build_dir(path):
+ """ Indicate if a path is a CMake build directory.
+
+ This method only checks for the existence of paths and does not do any
+ validation whatsoever.
+ """
+ cmake_cache = os.path.join(path, "CMakeCache.txt")
+ cmake_files = os.path.join(path, "CMakeFiles")
+ return os.path.exists(cmake_cache) and os.path.exists(cmake_files)
+
+ @staticmethod
+ def from_path(path):
+ """ Instantiate a CMakeBuild from a path.
+
+ This is used to recover from an existing physical directory (created
+ with or without CMakeBuild).
+
+ Note that this recovery is lossy: the original definition is not
+ restored. Only some parameters are recovered (generator and build_type).
+ """
+ if not CMakeBuild.is_build_dir(path):
+ raise ValueError(f"Not a valid CMakeBuild path: {path}")
+
+ generator = "make"
+ if os.path.exists(os.path.join(path, "build.ninja")):
+ generator = "ninja"
+
+ build_type = None
+ # Infer build_type by looking at CMakeCache.txt and looking for a magic
+ # definition
+ cmake_cache_path = os.path.join(path, "CMakeCache.txt")
+ with open(cmake_cache_path, "r") as cmake_cache:
+ candidates = CMAKE_BUILD_TYPE_RE.findall(cmake_cache.read())
+ build_type = candidates[0].lower() if candidates else "release"
+
+ return CMakeBuild(path, generator, build_type)
+
+ def __repr__(self):
+ return ("CMakeBuild["
+ "build = {},"
+ "build_type = {},"
+ "definition = {}]".format(self.build_dir,
+ self.build_type,
+ self.definition))
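The build type recovery in `CMakeBuild.from_path` relies on scanning `CMakeCache.txt` with `CMAKE_BUILD_TYPE_RE`. Below is a minimal stand-alone sketch of that step, assuming a cache file written by hand into a temporary directory; `infer_build_type` is a hypothetical name, not a function from the patch.

```python
import os
import re
import tempfile

# Same pattern as in the patch.
CMAKE_BUILD_TYPE_RE = re.compile("CMAKE_BUILD_TYPE:STRING=([a-zA-Z]+)")


def infer_build_type(build_dir, default="release"):
    """Hypothetical helper: read the build type from CMakeCache.txt."""
    cache_path = os.path.join(build_dir, "CMakeCache.txt")
    with open(cache_path, "r") as cache:
        candidates = CMAKE_BUILD_TYPE_RE.findall(cache.read())
    # The first match wins; the value is lowercased.
    return candidates[0].lower() if candidates else default


with tempfile.TemporaryDirectory() as build_dir:
    cache_path = os.path.join(build_dir, "CMakeCache.txt")
    with open(cache_path, "w") as cache:
        cache.write("CMAKE_BUILD_TYPE:STRING=Debug\n")
    inferred = infer_build_type(build_dir)

    # Without the entry, the fallback applies.
    with open(cache_path, "w") as cache:
        cache.write("# no build type recorded\n")
    fallback = infer_build_type(build_dir)

print(inferred, fallback)
```

The lowercased value matters because `binaries_dir` joins it onto the build directory path.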
diff --git a/.gitignore b/dev/archery/archery/utils/codec.py
similarity index 52%
copy from .gitignore
copy to dev/archery/archery/utils/codec.py
index 6bb237a..612f2df 100644
--- a/.gitignore
+++ b/dev/archery/archery/utils/codec.py
@@ -15,46 +15,29 @@
# specific language governing permissions and limitations
# under the License.
-apache-rat-*.jar
-arrow-src.tar
-arrow-src.tar.gz
-
-# Compiled source
-*.a
-*.dll
-*.o
-*.py[ocd]
-*.so
-*.so.*
-*.dylib
-.build_cache_dir
-dependency-reduced-pom.xml
-MANIFEST
-compile_commands.json
-build.ninja
-
-# Generated Visual Studio files
-*.vcxproj
-*.vcxproj.*
-*.sln
-*.iml
-
-# Linux perf sample data
-perf.data
-perf.data.old
-
-cpp/.idea/
-cpp/apidoc/xml/
-docs/example.gz
-docs/example1.dat
-docs/example3.dat
-python/.eggs/
-python/doc/
-
-.vscode
-.idea/
-.pytest_cache/
-pkgs
-.Rproj.user
-arrow.Rcheck/
-docker_cache
+
+import json
+
+from ..benchmark.compare import BenchmarkComparator
+
+
+class JsonEncoder(json.JSONEncoder):
+ def default(self, o):
+ if isinstance(o, BenchmarkComparator):
+ comparator = {
+ "benchmark": o.name,
+ "change": o.change,
+ "regression": o.regression,
+ "baseline": o.baseline.value,
+ "contender": o.contender.value,
+ "unit": o.unit,
+ "less_is_better": o.less_is_better,
+ }
+
+ suite_name = o.suite_name
+ if suite_name:
+ comparator["suite"] = suite_name
+
+ return comparator
+
+ return json.JSONEncoder.default(self, o)
diff --git a/dev/archery/archery/utils/command.py b/dev/archery/archery/utils/command.py
new file mode 100644
index 0000000..46d0066
--- /dev/null
+++ b/dev/archery/archery/utils/command.py
@@ -0,0 +1,71 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+import shutil
+import subprocess
+
+from .logger import logger, ctx
+
+
+def find_exec(executable):
+ exec_exists = os.path.exists(executable)
+ return executable if exec_exists else shutil.which(executable)
+
+
+# Decorator running a command and returning stdout
+class capture_stdout:
+ def __init__(self, strip=False):
+ self.strip = strip
+
+ def __call__(self, f):
+ def strip_it(x):
+ return x.strip() if self.strip else x
+
+ def wrapper(*argv, **kwargs):
+ # Ensure stdout is captured
+ kwargs["stdout"] = subprocess.PIPE
+ return strip_it(f(*argv, **kwargs).stdout)
+ return wrapper
+
+
+class Command:
+ """ A runnable command.
+
+ Class inheriting from the Command class must provide the bin
+ property/attribute.
+ """
+
+ def run(self, *argv, **kwargs):
+ assert hasattr(self, "bin")
+ invocation = [find_exec(self.bin)]
+ invocation.extend(argv)
+
+ for key in ["stdout", "stderr"]:
+ # Preserve caller intention, otherwise silence
+ if key not in kwargs and ctx.quiet:
+ kwargs[key] = subprocess.PIPE
+
+ # Prefer safe by default
+ if "check" not in kwargs:
+ kwargs["check"] = True
+
+ logger.debug(f"Executing `{invocation}`")
+ return subprocess.run(invocation, **kwargs)
+
+ def __call__(self, *argv, **kwargs):
+ return self.run(*argv, **kwargs)
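The `capture_stdout` decorator forces `stdout=subprocess.PIPE` and returns the captured output instead of the `CompletedProcess`. The following is a minimal stand-alone sketch written as a function-based decorator rather than the class used in the patch; `say` is a hypothetical command that asks the current Python interpreter to echo its argument.

```python
import subprocess
import sys


def capture_stdout(strip=False):
    def decorator(f):
        def wrapper(*argv, **kwargs):
            # Ensure stdout is captured, whatever the caller passed.
            kwargs["stdout"] = subprocess.PIPE
            out = f(*argv, **kwargs).stdout
            return out.strip() if strip else out
        return wrapper
    return decorator


@capture_stdout(strip=True)
def say(text, **kwargs):
    # Hypothetical command: echo the argument via the current interpreter.
    script = "import sys; print(sys.argv[1])"
    return subprocess.run([sys.executable, "-c", script, text],
                          check=True, **kwargs)


as_bytes = say("hello")
as_text = say("hello", universal_newlines=True)
print(as_bytes, as_text)
```

Without `universal_newlines=True`, `subprocess.run` yields bytes, so callers comparing the result against a `str` need to decode it or request text mode.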
diff --git a/dev/archery/archery/utils/git.py b/dev/archery/archery/utils/git.py
new file mode 100644
index 0000000..c611352
--- /dev/null
+++ b/dev/archery/archery/utils/git.py
@@ -0,0 +1,73 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+
+from .command import Command, capture_stdout
+
+
+# Decorator prepending argv with the git sub-command found with the method
+# name.
+def git_cmd(fn):
+ # function name is the subcommand
+ sub_cmd = fn.__name__.replace("_", "-")
+
+ def wrapper(self, *argv, **kwargs):
+ return fn(self, sub_cmd, *argv, **kwargs)
+ return wrapper
+
+
+class Git(Command):
+ def __init__(self, git_bin=None):
+ self.bin = git_bin if git_bin else os.environ.get("GIT", "git")
+
+ def run_cmd(self, cmd, *argv, git_dir=None, **kwargs):
+ """ Inject flags before sub-command in argv. """
+ opts = []
+ if git_dir and isinstance(git_dir, str):
+ opts.extend(("-C", git_dir))
+
+ return self.run(*opts, cmd, *argv, **kwargs)
+
+ @git_cmd
+ def clone(self, *argv, **kwargs):
+ return self.run_cmd(*argv, **kwargs)
+
+ @git_cmd
+ def checkout(self, *argv, **kwargs):
+ return self.run_cmd(*argv, **kwargs)
+
+ @git_cmd
+ def log(self, *argv, **kwargs):
+ return self.run_cmd(*argv, **kwargs)
+
+ @git_cmd
+ def rev_parse(self, *argv, **kwargs):
+ # Do not call self.head() here: head() calls rev_parse (recursion).
+ return self.run_cmd(*argv, **kwargs)
+
+ @capture_stdout(strip=True)
+ def head(self, **kwargs):
+ """ Return commit pointed by HEAD. """
+ return self.rev_parse("HEAD", **kwargs)
+
+ @capture_stdout(strip=True)
+ def current_branch(self, **kwargs):
+ return self.rev_parse("--abbrev-ref", "HEAD", **kwargs)
+
+
+git = Git()
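The `git_cmd` decorator derives the git sub-command from the method name and prepends it to the arguments. A minimal sketch of that mechanism follows, assuming a hypothetical `FakeGit` class that returns the argument list it would execute instead of spawning git.

```python
def git_cmd(fn):
    # The method name is the sub-command, e.g. rev_parse -> rev-parse.
    sub_cmd = fn.__name__.replace("_", "-")

    def wrapper(self, *argv, **kwargs):
        return fn(self, sub_cmd, *argv, **kwargs)
    return wrapper


class FakeGit:
    """Hypothetical stand-in: builds the invocation without running it."""

    def run_cmd(self, cmd, *argv, git_dir=None):
        # Global flags such as -C must come before the sub-command.
        opts = ["-C", git_dir] if git_dir else []
        return ["git"] + opts + [cmd] + list(argv)

    @git_cmd
    def rev_parse(self, *argv, **kwargs):
        return self.run_cmd(*argv, **kwargs)


invocation = FakeGit().rev_parse("--abbrev-ref", "HEAD", git_dir="/tmp/clone")
print(invocation)
```

Because the sub-command is injected as the first positional argument, each decorated method only needs to forward its arguments to `run_cmd`.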
diff --git a/.gitignore b/dev/archery/archery/utils/logger.py
similarity index 61%
copy from .gitignore
copy to dev/archery/archery/utils/logger.py
index 6bb237a..9d0feda 100644
--- a/.gitignore
+++ b/dev/archery/archery/utils/logger.py
@@ -15,46 +15,15 @@
# specific language governing permissions and limitations
# under the License.
-apache-rat-*.jar
-arrow-src.tar
-arrow-src.tar.gz
+import logging
-# Compiled source
-*.a
-*.dll
-*.o
-*.py[ocd]
-*.so
-*.so.*
-*.dylib
-.build_cache_dir
-dependency-reduced-pom.xml
-MANIFEST
-compile_commands.json
-build.ninja
+""" Global logger. """
+logger = logging.getLogger("archery")
-# Generated Visual Studio files
-*.vcxproj
-*.vcxproj.*
-*.sln
-*.iml
-# Linux perf sample data
-perf.data
-perf.data.old
+class LoggingContext:
+ def __init__(self, quiet=False):
+ self.quiet = quiet
-cpp/.idea/
-cpp/apidoc/xml/
-docs/example.gz
-docs/example1.dat
-docs/example3.dat
-python/.eggs/
-python/doc/
-.vscode
-.idea/
-.pytest_cache/
-pkgs
-.Rproj.user
-arrow.Rcheck/
-docker_cache
+ctx = LoggingContext()
diff --git a/dev/archery/archery/utils/source.py b/dev/archery/archery/utils/source.py
new file mode 100644
index 0000000..12dc735
--- /dev/null
+++ b/dev/archery/archery/utils/source.py
@@ -0,0 +1,141 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+
+from .git import git
+
+
+class ArrowSources:
+ """ ArrowSources is a companion class representing a directory containing
+ Apache Arrow's sources.
+ """
+ # Note that WORKSPACE is a reserved git revision name by this module to
+ # reference the current git workspace. In other words, this indicates to
+ # ArrowSources.at_revision that no cloning/checkout is required.
+ WORKSPACE = "WORKSPACE"
+
+ def __init__(self, path):
+ """ Initialize an ArrowSources
+
+ The caller must ensure that path is a valid Arrow source directory
+ (can be checked with ArrowSources.valid).
+
+ Parameters
+ ----------
+ path : str
+ """
+ assert isinstance(path, str) and ArrowSources.valid(path)
+ self.path = path
+
+ @property
+ def cpp(self):
+ """ Returns the cpp directory of the Arrow sources. """
+ return os.path.join(self.path, "cpp")
+
+ @property
+ def python(self):
+ """ Returns the python directory of the Arrow sources. """
+ return os.path.join(self.path, "python")
+
+ @property
+ def git_backed(self):
+ """ Indicate if the sources are backed by git. """
+ git_path = os.path.join(self.path, ".git")
+ return os.path.exists(git_path)
+
+ def at_revision(self, revision, clone_dir):
+ """ Return a copy of the current sources for a specified git revision.
+
+ This method may return the current object if no checkout is required.
+ The caller is responsible for removing the cloned repository directory.
+
+ The user can use the special WORKSPACE token to mean the current git
+ workspace (no checkout performed).
+
+ The second value of the returned tuple indicates if a clone was
+ performed.
+
+ Parameters
+ ----------
+ revision : str
+ Revision to checkout sources at.
+ clone_dir : str
+ Path to checkout the local clone.
+ """
+ if not self.git_backed:
+ raise ValueError(f"{self} is not backed by git")
+
+ if revision == ArrowSources.WORKSPACE:
+ return self, False
+
+ # A local clone is required to leave the current sources intact such
+ # that builds depending on said sources are not invalidated (or worse,
+ # subtly affected when re-invoking the generator).
+ git.clone("--local", self.path, clone_dir)
+ git.checkout(revision, git_dir=clone_dir)
+
+ return ArrowSources(clone_dir), True
+
+ @staticmethod
+ def valid(src):
+ """ Indicate if current sources are valid. """
+ if isinstance(src, ArrowSources):
+ return True
+ if isinstance(src, str):
+ cpp_path = os.path.join(src, "cpp")
+ cmake_path = os.path.join(cpp_path, "CMakeLists.txt")
+ return os.path.exists(cmake_path)
+ return False
+
+ @staticmethod
+ def find(path=None):
+ """ Infer the Arrow sources directory using various methods.
+
+ The following guesses are done in order until a valid match is found:
+
+ 1. Checks the given optional parameter.
+
+ 2. Checks if the environment variable `ARROW_SRC` is defined and uses
+ it.
+
+ 3. Checks if the current working directory (cwd) is an Arrow source
+ directory.
+
+ 4. Checks if this file is still in the original source
+ repository. If so, returns the relative path to the source
+ directory.
+ """
+
+ # Explicit via environment
+ env = os.environ.get("ARROW_SRC")
+
+ # Implicit via cwd
+ cwd = os.getcwd()
+
+ # Implicit via current file
+ this_dir = os.path.dirname(os.path.realpath(__file__))
+ this = os.path.join(this_dir, "..", "..", "..", "..")
+
+ for p in [path, env, cwd, this]:
+ if ArrowSources.valid(p):
+ return ArrowSources(p)
+
+ return None
+
+ def __repr__(self):
+ return f"{self.path}"
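`ArrowSources.find` tries an ordered list of candidate paths and keeps the first one that passes the `valid` check (presence of `cpp/CMakeLists.txt`). The sketch below reproduces that lookup on a temporary directory; `looks_like_arrow_sources` and `first_valid` are hypothetical names, and the directory layout is created only for the demonstration.

```python
import os
import tempfile


def looks_like_arrow_sources(path):
    # Non-string candidates (e.g. an unset environment variable) are skipped.
    if not isinstance(path, str):
        return False
    return os.path.exists(os.path.join(path, "cpp", "CMakeLists.txt"))


def first_valid(candidates):
    """Hypothetical helper: return the first candidate that looks valid."""
    for candidate in candidates:
        if looks_like_arrow_sources(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cpp"))
    open(os.path.join(root, "cpp", "CMakeLists.txt"), "w").close()

    missing = os.path.join(root, "missing")
    found = first_valid([None, missing, root])
    found_is_root = found == root
    nothing = first_valid([None, missing])

print(found_is_root, nothing)
```

The order of the candidate list encodes the precedence: an explicit argument beats the environment variable, which beats the implicit guesses.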
diff --git a/.gitignore b/dev/archery/setup.py
similarity index 60%
copy from .gitignore
copy to dev/archery/setup.py
index 6bb237a..2cf692c 100644
--- a/.gitignore
+++ b/dev/archery/setup.py
@@ -1,3 +1,4 @@
+#!/usr/bin/env python
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
@@ -15,46 +16,25 @@
# specific language governing permissions and limitations
# under the License.
-apache-rat-*.jar
-arrow-src.tar
-arrow-src.tar.gz
+import sys
+from setuptools import setup
-# Compiled source
-*.a
-*.dll
-*.o
-*.py[ocd]
-*.so
-*.so.*
-*.dylib
-.build_cache_dir
-dependency-reduced-pom.xml
-MANIFEST
-compile_commands.json
-build.ninja
-# Generated Visual Studio files
-*.vcxproj
-*.vcxproj.*
-*.sln
-*.iml
+if sys.version_info < (3, 6):
+ sys.exit('Python < 3.6 is not supported')
-# Linux perf sample data
-perf.data
-perf.data.old
-cpp/.idea/
-cpp/apidoc/xml/
-docs/example.gz
-docs/example1.dat
-docs/example3.dat
-python/.eggs/
-python/doc/
-
-.vscode
-.idea/
-.pytest_cache/
-pkgs
-.Rproj.user
-arrow.Rcheck/
-docker_cache
+setup(
+ name='archery',
+ version="0.1.0",
+ description='Apache Arrow Developers Tools',
+ url='http://github.com/apache/arrow',
+ maintainer='Arrow Developers',
+ maintainer_email='dev@arrow.apache.org',
+ packages=['archery'],
+ install_requires=['click', 'pandas'],
+ entry_points='''
+ [console_scripts]
+ archery=archery.cli:archery
+ ''',
+)
diff --git a/.gitignore b/dev/archery/tests/test_benchmarks.py
similarity index 51%
copy from .gitignore
copy to dev/archery/tests/test_benchmarks.py
index 6bb237a..d199a40 100644
--- a/.gitignore
+++ b/dev/archery/tests/test_benchmarks.py
@@ -15,46 +15,25 @@
# specific language governing permissions and limitations
# under the License.
-apache-rat-*.jar
-arrow-src.tar
-arrow-src.tar.gz
+from archery.benchmark.core import Benchmark
+from archery.benchmark.compare import BenchmarkComparator
-# Compiled source
-*.a
-*.dll
-*.o
-*.py[ocd]
-*.so
-*.so.*
-*.dylib
-.build_cache_dir
-dependency-reduced-pom.xml
-MANIFEST
-compile_commands.json
-build.ninja
-# Generated Visual Studio files
-*.vcxproj
-*.vcxproj.*
-*.sln
-*.iml
+def test_benchmark_comparator():
+ unit = "micros"
-# Linux perf sample data
-perf.data
-perf.data.old
+ assert not BenchmarkComparator(
+ Benchmark("contender", unit, True, [10]),
+ Benchmark("baseline", unit, True, [20])).regression
-cpp/.idea/
-cpp/apidoc/xml/
-docs/example.gz
-docs/example1.dat
-docs/example3.dat
-python/.eggs/
-python/doc/
+ assert BenchmarkComparator(
+ Benchmark("contender", unit, False, [10]),
+ Benchmark("baseline", unit, False, [20])).regression
-.vscode
-.idea/
-.pytest_cache/
-pkgs
-.Rproj.user
-arrow.Rcheck/
-docker_cache
+ assert BenchmarkComparator(
+ Benchmark("contender", unit, True, [20]),
+ Benchmark("baseline", unit, True, [10])).regression
+
+ assert not BenchmarkComparator(
+ Benchmark("contender", unit, False, [20]),
+ Benchmark("baseline", unit, False, [10])).regression
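The four assertions above pin down the direction of a regression for both kinds of metrics. The rule they exercise can be sketched as follows. This is an assumption-laden illustration, not the `BenchmarkComparator` implementation: the relative change formula and the 5% threshold are hypothetical choices that happen to satisfy the same four cases.

```python
def relative_change(baseline, contender):
    # Assumed formula: signed change relative to the baseline value.
    return (contender - baseline) / abs(baseline)


def is_regression(baseline, contender, less_is_better, threshold=0.05):
    """Hypothetical rule: flag a worsening larger than the threshold."""
    change = relative_change(baseline, contender)
    # Normalize so that a positive value always means "got worse".
    worsening = change if less_is_better else -change
    return worsening > threshold


# Mirrors the four cases of test_benchmark_comparator.
cases = [
    is_regression(baseline=20, contender=10, less_is_better=True),
    is_regression(baseline=20, contender=10, less_is_better=False),
    is_regression(baseline=10, contender=20, less_is_better=True),
    is_regression(baseline=10, contender=20, less_is_better=False),
]
print(cases)
```

For a time-like metric (less is better) a higher contender value is a regression, while for a throughput-like metric the sign flips.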
diff --git a/docs/source/developers/benchmarks.rst b/docs/source/developers/benchmarks.rst
new file mode 100644
index 0000000..d0e6f1b
--- /dev/null
+++ b/docs/source/developers/benchmarks.rst
@@ -0,0 +1,127 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _benchmarks:
+
+**********
+Benchmarks
+**********
+
+Archery
+=======
+
+``archery`` is a Python library and command line utility made to interact with
+Arrow's sources. Its main feature is the benchmarking process.
+
+Installation
+~~~~~~~~~~~~
+
+The simplest way to install archery is with pip from the top-level directory.
+It is recommended to use the ``-e,--editable`` flag so that pip doesn't copy
+the module files but uses the actual sources.
+
+.. code-block:: shell
+
+ pip install -e dev/archery
+ archery --help
+
+ # optional: enable bash/zsh autocompletion
+ eval "$(_ARCHERY_COMPLETE=source archery)"
+
+Comparison
+==========
+
+One goal with benchmarking is to detect performance regressions. To this end,
+``archery`` implements a benchmark comparison facility via the ``benchmark
+diff`` command.
+
+In the default invocation, it will compare the current source (known as the
+current workspace in git) with the local master branch.
+
+For more information and several example invocations, run
+``archery benchmark diff --help``.
+
+Iterating efficiently
+~~~~~~~~~~~~~~~~~~~~~
+
+Iterating on benchmark development can be a tedious process due to long
+build times and long run times. ``archery benchmark diff`` provides two methods
+to reduce this overhead.
+
+First, the benchmark command supports comparing existing
+build directories. This can be paired with the ``--preserve`` flag to
+avoid rebuilding sources from scratch.
+
+.. code-block:: shell
+
+ # The first invocation clones and checks out in a temporary directory. The
+ # directory is preserved with --preserve
+ archery benchmark diff --preserve
+
+ # Modify C++ sources
+
+ # Re-run benchmark in the previously created build directory.
+ archery benchmark diff /tmp/arrow-bench*/{WORKSPACE,master}/build
+
+Second, the benchmark command supports filtering suites (``--suite-filter``)
+and benchmarks (``--benchmark-filter``); both options support regular
+expressions.
+
+.. code-block:: shell
+
+ # Reuse a previous run, but only run benchmarks matching
+ # `Kernel` in suites matching `compute-aggregate`.
+ archery benchmark diff \
+ --suite-filter=compute-aggregate --benchmark-filter=Kernel \
+ /tmp/arrow-bench*/{WORKSPACE,master}/build
+
+Both methods can be combined.
+
+Regression detection
+====================
+
+Writing a benchmark
+~~~~~~~~~~~~~~~~~~~
+
+1. The benchmark command will filter (by default) benchmarks with the regular
+ expression ``^Regression``. This way, not all benchmarks are run by default.
+ Thus, if you want your benchmark to be verified for regression
+ automatically, the name must match.
+
+2. The benchmark command will run with the ``--benchmark_repetitions=K``
+ option for statistical significance. Thus, a benchmark should not override
+ the repetitions in the (C++) benchmark's arguments definition.
+
+3. Due to #2, a benchmark should run sufficiently fast. Often, when the input
+ does not fit in the CPU cache (L2/L3), the benchmark will be memory bound
+ instead of CPU bound. In this case, the input can be downsized.
+
+Scripting
+=========
+
+``archery`` is written as a Python library with a command line frontend. The
+library can be imported to automate some tasks.
+
+Some invocations of the command line interface can be quite verbose due to build
+output. This can be controlled/avoided with the ``--quiet`` option, e.g.
+
+.. code-block:: shell
+
+ archery --quiet benchmark diff --benchmark-filter=Kernel
+ {"benchmark": "BenchSumKernel/32768/0", "change": -0.6498, "regression": true, ...
+ {"benchmark": "BenchSumKernel/32768/1", "change": 0.01553, "regression": false, ...
+ ...
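Since `benchmark diff` prints one JSON object per line, its output lends itself to simple post-processing. The sketch below shows one way to list regressed benchmarks. It assumes the output has been captured into a string, and the two sample records are illustrative: their names and values are made up, and only the `benchmark`, `change` and `regression` keys are taken from the encoder in this patch.

```python
import json

# Illustrative records, not real benchmark results.
sample_output = """\
{"benchmark": "ExampleA", "change": -0.25, "regression": true}
{"benchmark": "ExampleB", "change": 0.01, "regression": false}
"""


def regressions(lines):
    """Return the names of the benchmarks flagged as regressions."""
    records = (json.loads(line) for line in lines if line.strip())
    return [record["benchmark"] for record in records
            if record["regression"]]


regressed = regressions(sample_output.splitlines())
print(regressed)
```

Parsing line by line avoids having to wrap the whole output into a single JSON document.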
diff --git a/docs/source/developers/index.rst b/docs/source/developers/index.rst
index a58f969..d309638 100644
--- a/docs/source/developers/index.rst
+++ b/docs/source/developers/index.rst
@@ -22,4 +22,5 @@
cpp
python
integration
+ benchmarks
documentation
diff --git a/python/.gitignore b/python/.gitignore
index 3346aa6..8f08f93 100644
--- a/python/.gitignore
+++ b/python/.gitignore
@@ -25,8 +25,6 @@ pyarrow/include
build
# setup.py dist directory
dist
-# Egg metadata
-*.egg-info
# Coverage
.coverage
coverage.xml