You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/03 21:53:26 UTC

[GitHub] [arrow] nirandaperera commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

nirandaperera commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r645107090



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+

Review comment:
       Shall we add the comment by @pitrou in Zulip here. 
   ```
   Simple questions for whether a function is a scalar function:
   - Do all inputs have the same (broadcasted) length?
   - Does the Nth element in the output only depend on the Nth element of each input?
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's API. These `structs` contain static *Call* methods representing the unique implementation for each argument signature. Apache Arrow conforms to SFINAE and aliased-template conditionals to generalize kernel implementations for different argument types. Also, kernel implementations can have the *constexpr* specifier if applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, kernel returns zero and sets that `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provides full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses Google test framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid tests. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+

Review comment:
       Shall we also add untyped test fixtures (`TEST_F`s)?

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, kernel returns zero and sets that `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provides full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses Google test framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid tests. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======

Review comment:
       I think it'd be nicer if we could discuss about some helper methods in `codegen_internal.h`.

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, kernel returns zero and sets that `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+

Review comment:
       I think we should discuss the guarantees provided by the `compute` infrastructure as well. 
   ex: for scalar functions, if multiple arrays are passed, the compute infrastructure checks for nullity, guarantees that they are of same size, etc




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org