Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/11 21:40:58 UTC

[GitHub] [arrow] edponce opened a new pull request #10296: [C++] Add documentation for authoring compute kernels

edponce opened a new pull request #10296:
URL: https://github.com/apache/arrow/pull/10296


   This PR extends the compute layer documentation by describing a developer's process for authoring new compute functions. It describes the commonly used files, data structures, and functions for understanding compute functions, and provides a tutorial with examples.





[GitHub] [arrow] bkietz commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r633591259



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. It can generally process Array or Scalar values. The size of the
+output is the same as the size (or broadcasted size, when mixing
+Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's API. These `structs` contain static *Call* methods representing the unique implementation for each argument signature. Apache Arrow conforms to SFINAE and aliased-template conditionals to generalize kernel implementations for different argument types. Also, kernel implementations can have the *constexpr* specifier if applicable.

Review comment:
       The structures with static `Call` functions (such as `struct Add`) which are used to aid in efficient construction of some kernels are not identical to the kernels:
   
   ```suggestion
   Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
   
   Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
   ```
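
To make the distinction concrete, here is a minimal, self-contained sketch of the idea in the suggestion above. It does not use Arrow's headers: `ToyKernel`, `ToyAdd`, `ExecBinary`, `MakeBinaryKernel`, and `RunKernel` are hypothetical names invented for illustration, and the real `ScalarKernel` has a different, richer signature. The sketch only shows how an op struct with a static `Call` can feed one function template whose instantiations become the `exec` pointers of several kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a scalar kernel: only function
// pointers, no virtual methods. This is NOT Arrow's real ScalarKernel.
struct ToyKernel {
  // Optional: prepare any state before execution (may be nullptr).
  void (*init)();
  // Required: compute out[i] from left[i] and right[i] for i in [0, length).
  void (*exec)(const void* left, const void* right, void* out,
               std::size_t length);
};

// An "op" struct in the spirit of `struct Add`: it describes the
// per-element computation and is not itself a kernel.
struct ToyAdd {
  template <typename T>
  static constexpr T Call(T left, T right) {
    return left + right;
  }
};

// One function template; each instantiation becomes the exec method of a
// kernel that differs from its siblings only in element type.
template <typename Op, typename T>
void ExecBinary(const void* left, const void* right, void* out,
                std::size_t length) {
  const T* l = static_cast<const T*>(left);
  const T* r = static_cast<const T*>(right);
  T* o = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) {
    o[i] = Op::template Call<T>(l[i], r[i]);
  }
}

// Builds a kernel with no init step and a type-specific exec pointer.
template <typename Op, typename T>
ToyKernel MakeBinaryKernel() {
  return ToyKernel{nullptr, &ExecBinary<Op, T>};
}

// Runs a kernel over two equally sized vectors and returns the output.
template <typename T>
std::vector<T> RunKernel(const ToyKernel& kernel, const std::vector<T>& left,
                         const std::vector<T>& right) {
  assert(left.size() == right.size());
  if (kernel.init != nullptr) kernel.init();
  std::vector<T> out(left.size());
  kernel.exec(left.data(), right.data(), out.data(), left.size());
  return out;
}
```

In this sketch `MakeBinaryKernel<ToyAdd, std::int32_t>()` and `MakeBinaryKernel<ToyAdd, double>()` yield two distinct kernels for the same "add" function, while `ToyAdd` itself remains a helper rather than a kernel.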

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory structure and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is null, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred. Multi-word names are allowed: the string version of a name separates words with underscores, while the C++ function name uses the camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+2. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+3. Identify the compute function "kind" based on its operation and step 2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators do not have the same form for every function `kind`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If no suitable generator is available, you will need to define one for your particular case.
+
+2. Create the kernels by invoking the kernel generators.
+
+3. Register the kernels in the corresponding registry along with the function's `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+

Review comment:
       I think a simple (un-templated, non-suite) C++ test case is a necessary code snippet. For example,
   
   ```c++
   // scalar_arithmetic_test.cc
   TEST(AbsoluteValue, IntegralInputs) {
     // Run the same checks for every signed integer type.
     for (auto type : {int8(), int16(), int32(), int64()}) {
       CheckScalarUnary("absolute_value", type, "[]", type, "[]");
   
       CheckScalarUnary("absolute_value", type, "[0, -1, 1, 2, -2, 3, -3]", type,
                        "[0, 1, 1, 2, 2, 3, 3]");
     }
   }
   }
   ```
   
   Note that the above compiles and executes *before* adding the absolute_value function (IMHO it's a useful clarification of intended behavior to add a failing test like this as early as possible in the addition of a new function).
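
The reason such a test can compile before the function exists is that the function is looked up by its string name at run time. The following self-contained sketch illustrates that mechanism with a toy registry; it is an assumption-laden illustration and does not use Arrow. `ToyRegistry`, `CheckUnary`, `AbsoluteValue`, and `RunExample` are hypothetical names, and the toy reports a missing function with a `bool` where the real library would return an error.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Int64Column = std::vector<std::int64_t>;
using UnaryFunction = std::function<Int64Column(const Int64Column&)>;

// Hypothetical miniature registry keyed by function name.
class ToyRegistry {
 public:
  void Add(const std::string& name, UnaryFunction fn) {
    functions_[name] = std::move(fn);
  }

  // Returns false when nothing is registered under `name`; otherwise
  // writes the result to `out` and returns true.
  bool Call(const std::string& name, const Int64Column& input,
            Int64Column* out) const {
    auto it = functions_.find(name);
    if (it == functions_.end()) return false;
    *out = it->second(input);
    return true;
  }

 private:
  std::map<std::string, UnaryFunction> functions_;
};

// Mimics a test helper: true only if the named function exists and
// produces the expected output.
bool CheckUnary(const ToyRegistry& registry, const std::string& name,
                const Int64Column& input, const Int64Column& expected) {
  Int64Column actual;
  return registry.Call(name, input, &actual) && actual == expected;
}

// Toy implementation; ignores the INT64_MIN overflow case for brevity.
Int64Column AbsoluteValue(const Int64Column& input) {
  Int64Column out;
  out.reserve(input.size());
  for (std::int64_t v : input) out.push_back(v < 0 ? -v : v);
  return out;
}

void RunExample() {
  ToyRegistry registry;
  const Int64Column input{0, -1, 1, 2, -2};
  const Int64Column expected{0, 1, 1, 2, 2};
  // Compiles and runs, but the check fails: the name is not registered yet.
  assert(!CheckUnary(registry, "absolute_value", input, expected));
  // After registration the same check passes without any change to the test.
  registry.Add("absolute_value", AbsoluteValue);
  assert(CheckUnary(registry, "absolute_value", input, expected));
}
```

Because the test body refers to the function only through a string, writing it first pins down the intended behavior without requiring any declaration of the new function.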

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+Function documentation
+----------------------
+

Review comment:
       Please describe this briefly, mentioning that it's used for interactive help in the bindings
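
As a rough illustration of what "interactive help in the bindings" could mean, the sketch below renders a help string from a documentation record. It is self-contained and does not use Arrow: `ToyFunctionDoc` and `RenderHelp` are hypothetical names, the fields are a simplified guess at what such a record carries (a summary, a description, and argument names), and the output layout is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified documentation record; loosely modeled on the
// idea of FunctionDoc, not Arrow's actual definition.
struct ToyFunctionDoc {
  std::string summary;                 // one-line description
  std::string description;             // longer free-form text
  std::vector<std::string> arg_names;  // one name per positional argument
};

// Renders the kind of text a language binding might print for help().
std::string RenderHelp(const std::string& name, const ToyFunctionDoc& doc) {
  std::string text = name + "(";
  for (std::size_t i = 0; i < doc.arg_names.size(); ++i) {
    if (i > 0) text += ", ";
    text += doc.arg_names[i];
  }
  text += ")\n\n" + doc.summary + "\n\n" + doc.description + "\n";
  return text;
}

void RunExample() {
  const ToyFunctionDoc doc{"Calculate the absolute value",
                           "Nulls in the input produce nulls in the output.",
                           {"x"}};
  const std::string help = RenderHelp("absolute_value", doc);
  // The signature line is built from the function name and argument names.
  assert(help.rfind("absolute_value(x)\n", 0) == 0);
}
```

A binding can build this text once per registered function, which is why each variant of a function (for example a `_checked` one) needs its own documentation record.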

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+Function options
+----------------
+

Review comment:
       Add an explanation and an example here please
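
One shape such an example might take is sketched below. It is self-contained and does not use Arrow itself: `ToyFunctionOptions`, `ClampOptions`, and `ClampKernel` are hypothetical names. The only idea taken from the document is that an options object configures a function's behavior. Falling back to defaults when no options are supplied is an assumption of this sketch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical base class standing in for an options hierarchy.
struct ToyFunctionOptions {
  virtual ~ToyFunctionOptions() = default;
};

// Options for a hypothetical "clamp" function.
struct ClampOptions : ToyFunctionOptions {
  ClampOptions(double lo, double hi) : lo(lo), hi(hi) {}
  static ClampOptions Defaults() { return ClampOptions(0.0, 1.0); }
  double lo;
  double hi;
};

// Toy kernel: clamps each value to [lo, hi].
// A null pointer (or options of an unrelated type) means "use the defaults".
std::vector<double> ClampKernel(const std::vector<double>& values,
                                const ToyFunctionOptions* options = nullptr) {
  const ClampOptions defaults = ClampOptions::Defaults();
  const auto* opts = dynamic_cast<const ClampOptions*>(options);
  if (opts == nullptr) opts = &defaults;

  std::vector<double> out;
  out.reserve(values.size());
  for (double v : values) {
    out.push_back(std::min(std::max(v, opts->lo), opts->hi));
  }
  return out;
}
```

The caller owns the options object and passes a pointer, so the same kernel serves both the default and the configured case without a second overload.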

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate

Review comment:
       ```suggestion
   Hash aggregate
   ~~~~~~~~~~~~~~
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the layout of files and directories and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions, which invoke the corresponding operation in the [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after its compute function, with `Call` methods parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may overflow, the "checked" variant is a class suffixed with `Checked` that uses assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that have no valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is still provided but should trigger an error with `DCHECK(false)`, with a message noting that it is included only for the purposes of instantiability from the "arithmetic kernel generator", and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * `ScalarExecutor` applies a scalar function to a batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if that reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: the string name separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and the types/shapes identified in step 2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts, or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with the function's `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array of the given type from a JSON string, for example `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------

Review comment:
       Try to avoid ambiguity between the instance of `compute::Function` which is being added to the registry and the convenience function which is being added to `api_scalar.h`. The former is the canonical definition. The latter is a wrapper for easy use from C++, and probably isn't necessary for most compute functions. In fact for simplicity it might be better to avoid modifying `api_scalar.h` in this tutorial
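
To make the distinction concrete, here is a toy, self-contained sketch that does not use Arrow. `ToyRegistry`, `DefaultRegistry`, and `CallToyFunction` are hypothetical names that only imitate the roles of the function registry and `CallFunction`. The registered entry is the canonical definition; the `AbsoluteValue` wrapper adds nothing beyond a friendlier C++ spelling.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Values = std::vector<long long>;
using ToyFunction = std::function<Values(const Values&)>;

// Toy registry: a function's canonical home is its entry here, keyed by name.
class ToyRegistry {
 public:
  void AddFunction(const std::string& name, ToyFunction fn) {
    functions_[name] = std::move(fn);
  }
  const ToyFunction& GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    if (it == functions_.end()) {
      throw std::invalid_argument("unknown function: " + name);
    }
    return it->second;
  }

 private:
  std::map<std::string, ToyFunction> functions_;
};

// Single default registry, populated once with the tutorial's example function.
// (Overflow for the most negative value is ignored in this sketch.)
ToyRegistry& DefaultRegistry() {
  static ToyRegistry registry = [] {
    ToyRegistry r;
    r.AddFunction("absolute_value", [](const Values& in) {
      Values out;
      for (long long v : in) out.push_back(v < 0 ? -v : v);
      return out;
    });
    return r;
  }();
  return registry;
}

// Generic entry point: look the function up by name and run it.
Values CallToyFunction(const std::string& name, const Values& args) {
  return DefaultRegistry().GetFunction(name)(args);
}

// Optional convenience wrapper: a thin alias over the by-name call.
Values AbsoluteValue(const Values& args) {
  return CallToyFunction("absolute_value", args);
}
```

Both `CallToyFunction("absolute_value", in)` and `AbsoluteValue(in)` reach the same registered implementation, so a tutorial could register the function and stop there without touching the wrapper layer.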

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type (`Function::Kind`) to identify the kind of a compute function; see
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each element independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's API. These `structs` contain static *Call* methods representing the unique implementation for each argument signature. Arrow relies on SFINAE and alias-template conditionals to generalize kernel implementations across argument types. Kernel implementations can also be marked *constexpr* where applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions, which invoke the corresponding operation in the [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is a class named after its compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+    * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is still provided, but it should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if that reference is null, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are fine: the registered string name separates words with underscores, and the C++ function name uses CamelCase.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the compute function, together with its kernels and its `FunctionDoc`, in the corresponding registry.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array from a JSON literal, e.g. `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")

Review comment:
       In general, please avoid use of macros like `SCALAR_ARITHMETIC_UNARY()` and other context specific helpers for this walk through, as they obscure the underlying C++ and require a user to look up the macro's definition to understand what they're writing.
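
To make the point above concrete, here is a minimal, self-contained sketch of what a macro-free definition could look like. `ArithmeticOptions`, `Kernel`, `Registry()`, and `CallFunction` below are toy stand-ins written for this example, not Arrow's real classes or signatures; only the names `AbsoluteValue`, `"absolute_value"`, and `"absolute_value_checked"` come from the quoted text. The assumption is that the macro picks one of the two registered names based on an overflow-checking option and then forwards to `CallFunction`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <map>
#include <optional>
#include <string>

// Toy stand-in for this sketch only; NOT Arrow's ArithmeticOptions.
struct ArithmeticOptions {
  bool check_overflow = false;
};

// A "kernel" here maps one int32 to an optional result (nullopt = error).
using Kernel = std::function<std::optional<int32_t>(int32_t)>;

// Stand-in for the function registry: registered name -> implementation.
const std::map<std::string, Kernel>& Registry() {
  static const std::map<std::string, Kernel> registry = {
      {"absolute_value",
       [](int32_t x) -> std::optional<int32_t> {
         // Unchecked variant: negate in unsigned arithmetic so the smallest
         // int32 wraps around instead of hitting signed-overflow UB.
         if (x >= 0) return x;
         return static_cast<int32_t>(0u - static_cast<uint32_t>(x));
       }},
      {"absolute_value_checked",
       [](int32_t x) -> std::optional<int32_t> {
         // Checked variant: report an error for the one unrepresentable case.
         if (x == std::numeric_limits<int32_t>::min()) return std::nullopt;
         return x < 0 ? -x : x;
       }},
  };
  return registry;
}

// Stand-in for the one-shot entry point: look up by name, then execute.
std::optional<int32_t> CallFunction(const std::string& name, int32_t arg) {
  assert(!name.empty());
  auto it = Registry().find(name);
  if (it == Registry().end()) return std::nullopt;
  return it->second(arg);
}

// What the macro would hide: choose the registered name, then forward.
std::optional<int32_t> AbsoluteValue(int32_t arg, ArithmeticOptions options = {}) {
  const char* func_name =
      options.check_overflow ? "absolute_value_checked" : "absolute_value";
  return CallFunction(func_name, arg);
}
```

With this shape, `AbsoluteValue(-5)` goes through the unchecked name and yields 5, while passing options with `check_overflow = true` makes a call on the smallest `int32_t` report an error instead of wrapping. Spelling the function out this way keeps the name selection visible to the reader.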







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776153562



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------

Review comment:
       It may be helpful to say something about impacts on language bindings, assuming that most of these will use what is in the registry rather than the C++ convenience function.
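
One way to picture the concern: a binding that discovers functions by enumerating the registry never sees the C++ convenience function, so anything that wrapper does (for example, picking a `_checked` name from an option) has to be reachable through registered names. The sketch below uses a hypothetical `ToyRegistry` class and `MakeRegistry` helper invented for this example; they are stand-ins, not `arrow::compute::FunctionRegistry`, and the `double`-only function type is a simplification.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy registry for illustration; not arrow::compute::FunctionRegistry.
class ToyRegistry {
 public:
  using Fn = std::function<double(double)>;

  // Returns false when the name is already taken (names must be unique).
  bool AddFunction(const std::string& name, Fn fn) {
    return functions_.emplace(name, std::move(fn)).second;
  }

  // Returns nullptr when no function is registered under `name`.
  const Fn* GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? nullptr : &it->second;
  }

  // Sorted list of names; a binding could generate one wrapper per entry.
  std::vector<std::string> GetFunctionNames() const {
    std::vector<std::string> names;
    for (const auto& entry : functions_) names.push_back(entry.first);
    return names;
  }

 private:
  std::map<std::string, Fn> functions_;
};

// Builds a registry holding the two names from the walkthrough.
ToyRegistry MakeRegistry() {
  ToyRegistry registry;
  auto abs_fn = [](double x) { return x < 0 ? -x : x; };
  bool ok = registry.AddFunction("absolute_value", abs_fn);
  ok = registry.AddFunction("absolute_value_checked", abs_fn) && ok;
  assert(ok);
  (void)ok;  // silence unused-variable warnings when NDEBUG is defined
  return registry;
}
```

Here a lookup of `"AbsoluteValue"` (the C++ spelling) finds nothing, while `"absolute_value"` and `"absolute_value_checked"` are both listed and callable by name. That is the view a registry-driven binding would have.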







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776191711



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
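
To make the terminology above concrete, here is a minimal sketch of a registry that looks up a function by name and dispatches to a kernel by input type. All names in it (`ToyRegistry`, `Value`, `TypeId`, `MakeToyRegistry`) are hypothetical and are not Arrow's API; the real `FunctionRegistry` and its dispatch logic are considerably richer.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; these are not Arrow classes.
enum class TypeId { kInt64, kFloat64 };

// A value tagged with its type; only the field matching `type` is meaningful.
struct Value {
  TypeId type;
  std::int64_t i64;
  double f64;
};

// A "kernel" is one implementation of a function for one input type.
using Kernel = std::function<Value(const Value&)>;

class ToyRegistry {
 public:
  // Register one kernel of function `name` for the given input type.
  void AddKernel(const std::string& name, TypeId input, Kernel kernel) {
    assert(kernel != nullptr);  // a kernel must be callable
    kernels_[std::make_pair(name, input)] = std::move(kernel);
  }

  // "Dispatching": pick the kernel matching the function name and the
  // argument's type, then execute it.
  Value Call(const std::string& name, const Value& arg) const {
    auto it = kernels_.find(std::make_pair(name, arg.type));
    if (it == kernels_.end()) {
      throw std::invalid_argument("no kernel matches function " + name);
    }
    return it->second(arg);
  }

 private:
  std::map<std::pair<std::string, TypeId>, Kernel> kernels_;
};

// One function ("negate") with two kernels, one per supported type.
inline ToyRegistry MakeToyRegistry() {
  ToyRegistry registry;
  registry.AddKernel("negate", TypeId::kInt64, [](const Value& v) {
    return Value{TypeId::kInt64, -v.i64, 0.0};
  });
  registry.AddKernel("negate", TypeId::kFloat64, [](const Value& v) {
    return Value{TypeId::kFloat64, 0, -v.f64};
  });
  return registry;
}
```

In this sketch, `MakeToyRegistry().Call("negate", Value{TypeId::kInt64, 5, 0.0})` selects the int64 kernel, while a name or type with no registered kernel raises an error instead of dispatching.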
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's API. These `structs` contain static *Call* methods, one implementation per argument signature. Arrow relies on SFINAE and alias-template conditionals to generalize kernel implementations across argument types. Kernel implementations can also be declared *constexpr* where applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
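
As a rough illustration of the two behaviors described above, the sketch below shows an overflow-reporting add next to a wrapping add for 32-bit signed integers. The names `ToyAddWithOverflow` and `ToyWrappingAdd` are hypothetical; the real definitions are generated by the `*OPS_WITH_OVERFLOW` macros on top of `portable_snippets` and may differ in detail. The only convention assumed here is that a `WithOverflow` function returns true when overflow occurred.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical "checked" helper: returns true if a + b would overflow,
// otherwise stores the sum in *out and returns false.
inline bool ToyAddWithOverflow(std::int32_t a, std::int32_t b, std::int32_t* out) {
  assert(out != nullptr);
  constexpr std::int32_t kMax = std::numeric_limits<std::int32_t>::max();
  constexpr std::int32_t kMin = std::numeric_limits<std::int32_t>::min();
  // Test the bounds before adding, since signed overflow is undefined behavior.
  if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) {
    return true;  // overflow; *out is left untouched
  }
  *out = a + b;
  return false;
}

// Hypothetical "default variant" helper: wraps around instead of failing.
// The addition is done in unsigned arithmetic, where wraparound is well
// defined. Converting back to signed is modular on mainstream compilers
// (and guaranteed to be so from C++20 on).
inline std::int32_t ToyWrappingAdd(std::int32_t a, std::int32_t b) {
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(a) +
                                   static_cast<std::uint32_t>(b));
}
```

A checked kernel would turn a true return value into an error `Status`, whereas the default kernel would return the wrapped result.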
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed: the string name separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+

Review comment:
       May also want to link to the [primer documentation](https://github.com/google/googletest/blob/master/docs/primer.md), as the different types of tests are a source for confusion, as indicated in https://stackoverflow.com/questions/58600728/what-is-the-difference-between-test-test-f-and-test-p







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776206399



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
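
The output-size rule for Scalar functions can be sketched as follows. `ToyDatum` and `ToyAdd` are hypothetical names, and a plain `std::vector<double>` stands in for Arrow's Array and Scalar types; this only illustrates the broadcasting behavior, not how Arrow's executor implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Datum holding either a Scalar or an Array.
struct ToyDatum {
  bool is_scalar;
  std::vector<double> values;  // exactly one element when is_scalar is true
};

// Element-wise add. A scalar input is broadcast against an array input,
// so the output has the same length as the array.
inline ToyDatum ToyAdd(const ToyDatum& left, const ToyDatum& right) {
  // Two arrays must agree in length; a scalar adapts to the other input.
  assert(left.is_scalar || right.is_scalar ||
         left.values.size() == right.values.size());
  if (left.is_scalar && right.is_scalar) {
    return ToyDatum{true, {left.values[0] + right.values[0]}};
  }
  const std::size_t length =
      left.is_scalar ? right.values.size() : left.values.size();
  ToyDatum out{false, std::vector<double>(length)};
  for (std::size_t i = 0; i < length; ++i) {
    const double l = left.is_scalar ? left.values[0] : left.values[i];
    const double r = right.is_scalar ? right.values[0] : right.values[i];
    out.values[i] = l + r;
  }
  return out;
}
```

Adding a three-element array and a scalar yields a three-element array, while adding two scalars yields a scalar.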
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
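
A minimal sketch of that idea, using only the standard library: one operation struct with type-constrained `Call` overloads, and one function template that turns it into a loop for each element type. `ToyAbs`, `ApplyUnary`, and the `EnableIf*` aliases are hypothetical names; Arrow's actual kernel generators and type traits are more involved.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Simplified stand-ins for type-constraint aliases, built on <type_traits>.
template <typename T>
using EnableIfFloating =
    typename std::enable_if<std::is_floating_point<T>::value, T>::type;
template <typename T>
using EnableIfUnsigned =
    typename std::enable_if<std::is_integral<T>::value &&
                                std::is_unsigned<T>::value,
                            T>::type;

// Hypothetical operation: absolute value. Signed integers are left out
// here because negating the minimum value needs overflow handling.
struct ToyAbs {
  // Selected only when T is a floating-point type.
  template <typename T>
  static constexpr EnableIfFloating<T> Call(T arg) {
    return arg < T(0) ? -arg : arg;
  }
  // Selected only when T is an unsigned integer type: nothing to do.
  template <typename T>
  static constexpr EnableIfUnsigned<T> Call(T arg) {
    return arg;
  }
};

// One function template yields the element-wise loop for every type
// that the operation supports.
template <typename Op, typename T>
std::vector<T> ApplyUnary(const std::vector<T>& input) {
  std::vector<T> out;
  out.reserve(input.size());
  for (const T& value : input) {
    out.push_back(Op::template Call<T>(value));
  }
  return out;
}
```

Instantiating `ApplyUnary<ToyAbs, double>` and `ApplyUnary<ToyAbs, std::uint32_t>` produces two distinct loops from the same source, which is the property that makes templates attractive for families of kernels.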
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed: the string name separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+      T result = 0;
+      if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+        *st = Status::Invalid("overflow");
+      }
+      return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
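
To make the difference between the two variants concrete, here is a minimal standalone sketch that does not depend on Arrow. `AbsWrapping` and `AbsChecked` are hypothetical names standing in for the default and `_checked` kernels. The only signed input whose absolute value is not representable is the most negative value of the type.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for the default (wrap-around) variant. Negation goes
// through the unsigned type so that the most negative value wraps to itself
// instead of invoking undefined behavior. The final conversion back to T is
// modular on mainstream compilers (and guaranteed to be since C++20).
template <typename T>
T AbsWrapping(T value) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integers only");
  using U = typename std::make_unsigned<T>::type;
  return value < 0 ? static_cast<T>(U{0} - static_cast<U>(value)) : value;
}

// Hypothetical stand-in for the checked variant. Returns false and writes zero
// when the result is not representable, mirroring the "return zero and flag
// the error" behavior described for the checked kernels.
template <typename T>
bool AbsChecked(T value, T* out) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integers only");
  if (value == std::numeric_limits<T>::min()) {
    *out = 0;
    return false;
  }
  *out = value < 0 ? static_cast<T>(-value) : value;
  return true;
}
```

With 32-bit inputs, `AbsWrapping` maps `-5` to `5` and leaves the minimum value unchanged, while `AbsChecked` reports failure for the minimum value.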
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` did not exist yet, so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```
+
+1. Register the kernels in the corresponding registry along with their `FunctionDoc` objects.
+```C++
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
+```
+
+Example of Unary String Kernel: ASCII Reverse
+=============================================
+
+1. Name
+    * String literal: "ascii_reverse"
+    * C++ function names: `AsciiReverse`
+1. Input/output types: String-like (Printable ASCII)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: String transform
+1. Arity: Unary
+
+
+Example of Binary Arithmetic Kernel: Hypotenuse of Right-angled Triangle
+========================================================================
+
+1. Name
+    * String literal: "hypotenuse"
+    * C++ function names: `Hypotenuse`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Binary (length of each leg)
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> Hypotenuse(const Datum& left, const Datum& right, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```

Review comment:
       Use restructured text code block







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776190406



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
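
As a rough illustration of the terminology above, the following standalone sketch models a registry that maps a function name to its kernels and dispatches on the input type. All names (`TypeId`, `ToyKernel`, `ToyRegistry`) are hypothetical and much simpler than Arrow's actual classes. Matching here is exact on a single input type, whereas real dispatch has to consider full argument signatures.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type tag; Arrow uses full data type descriptions instead.
enum class TypeId { kInt64, kDouble };

// A "kernel": one implementation of a function, bound to one input type.
struct ToyKernel {
  TypeId input_type;
  std::string label;
};

// A "function registry": function name -> kernels implementing that function.
class ToyRegistry {
 public:
  void AddKernel(const std::string& function, ToyKernel kernel) {
    functions_[function].push_back(std::move(kernel));
  }

  // "Dispatching": look up the function by name, then select the kernel whose
  // input type matches the argument type.
  const ToyKernel& Dispatch(const std::string& function, TypeId input) const {
    auto it = functions_.find(function);
    if (it == functions_.end()) {
      throw std::out_of_range("unknown function: " + function);
    }
    for (const ToyKernel& kernel : it->second) {
      if (kernel.input_type == input) return kernel;
    }
    throw std::invalid_argument("no kernel matches the input type");
  }

 private:
  std::map<std::string, std::vector<ToyKernel>> functions_;
};
```

Registering two kernels under `"negate"` and dispatching with `TypeId::kDouble` selects the double kernel. Dispatching an unregistered name raises `std::out_of_range`.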
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
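
The output-size rule can be sketched without Arrow as follows. This is a simplified model under stated assumptions: `AddBroadcast` is a hypothetical helper, and a vector of length one stands in for a Scalar input (so a genuine length-one array cannot be told apart from a scalar here).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical element-wise addition with scalar broadcasting. A vector of
// length one plays the role of a Scalar and is repeated to match the other
// input; otherwise both inputs must have the same length.
std::vector<double> AddBroadcast(const std::vector<double>& left,
                                 const std::vector<double>& right) {
  const bool left_is_scalar = left.size() == 1;
  const bool right_is_scalar = right.size() == 1;
  if (!left_is_scalar && !right_is_scalar && left.size() != right.size()) {
    throw std::invalid_argument("array inputs must have equal lengths");
  }
  // The output has the broadcasted length of the inputs.
  const std::size_t length = left_is_scalar ? right.size() : left.size();
  std::vector<double> out(length);
  for (std::size_t i = 0; i < length; ++i) {
    out[i] = left[left_is_scalar ? 0 : i] + right[right_is_scalar ? 0 : i];
  }
  return out;
}
```

Adding the array `{1, 2, 3}` to the scalar `{10}` yields an array of length three, the same size as the array input.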
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate the kernels' methods with C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
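
A minimal sketch of this pattern, assuming a much simpler kernel struct than Arrow's: `ToyScalarKernel`, `AddExec`, and `MakeAddKernel` are hypothetical names, and raw buffers stand in for Arrow arrays. One function template provides the `exec` method for every numeric type, and the kernel itself holds nothing but a function pointer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: a plain struct holding only a function pointer.
struct ToyScalarKernel {
  void (*exec)(const void* left, const void* right, void* out,
               std::size_t length);
};

// One function template covers all numeric types; each instantiation is the
// exec method of the kernel for that type.
template <typename T>
void AddExec(const void* left, const void* right, void* out,
             std::size_t length) {
  const T* l = static_cast<const T*>(left);
  const T* r = static_cast<const T*>(right);
  T* o = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) {
    o[i] = static_cast<T>(l[i] + r[i]);
  }
}

// Hypothetical kernel generator: stamps out the kernel for type T.
template <typename T>
ToyScalarKernel MakeAddKernel() {
  return ToyScalarKernel{&AddExec<T>};
}
```

`MakeAddKernel<double>()` and `MakeAddKernel<std::int32_t>()` then produce two kernels of the same "add" function that differ only in the instantiated type.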
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the layout of files and directories and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions, which invoke the corresponding operation in the [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` select between the default and `_checked` function names according to the overflow-checking option before invoking `CallFunction`, and therefore require two function names.
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
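
The idea of generating one `WithOverflow` helper per datatype from a macro can be sketched as follows. This is not Arrow's macro: `TOY_OP_WITH_OVERFLOW` is hypothetical, and widening to 64 bits stands in for the call into `portable_snippets`, so it only works for types narrower than 64 bits. The return convention (true means overflow) follows how `NegateWithOverflow` is used in the checked kernel shown earlier.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical generator macro: defines one overload of `func_name` for `type`.
// The operation is carried out in 64 bits and range-checked before narrowing.
// Returns true on overflow (leaving *out untouched), false on success.
#define TOY_OP_WITH_OVERFLOW(func_name, op, type)                             \
  inline bool func_name(type left, type right, type* out) {                   \
    const std::int64_t wide =                                                 \
        static_cast<std::int64_t>(left) op static_cast<std::int64_t>(right);  \
    if (wide < std::numeric_limits<type>::min() ||                            \
        wide > std::numeric_limits<type>::max()) {                            \
      return true;                                                            \
    }                                                                         \
    *out = static_cast<type>(wide);                                           \
    return false;                                                             \
  }

// One definition per supported datatype and operation.
TOY_OP_WITH_OVERFLOW(AddWithOverflow, +, std::int8_t)
TOY_OP_WITH_OVERFLOW(AddWithOverflow, +, std::int16_t)
TOY_OP_WITH_OVERFLOW(AddWithOverflow, +, std::int32_t)
TOY_OP_WITH_OVERFLOW(SubtractWithOverflow, -, std::int32_t)
```

A checked kernel would call such a helper and set its `Status*` argument when the helper returns true.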
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: the string form separates words with underscores and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.

Review comment:
       A description of NotNull and other variants may be helpful, though this could also go in the main user documentation since it has performance and correctness implications.







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776190580



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate the kernels' methods with C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation.
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred. Multi-word names are allowed: the registered string name separates words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of the compute function may be added in the future, is the current name specific enough to leave room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings; it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+2. Identify the input/output types and shapes.
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+3. Identify the compute function "kind" based on its operation and on step 2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts, or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators do not have the same form for every kernel "kind". For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If no suitable generator is available, you will need to define one for your particular case.
+
+2. Create the kernels by invoking the kernel generators.

Review comment:
       This is vague. Add a link to an example kernel.
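
One way to make this step concrete is a small self-contained sketch of the pattern the text describes: an operation struct with a templated `Call` method, and a generator that instantiates one exec function per input type. Everything named here (`AddOp`, `BinaryExec`, `MakeBinaryExec`) is hypothetical and uses `std::vector` in place of Arrow arrays; it is not the Arrow kernel generator API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical operation struct: one templated Call covers every numeric type.
struct AddOp {
  template <typename T>
  static T Call(T left, T right) {
    return left + right;
  }
};

// Hypothetical "exec" signature: element-wise binary operation over two arrays.
template <typename T>
using BinaryExec = void (*)(const std::vector<T>&, const std::vector<T>&,
                            std::vector<T>*);

// Hypothetical kernel generator: instantiates an exec function for (Op, T).
// The captureless lambda converts to a plain function pointer, mirroring the
// idea that a kernel is a struct of function pointers.
template <typename Op, typename T>
BinaryExec<T> MakeBinaryExec() {
  return [](const std::vector<T>& left, const std::vector<T>& right,
            std::vector<T>* out) {
    assert(left.size() == right.size());
    out->resize(left.size());
    for (std::size_t i = 0; i < left.size(); ++i) {
      (*out)[i] = Op::template Call<T>(left[i], right[i]);
    }
  };
}
```

Calling `MakeBinaryExec<AddOp, int32_t>()` and `MakeBinaryExec<AddOp, double>()` yields two distinct exec functions from the same template, which is the sense in which the kernels of a function "differ only in their input types".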




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkmgit commented on pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#issuecomment-1002398828


   One further consideration is interface design. This seems like it is still being stabilized but guiding principles would be helpful.





[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776182997



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each individual value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s template system to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general layout of files and directories and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.

Review comment:
       ```suggestion
   * `arrow/util/int_util_internal.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/util/int_util_internal.h>`_  - defines utility functions
       * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in the `"portable_snippets" <https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h>`_ library.
   ```
   Links to the files are helpful.
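
To illustrate what a `WithOverflow` style helper does, here is a minimal sketch written with standard C++ only. The name `AddWithOverflowSketch` and its convention (return `true` on overflow, write the sum through `out` otherwise) are assumptions made for this example; the helpers in `int_util_internal.h` are generated by macros on top of the vendored "portable_snippets" library and may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper, not Arrow's implementation: returns true if a + b would
// overflow T; otherwise stores the sum in *out and returns false.
template <typename T>
bool AddWithOverflowSketch(T a, T b, T* out) {
  static_assert(std::is_integral<T>::value, "integer types only");
  assert(out != nullptr);
  // Compare against the limits before adding, so the addition itself never
  // overflows (signed overflow would be undefined behavior).
  const bool overflow = (b > 0) ? (a > std::numeric_limits<T>::max() - b)
                                : (a < std::numeric_limits<T>::min() - b);
  if (overflow) {
    return true;
  }
  *out = static_cast<T>(a + b);
  return false;
}
```

A "checked" kernel could call such a helper and report an error when it returns `true`, while the default variant would skip the check.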







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776178816



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the

Review comment:
       ```suggestion
   data. Can generally process array or scalar values. The size of the
   ```
   Do not expect this needs to be capitalized. This is a very helpful explanation.







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776190788



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each individual value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s template system to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general layout of files and directories and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined by a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may overflow, the "checked" variant is a class suffixed with `Checked` that uses assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that have no valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is still provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if that reference is null, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation.
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred. Multi-word names are allowed: the registered string name separates words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of the compute function may be added in the future, is the current name specific enough to leave room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings; it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+2. Identify the input/output types and shapes.
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+3. Identify the compute function "kind" based on its operation and on step 2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts, or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators do not have the same form for every kernel "kind". For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If no suitable generator is available, you will need to define one for your particular case.
+
+2. Create the kernels by invoking the kernel generators.
+
+3. Register the kernels in the corresponding registry along with the function's `FunctionDoc`.

Review comment:
       A link would be helpful for this as well. Perhaps choose one kernel and use it to illustrate the points that are made.
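
As one possible illustration of the three registration steps, here is a minimal, self-contained sketch. It deliberately does not use Arrow's API: `ToyType`, `ToyDoc`, `ToyKernel`, `ToyFunction`, `ToyRegistry` and `MakeToyAdd` are hypothetical stand-ins invented for this example, and the real generators (such as `MakeArithmeticFunction`) take a different form.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; none of these are Arrow classes.
enum class ToyType { kInt32, kFloat64 };

// Plays the role of a FunctionDoc: summary, description, argument names.
struct ToyDoc {
  std::string summary;
  std::string description;
  std::vector<std::string> arg_names;
};

// A kernel is one implementation of the function for one input type.
struct ToyKernel {
  ToyType input_type;
  double (*exec)(double, double);
};

// A function owns its documentation and its kernels.
struct ToyFunction {
  std::string name;
  ToyDoc doc;
  std::vector<ToyKernel> kernels;

  // Dispatching: select the kernel that matches the input type.
  const ToyKernel& Dispatch(ToyType type) const {
    for (const ToyKernel& kernel : kernels) {
      if (kernel.input_type == type) return kernel;
    }
    throw std::invalid_argument("no kernel of " + name + " matches the input type");
  }
};

// A registry maps function names to functions.
class ToyRegistry {
 public:
  void AddFunction(ToyFunction fn) {
    std::string key = fn.name;  // copy the key before moving the function
    functions_.emplace(std::move(key), std::move(fn));
  }
  const ToyFunction& GetFunction(const std::string& name) const {
    return functions_.at(name);
  }

 private:
  std::map<std::string, ToyFunction> functions_;
};

// The integer kernel truncates its inputs; the floating-point one does not.
double AddAsInt32(double x, double y) {
  return static_cast<double>(static_cast<int>(x) + static_cast<int>(y));
}
double AddAsFloat64(double x, double y) { return x + y; }

// Steps 1 and 2: a hypothetical "kernel generator" that creates the kernels.
ToyFunction MakeToyAdd() {
  ToyFunction fn;
  fn.name = "toy_add";
  fn.doc = ToyDoc{"Add two values", "Adds x and y.", {"x", "y"}};
  fn.kernels.push_back(ToyKernel{ToyType::kInt32, &AddAsInt32});
  fn.kernels.push_back(ToyKernel{ToyType::kFloat64, &AddAsFloat64});
  return fn;
}
```

Step 3 is then a single call, `registry.AddFunction(MakeToyAdd())`, after which a caller looks the function up by name and dispatches on the input type.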







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776153789



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value individually.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory structure and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined via a class named after the compute function, containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false)` and the message "This is included only for the purposes of instantiability from the arithmetic kernel generator", and return zero.
+

Review comment:
       This would be helpful.
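
The unchecked/checked pairing described above could look roughly like the sketch below. This is an assumption-laden illustration, not the code in `scalar_arithmetic.cc`: `ToyStatus`, `ToyAdd` and `ToyAddChecked` are hypothetical names, the sketch covers integer types only, and the overflow test is written by hand instead of calling the `WithOverflow` helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <type_traits>

// Hypothetical minimal status type; Arrow's real Status class is richer.
struct ToyStatus {
  bool ok = true;
  std::string message;
};

// Unchecked variant: never reports an error.
struct ToyAdd {
  template <typename T>
  static T Call(T left, T right, ToyStatus*) {
    static_assert(std::is_integral<T>::value, "integer types only in this sketch");
    // Add in the unsigned domain so that overflow is not undefined behavior.
    using U = typename std::make_unsigned<T>::type;
    return static_cast<T>(static_cast<U>(left) + static_cast<U>(right));
  }
};

// Checked variant: on overflow, set the error flag and return zero.
struct ToyAddChecked {
  template <typename T>
  static T Call(T left, T right, ToyStatus* st) {
    static_assert(std::is_integral<T>::value, "integer types only in this sketch");
    const T max = std::numeric_limits<T>::max();
    const T min = std::numeric_limits<T>::min();
    const bool overflow = (right > 0 && left > max - right) ||
                          (right < 0 && left < min - right);
    if (overflow) {
      st->ok = false;
      st->message = "overflow";
      return 0;
    }
    return static_cast<T>(left + right);
  }
};
```

Both classes expose the same `Call` signature, which is what lets a single generator instantiate either variant for every supported type.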







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776185073



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value individually.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory structure and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.

Review comment:
       ```suggestion
   * `arrow/compute/api_scalar.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/api_scalar.cc>`_  - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776174122



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.

Review comment:
       ```suggestion
   An introduction to compute functions is provided in `compute documentation <https://arrow.apache.org/docs/cpp/compute.html>`_ .
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776181377



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value individually.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.

Review comment:
       ```suggestion
   Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernel methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
   ```
   Maybe a link to an online guide to templating in C++ as used in Arrow would be helpful.
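
In the absence of such a guide, the idea of generating kernel methods from one function template can be sketched as follows. This is a simplified illustration under stated assumptions: `ToyExec`, `ToyKernel`, `AddExec` and `MakeAddKernels` are hypothetical names, and the type-erased `void*` signature is only a stand-in for the richer argument types that real kernels receive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Every instantiation shares one type-erased signature, so all of them can
// be stored as plain function pointers inside the same kernel struct.
using ToyExec = void (*)(const void* left, const void* right, void* out,
                         std::size_t length);

struct ToyKernel {
  std::size_t value_width;  // bytes per value, used here to tell kernels apart
  ToyExec exec;
};

// One function template; each supported type gets its own instantiation.
template <typename T>
void AddExec(const void* left, const void* right, void* out, std::size_t length) {
  const T* l = static_cast<const T*>(left);
  const T* r = static_cast<const T*>(right);
  T* o = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) {
    o[i] = static_cast<T>(l[i] + r[i]);
  }
}

// A generator that expands a list of types into a list of kernels.
template <typename... Ts>
std::vector<ToyKernel> MakeAddKernels() {
  return {ToyKernel{sizeof(Ts), &AddExec<Ts>}...};
}
```

Calling `MakeAddKernels<std::int32_t, double>()` yields two kernels whose `exec` members point at `AddExec<std::int32_t>` and `AddExec<double>` respectively.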







[GitHub] [arrow] nirandaperera edited a comment on pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
nirandaperera edited a comment on pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#issuecomment-854164867


   This is a sort of confusion I had when I first started writing kernels. 
   _"A 'scalar' is a single (non-array) element! But how come "Scalar functions" accept and produce arrays?"_ 
   But now I understand, even though arrays are passed, the function is applied on each scalar in the array independently.
   Do you think this is something we'd want to explicitly discuss in the doc?
   Maybe use alternative terminology, such as "element-wise and vector functions"?
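
A small sketch may make the distinction concrete. It uses plain `std::vector` rather than Arrow arrays, and `ElementWiseNegate`, `AddBroadcast` and `SortIndices` are hypothetical names chosen for this example. In the first two functions each output value depends only on the input value at the same position, which is the sense of "Scalar" here; in the third, every output value depends on the whole input, which is the sense of "Vector".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// "Scalar" style: output[i] depends only on input[i].
std::vector<int> ElementWiseNegate(const std::vector<int>& values) {
  std::vector<int> out;
  out.reserve(values.size());
  for (int v : values) out.push_back(-v);
  return out;
}

// "Scalar" style with a broadcast operand: the single value is applied to
// every element, so the output has the length of the array input.
std::vector<int> AddBroadcast(const std::vector<int>& values, int scalar) {
  std::vector<int> out;
  out.reserve(values.size());
  for (int v : values) out.push_back(v + scalar);
  return out;
}

// "Vector" style: each output index depends on all input values.
std::vector<std::size_t> SortIndices(const std::vector<int>& values) {
  std::vector<std::size_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&values](std::size_t a, std::size_t b) {
                     return values[a] < values[b];
                   });
  return indices;
}
```

Changing one input value changes exactly one output of the first two functions, but it can change every output of `SortIndices`.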





[GitHub] [arrow] edponce commented on pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#issuecomment-1002779187


   @bkmgit Thanks for your reviews! I will get back to this PR and resolve them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776174414



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.

Review comment:
       ```suggestion
   The `compute submodule <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute>`_ contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776188625



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h

Review comment:
       ```suggestion
   * `arrow/compute/exec.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/exec.h>`_
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776206623



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
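
The terminology above (function, kernel, dispatching, function registry) can be illustrated with a minimal standalone sketch. Every name here (`TypeId`, `ToyKernel`, `ToyFunction`, `ToyRegistry`, `MakeToyRegistry`) is hypothetical and is not part of the Arrow API; the real classes carry far more information (arity, options, documentation, execution methods).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Arrow types.
enum class TypeId { kInt32, kFloat64 };

// A "kernel": one implementation bound to a specific input signature.
struct ToyKernel {
  std::vector<TypeId> signature;
  std::string label;  // identifies the implementation in this sketch
};

// A "function": a named operation holding one kernel per supported signature.
struct ToyFunction {
  std::string name;
  std::vector<ToyKernel> kernels;

  // "Dispatching": select the kernel whose signature matches the input types.
  const ToyKernel* Dispatch(const std::vector<TypeId>& types) const {
    for (const ToyKernel& kernel : kernels) {
      if (kernel.signature == types) return &kernel;
    }
    return nullptr;  // no viable kernel for these input types
  }
};

// A "function registry": looks functions up by their unique name.
class ToyRegistry {
 public:
  // Returns false if a function with the same name is already registered.
  bool AddFunction(ToyFunction fn) {
    assert(!fn.name.empty());
    std::string key = fn.name;
    return functions_.emplace(std::move(key), std::move(fn)).second;
  }

  const ToyFunction* GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, ToyFunction> functions_;
};

// Builds a registry holding one function, "add", with two kernels.
ToyRegistry MakeToyRegistry() {
  ToyRegistry registry;
  registry.AddFunction(ToyFunction{
      "add",
      {ToyKernel{{TypeId::kInt32, TypeId::kInt32}, "add_int32"},
       ToyKernel{{TypeId::kFloat64, TypeId::kFloat64}, "add_float64"}}});
  return registry;
}
```

Looking up `"add"` and dispatching on two `kFloat64` inputs selects the `add_float64` kernel, while a mixed signature finds no kernel in this sketch. Implicit casts that would make such a call succeed are outside its scope.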
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
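
The difference between the Scalar and Vector kinds can be sketched without any Arrow types. The helpers below (`AddArrays`, `AddArrayScalar`, `SortIndicesSketch`) are hypothetical and operate on `std::vector` for brevity; they only illustrate which inputs each output element depends on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar-kind behavior: out[i] depends only on the inputs at position i.
std::vector<int> AddArrays(const std::vector<int>& left,
                           const std::vector<int>& right) {
  assert(left.size() == right.size());
  std::vector<int> out(left.size());
  std::transform(left.begin(), left.end(), right.begin(), out.begin(),
                 [](int a, int b) { return a + b; });
  return out;
}

// Scalar-kind behavior with broadcasting: the single right-hand value is
// applied to every element, so the output has the length of the array input.
std::vector<int> AddArrayScalar(const std::vector<int>& left, int right) {
  std::vector<int> out(left.size());
  std::transform(left.begin(), left.end(), out.begin(),
                 [right](int a) { return a + right; });
  return out;
}

// Vector-kind behavior: each output element depends on the whole input.
// Returns the indices that would sort `values` in ascending order.
std::vector<std::size_t> SortIndicesSketch(const std::vector<int>& values) {
  std::vector<std::size_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&values](std::size_t a, std::size_t b) {
                     return values[a] < values[b];
                   });
  return indices;
}
```

Reordering the input of the first two helpers merely reorders their output, whereas the third cannot produce any element without inspecting the entire array.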
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
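
As a rough illustration of the two aggregate kinds, the following sketch reduces an array to a single value and, separately, to one value per group. `SumSketch` and `GroupedSumSketch` are hypothetical helpers, not Arrow functions, and null handling is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Scalar aggregate: the whole array is reduced to a single summary value.
std::int64_t SumSketch(const std::vector<std::int32_t>& values) {
  std::int64_t total = 0;
  for (std::int32_t v : values) total += v;
  return total;
}

// Hash aggregate: one summary value per group identifier. The i-th value
// contributes to the group named by the i-th entry of `group_ids`.
std::map<std::int32_t, std::int64_t> GroupedSumSketch(
    const std::vector<std::int32_t>& values,
    const std::vector<std::int32_t>& group_ids) {
  assert(values.size() == group_ids.size());
  std::map<std::int32_t, std::int64_t> totals;
  for (std::size_t i = 0; i < values.size(); ++i) {
    totals[group_ids[i]] += values[i];
  }
  return totals;
}
```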
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
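
The idea of a kernel as a struct of function pointers whose methods are stamped out from one template can be sketched as follows. `ToyKernel`, `AddExec`, `MakeAddKernel`, and `RunKernel` are hypothetical names with simplified signatures; the actual `ScalarKernel::init` and `ScalarKernel::exec` members take Arrow-specific context and batch arguments.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: a plain struct holding only function pointers.
struct ToyKernel {
  // Optional: prepare any state needed before execution (may be null).
  void (*init)();
  // Required: perform the computation over `length` elements.
  void (*exec)(const void* left, const void* right, void* out,
               std::size_t length);
};

// A single function template supplies `exec` for every numeric type.
template <typename T>
void AddExec(const void* left, const void* right, void* out,
             std::size_t length) {
  const T* l = static_cast<const T*>(left);
  const T* r = static_cast<const T*>(right);
  T* o = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) o[i] = l[i] + r[i];
}

// Generates the kernel for one input type by instantiating the template.
template <typename T>
ToyKernel MakeAddKernel() {
  return ToyKernel{nullptr, &AddExec<T>};
}

// Runs the stages of a kernel in order: optional init, then exec.
inline void RunKernel(const ToyKernel& kernel, const void* left,
                      const void* right, void* out, std::size_t length) {
  assert(kernel.exec != nullptr);
  if (kernel.init != nullptr) kernel.init();
  kernel.exec(left, right, out, length);
}
```

`MakeAddKernel<double>()` and `MakeAddKernel<int>()` yield two kernels that differ only in which instantiation of `AddExec` they point to.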
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
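
To show what a `WithOverflow`-style helper does conceptually, here is a standalone sketch that does not use `portable_snippets`. `AddWithOverflowSketch` is a hypothetical name; it assumes the convention seen in the kernels quoted later in this document, where the helper returns `true` when overflow occurred.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: returns true if a + b would overflow T. Otherwise
// stores the sum in *out and returns false. The range check is performed
// before adding, so no signed overflow ever takes place.
template <typename T>
bool AddWithOverflowSketch(T a, T b, T* out) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "sketch covers signed integer types only");
  assert(out != nullptr);
  const T max = std::numeric_limits<T>::max();
  const T min = std::numeric_limits<T>::min();
  if ((b > 0 && a > max - b) || (b < 0 && a < min - b)) {
    return true;  // overflow: *out is left untouched
  }
  *out = static_cast<T>(a + b);
  return false;
}
```

A checked kernel can then translate a `true` return value into an error status, while an unchecked kernel would use a wrapping operation instead.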
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred. Multi-word names are allowed: the string form joins words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+      T result = 0;
+      if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+        *st = Status::Invalid("overflow");
+      }
+      return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+```C++
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
+```
+
+Example of Unary String Kernel: ASCII Reverse
+=============================================
+
+1. Name
+    * String literal: "ascii_reverse"
+    * C++ function names: `AsciiReverse`
+1. Input/output types: String-like (Printable ASCII)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: String transform
+1. Arity: Unary
+
+
+Example of Binary Arithmetic Kernel: Hypotenuse of Right-angled Triangle
+========================================================================
+
+1. Name
+    * String literal: "hypotenuse"
+    * C++ function names: `Hypotenuse`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Binary (length of each leg)
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> Hypotenuse(const Datum& left, const Datum& right, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_BINARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked")
+```

Review comment:
       ```suggestion
   .. code-block:: cpp
   
       SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked")
   
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: the string version joins words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+```C++
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
+```
+
+Example of Unary String Kernel: ASCII Reverse
+=============================================
+
+1. Name
+    * String literal: "ascii_reverse"
+    * C++ function names: `AsciiReverse`
+1. Input/output types: String-like (Printable ASCII)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: String transform
+1. Arity: Unary
+
+
+Example of Binary Arithmetic Kernel: Hypotenuse of Right-angled Triangle
+========================================================================
+
+1. Name
+    * String literal: "hypotenuse"
+    * C++ function names: `Hypotenuse`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Binary (length of each leg)
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> Hypotenuse(const Datum& left, const Datum& right, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_BINARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked")
+```

Review comment:
       Use restructured text code block
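
Separately from the formatting point, the wrap-around versus checked behaviour that the quoted absolute-value section describes can be sketched in standalone C++. This is only an illustration under stated assumptions: `AbsWrapping` and `AbsChecked` are hypothetical names, not Arrow functions, and the sketch does not use `portable_snippets` or any Arrow header. It only mirrors the convention visible in the quoted kernel, where the overflow helper returns `true` when the result does not fit.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not Arrow API): absolute value that wraps on overflow.
// The negation is done in the unsigned domain, so the minimum value maps
// back to itself instead of triggering signed-overflow undefined behaviour.
// The conversion back to T is modular on mainstream compilers and is
// guaranteed to be modular since C++20.
template <typename T>
T AbsWrapping(T arg) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integer types only");
  using U = typename std::make_unsigned<T>::type;
  if (arg >= 0) return arg;
  return static_cast<T>(static_cast<U>(U{0} - static_cast<U>(arg)));
}

// Hypothetical helper (not Arrow API): absolute value with overflow detection.
// Returns true when the result cannot be represented (input is the minimum
// value of T); in that case *out is set to zero.
template <typename T>
bool AbsChecked(T arg, T* out) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integer types only");
  if (arg == std::numeric_limits<T>::min()) {
    *out = 0;
    return true;  // overflow: |min| is not representable in T
  }
  *out = (arg < 0) ? static_cast<T>(-arg) : arg;
  return false;
}
```

The only input that overflows is the minimum value of the signed type, which is why the default kernel can wrap silently while the `_checked` variant has to report an error for that single case.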







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776205136



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to use C++ templates to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long, descriptive names are preferred. Multi-word names are allowed: the string name separates words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+      T result = 0;
+      if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+        *st = Status::Invalid("overflow");
+      }
+      return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```

Review comment:
       ```suggestion
   .. code-block:: cpp
   
         auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
         auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
   
   ```
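
To make the contrast between the default and `_checked` variants in this example concrete, here is a minimal standalone sketch of wrapping versus checked absolute value for signed integers. It does not use Arrow: `MiniStatus`, `AbsWrapping`, and `AbsChecked` are hypothetical names introduced only for illustration, and the kernels in the PR rely on the helpers in `int_util_internal.h` instead.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <type_traits>

// Hypothetical stand-in for an error status; this is NOT arrow::Status.
struct MiniStatus {
  bool ok = true;
  std::string message;
};

// Default-style variant (hypothetical): overflow wraps around silently.
// Negation is done in unsigned arithmetic, where wraparound is well defined,
// so the most negative value maps back to itself.
template <typename T>
T AbsWrapping(T arg) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integers only");
  using U = typename std::make_unsigned<T>::type;
  return arg < 0 ? static_cast<T>(static_cast<U>(U{0} - static_cast<U>(arg)))
                 : arg;
}

// Checked-style variant (hypothetical): the most negative value has no
// representable absolute value, so report an error and return zero.
template <typename T>
T AbsChecked(T arg, MiniStatus* st) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integers only");
  if (arg == std::numeric_limits<T>::min()) {
    st->ok = false;
    st->message = "overflow";
    return 0;
  }
  return arg < 0 ? static_cast<T>(-arg) : arg;
}
```

With these definitions, `AbsWrapping<int8_t>(-128)` yields `-128` on common two's-complement platforms, while `AbsChecked<int8_t>(-128, &st)` returns zero and marks `st` as failed. This mirrors the behavior described for the two kernel variants, where the checked kernel returns zero and sets the error status on overflow.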







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776204194



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to use C++ templates to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long, descriptive names are preferred. Multi-word names are allowed: the string name separates words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```

Review comment:
       Use a reStructuredText code block (`.. code-block:: cpp`) here instead of a Markdown fence.







[GitHub] [arrow] nirandaperera commented on pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
nirandaperera commented on pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#issuecomment-854164867


   This is a sort of confusion I had when I first started writing kernels. 
   _"A 'scalar' is a single (non-array) element! But how come "Scalar functions" accept and produce arrays?"_ 
   But now I understand, even though arrays are passed, the function is applied on each scalar in the array independently.
   Do you think this is something we'd want to explicitly discuss in the doc?
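
To make the distinction concrete, here is a minimal, Arrow-free sketch of the idea: the operation itself is written for a single value, and the "scalar function" is the wrapper that applies it to every element of an array independently. The names `NullableInts`, `Negate`, and `ApplyElementWise` are hypothetical stand-ins chosen for illustration (they are not part of the Arrow API), and the sketch assumes C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable Arrow array of 32-bit integers.
using NullableInts = std::vector<std::optional<int>>;

// The per-element ("scalar") operation: it only ever sees one value.
inline int Negate(int value) { return -value; }

// A "scalar function" in the compute-layer sense: array in, array of the
// same length out, with the operation applied to each element on its own.
template <typename Op>
NullableInts ApplyElementWise(const NullableInts& input, Op op) {
  NullableInts output;
  output.reserve(input.size());
  for (const auto& element : input) {
    if (element.has_value()) {
      output.push_back(op(*element));
    } else {
      // Nulls propagate to the output; the operation never sees them.
      output.push_back(std::nullopt);
    }
  }
  return output;
}
```

The result for one element never depends on any other element, which is what separates this from a "vector" function such as a sort.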





[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776196308



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's API. These `structs` contain static *Call* methods representing the unique implementation for each argument signature. Apache Arrow relies on SFINAE and alias-template conditionals to generalize kernel implementations across argument types. Kernel implementations can also carry the *constexpr* specifier where applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general layout of files and directories and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific data types (e.g., negating an unsigned integer), the kernel for those types is still provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred. Multi-word names are allowed: the string name separates words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, tests for arithmetic kernels should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, json string)` - build an array of the given data type from a JSON string, for example `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")

Review comment:
       This was helpful for me. Perhaps it would be better to explain when such macros should be used and explain this particular macro so that these are used consistently within Arrow.
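
As a starting point for such an explanation, below is a self-contained sketch of the pattern the tutorial text describes: a macro that takes a C++ function name plus two registry names and generates a wrapper which picks the default or the `_checked` name before calling into the registry. Everything here is a hypothetical stand-in for illustration (`MockArithmeticOptions`, `MockRegistry`, `MockCallFunction`, `MOCK_SCALAR_ARITHMETIC_UNARY`), and selecting the name from a `check_overflow` flag is an assumption; the actual definition in `api_scalar.cc` should be consulted for the real behavior.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for ArithmeticOptions (assumed to carry a flag
// that selects the overflow-checking variant).
struct MockArithmeticOptions {
  bool check_overflow = false;
};

// Hypothetical stand-in for the function registry: name -> implementation.
using MockKernel = std::function<int(int)>;
inline std::map<std::string, MockKernel>& MockRegistry() {
  static std::map<std::string, MockKernel> registry;
  return registry;
}

// Stand-in for CallFunction: look the function up by name and run it.
inline int MockCallFunction(const std::string& name, int arg) {
  return MockRegistry().at(name)(arg);
}

// Generates NAME(arg, options), which dispatches to REGISTRY_CHECKED_NAME
// when overflow checking is requested and to REGISTRY_NAME otherwise.
#define MOCK_SCALAR_ARITHMETIC_UNARY(NAME, REGISTRY_NAME, REGISTRY_CHECKED_NAME) \
  inline int NAME(int arg,                                                       \
                  MockArithmeticOptions options = MockArithmeticOptions()) {     \
    const char* func_name =                                                      \
        options.check_overflow ? REGISTRY_CHECKED_NAME : REGISTRY_NAME;          \
    return MockCallFunction(func_name, arg);                                     \
  }

MOCK_SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value",
                             "absolute_value_checked")
```

The point of the pattern is that callers see a single C++ entry point per compute function, while the registry holds two separately named functions; a macro keeps that mapping uniform across all arithmetic functions of the same arity.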







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776201959



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general layout of files and directories and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and the input/output types and shapes from step 2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, tests for arithmetic kernels should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array from a JSON string, e.g. `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+

Review comment:
       ```suggestion
   .. code-block:: cpp
   
       struct AbsoluteValue {
         template <typename T, typename Arg>
         static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
           return (arg < static_cast<T>(0)) ? -arg : arg;
         }
   
         template <typename T, typename Arg>
         static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
           return arg;
         }
   
         template <typename T, typename Arg>
         static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
           return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
         }
       };
   
       struct AbsoluteValueChecked {
         template <typename T, typename Arg>
         static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
           static_assert(std::is_same<T, Arg>::value, "");
           if (arg < static_cast<T>(0)) {
             T result = 0;
             if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
               *st = Status::Invalid("overflow");
             }
             return result;
           }
           return arg;
         }
   
   ```
   Use a [reStructuredText code block](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#directive-code-block)
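
For context on why the signed-integer overloads quoted above differ between the default and checked variants, here is a minimal standalone sketch of the two behaviors. It does not use Arrow: `AbsWrapping` and `AbsChecked` are hypothetical names, a plain `bool*` stands in for the `Status*` argument, and two's-complement integers are assumed.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for the default (unchecked) variant.
// Negation is done in unsigned arithmetic so the minimum value wraps
// around instead of triggering signed-overflow undefined behavior.
template <typename T>
T AbsWrapping(T value) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integer types only");
  using U = typename std::make_unsigned<T>::type;
  return value < 0
             ? static_cast<T>(static_cast<U>(static_cast<U>(0) - static_cast<U>(value)))
             : value;
}

// Hypothetical stand-in for the checked variant.
// On overflow it sets *overflowed and returns zero, mirroring the
// "return zero and set the error flag" convention described in the doc.
template <typename T>
T AbsChecked(T value, bool* overflowed) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integer types only");
  if (value < 0) {
    if (value == std::numeric_limits<T>::min()) {
      *overflowed = true;  // |min| is not representable in T
      return 0;
    }
    return static_cast<T>(-value);
  }
  return value;
}
```

The only input that separates the two is the minimum value of the type: the wrapping version maps it back to itself, while the checked version reports it. Unsigned and floating-point inputs have no such case, which is why their overloads are identical in both structs.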







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776204039



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate kernels' methods from C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and the input/output types and shapes from step 2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, tests for arithmetic kernels should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array from a JSON string, e.g. `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```

Review comment:
       ```suggestion
   .. code-block:: cpp
   
       const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
                                    ("Results will wrap around on integer overflow.\n"
                                     "Use function \"absolute_value_checked\" if you want overflow\n"
                                     "to return an error."),
                                    {"x"}};
   
       const FunctionDoc absolute_value_checked_doc{
           "Calculate the absolute value of the argument element-wise",
           ("This function returns an error on overflow.  For a variant that\n"
            "doesn't fail on overflow, use function \"absolute_value\"."),
           {"x"}};
       ```
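
The difference between the wrap-around and the checked variant described in these docstrings can be sketched without any Arrow dependency. This is only an illustration: `SketchStatus`, `AbsoluteValueWrapping`, and `AbsoluteValueChecked` are hypothetical names standing in for `Status` and the kernel `Call` methods, and the overflow test is written by hand rather than through `NegateWithOverflow`. Returning zero on overflow follows the description of checked kernels in the document under review.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <type_traits>

// Hypothetical stand-in for arrow::Status; only what this sketch needs.
struct SketchStatus {
  bool ok = true;
  std::string message;
};

// Default variant: wraps around on overflow. The negation is performed on the
// unsigned counterpart of T, so the minimum value maps back to itself instead
// of triggering signed-overflow undefined behavior. The conversion back to T
// is modular on mainstream compilers (and guaranteed to be since C++20).
template <typename T>
T AbsoluteValueWrapping(T arg) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "sketch covers signed integers only");
  using U = typename std::make_unsigned<T>::type;
  if (arg < 0) {
    return static_cast<T>(static_cast<U>(U{0} - static_cast<U>(arg)));
  }
  return arg;
}

// Checked variant: the only signed input whose absolute value does not fit in
// T is the minimum value, so that case is reported through the status object.
template <typename T>
T AbsoluteValueChecked(T arg, SketchStatus* st) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "sketch covers signed integers only");
  if (arg == std::numeric_limits<T>::min()) {
    st->ok = false;
    st->message = "overflow";
    return 0;
  }
  return (arg < 0) ? static_cast<T>(-arg) : arg;
}

// Small usage example; call it from a test or from main().
inline void SketchSelfCheck() {
  SketchStatus st;
  assert(AbsoluteValueChecked<int32_t>(-7, &st) == 7);
  assert(st.ok);
}
```

With `int8_t`, the wrapping variant maps -128 to -128, while the checked variant returns 0 and records the error.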
   







[GitHub] [arrow] bkietz commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r635520473



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate

Review comment:
       In addition, please ensure that all your links are in the RST format. For example, to create a link to the doxygen doc for a specific class member, use:
   ```rst
    :member:`ScalarKernel::exec`
   ```
   
   To create a link to a specific source file on the `master` branch, use:
   ```rst
   `The scalar API header <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h>`__
   ```







[GitHub] [arrow] github-actions[bot] commented on pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#issuecomment-839230431


   https://issues.apache.org/jira/browse/ARROW-12724





[GitHub] [arrow] nirandaperera commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
nirandaperera commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r645107090



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+

Review comment:
       Shall we add the comment by @pitrou from Zulip here?
   ```
   Simple questions for whether a function is a scalar function:
   - Do all inputs have the same (broadcasted) length?
   - Does the Nth element in the output only depend on the Nth element of each input?
   ```
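
To make the two questions concrete, here is a small standalone sketch in plain C++ (no Arrow types). The function names are hypothetical and chosen only for illustration: `AddElementwise` and `AddBroadcast` satisfy both criteria, whereas `CumulativeSum` fails the second one because element N of its output depends on elements 0..N of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar-style: both inputs have the same length, and out[i] depends only on
// left[i] and right[i].
std::vector<int> AddElementwise(const std::vector<int>& left,
                                const std::vector<int>& right) {
  assert(left.size() == right.size());
  std::vector<int> out(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    out[i] = left[i] + right[i];
  }
  return out;
}

// Scalar-style with broadcasting: the single value is conceptually repeated
// to the length of the array input before the element-wise operation.
std::vector<int> AddBroadcast(const std::vector<int>& values, int scalar) {
  return AddElementwise(values, std::vector<int>(values.size(), scalar));
}

// Not scalar-style: out[i] depends on values[0..i], so the result changes if
// the input is reordered even though the output length matches the input.
std::vector<int> CumulativeSum(const std::vector<int>& values) {
  std::vector<int> out(values.size());
  int running = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    running += values[i];
    out[i] = running;
  }
  return out;
}
```

A function shaped like `CumulativeSum` would therefore belong with the "Vector" kind rather than the "Scalar" kind.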

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` named after the compute function's API. Each `struct` contains static *Call* methods, one per supported argument signature. Arrow relies on SFINAE and alias-template conditionals (for example, `enable_if_signed_integer`) to generalize kernel implementations over argument types. Kernel implementations can also carry the *constexpr* specifier where applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred. Multi-word names are allowed: the string form of a name separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+

Review comment:
       Shall we also add untyped test fixtures (`TEST_F`s)?

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
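The output-size rule for Scalar functions (same length as the inputs, with scalars broadcast against arrays) can be illustrated with a minimal standalone sketch. `ToyDatum`, `MakeArray`, `MakeScalar`, and `ToyAdd` are hypothetical names for this example only; Arrow's actual `Datum` and execution machinery work differently, and nulls are ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A toy "datum": either an array (many values) or a scalar (one value).
struct ToyDatum {
  std::vector<int> values;
  bool is_scalar;
};

ToyDatum MakeArray(std::vector<int> values) { return {std::move(values), false}; }
ToyDatum MakeScalar(int value) { return {{value}, true}; }

// Element-wise addition. A scalar input is broadcast to the array length;
// two scalars produce a scalar.
ToyDatum ToyAdd(const ToyDatum& left, const ToyDatum& right) {
  if (left.is_scalar && right.is_scalar) {
    return MakeScalar(left.values[0] + right.values[0]);
  }
  const std::size_t length =
      left.is_scalar ? right.values.size() : left.values.size();
  // Array inputs must agree in length.
  assert(left.is_scalar || left.values.size() == length);
  assert(right.is_scalar || right.values.size() == length);

  std::vector<int> out(length);
  for (std::size_t i = 0; i < length; ++i) {
    const int l = left.is_scalar ? left.values[0] : left.values[i];
    const int r = right.is_scalar ? right.values[0] : right.values[i];
    out[i] = l + r;
  }
  return MakeArray(std::move(out));
}
```

Adding the array `{1, 2, 3}` to the scalar `10` yields an array of length three, while adding two scalars yields a scalar.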
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each value independently.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
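A rough sketch of the idea that a kernel is a plain struct of function pointers whose bodies come from one function template. The names `ToyExec`, `ToyScalarKernel`, `AddExec`, and `MakeAddKernel` are assumptions made for this example; the real `ScalarKernel::exec` has a different signature and operates on Arrow data structures rather than raw pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Type-erased signature shared by all instantiations (hypothetical).
using ToyExec = void (*)(const void* left, const void* right, void* out,
                         std::size_t length);

// A kernel: only a function pointer plus an attribute flag.
struct ToyScalarKernel {
  ToyExec exec;             // performs the computation
  bool can_write_in_place;  // example attribute flag (hypothetical)
};

// One function template supplies the body of exec for every numeric type.
// No overflow checking is done here.
template <typename T>
void AddExec(const void* left, const void* right, void* out, std::size_t length) {
  assert(length == 0 || (left != nullptr && right != nullptr && out != nullptr));
  const T* l = static_cast<const T*>(left);
  const T* r = static_cast<const T*>(right);
  T* o = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) o[i] = l[i] + r[i];
}

// A small "kernel generator": each instantiation yields a kernel for type T.
template <typename T>
ToyScalarKernel MakeAddKernel() {
  return ToyScalarKernel{&AddExec<T>, true};
}
```

Calling `MakeAddKernel<std::int32_t>()` and `MakeAddKernel<double>()` produces two kernels that share a single source definition, which is the property the paragraph above describes.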
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+==========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions, which invoke the corresponding operation in the [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific data types (e.g., negating an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
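The "checked" behavior described above (return zero and set the status on overflow) can be sketched as follows. This is a standalone illustration, not the code in `scalar_arithmetic.cc`: `ToyStatus`, `ToyAddWithOverflow`, and `ToyAddChecked` are hypothetical names, and the overflow test is written by hand instead of calling the vendored safe-math routines.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Toy status object; Arrow's real Status class is richer than this.
struct ToyStatus {
  bool ok = true;
  std::string message;
};

// Returns true when a + b would overflow; otherwise stores the sum in *out.
bool ToyAddWithOverflow(std::int32_t a, std::int32_t b, std::int32_t* out) {
  const std::int32_t max = std::numeric_limits<std::int32_t>::max();
  const std::int32_t min = std::numeric_limits<std::int32_t>::min();
  // Test the bounds before adding, so the addition itself never overflows.
  if ((b > 0 && a > max - b) || (b < 0 && a < min - b)) return true;
  *out = a + b;
  return false;
}

// "Checked" variant: a class whose Call method reports overflow through the
// status pointer and returns zero in that case.
struct ToyAddChecked {
  static std::int32_t Call(std::int32_t a, std::int32_t b, ToyStatus* st) {
    assert(st != nullptr);
    std::int32_t result = 0;
    if (ToyAddWithOverflow(a, b, &result)) {
      st->ok = false;
      st->message = "overflow";
      return 0;
    }
    return result;
  }
};
```

The design keeps the overflow test separate from the kernel class so the same helper could serve several checked operations.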
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is null, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: the string name of a function separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
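To make the naming guidance in step 1 concrete, here is a small standalone sketch of why a single `str_length` name would be ambiguous: the byte length and the code-point length of the same string can differ. The function names `ByteLength` and `Utf8CodepointLength` are hypothetical and the code does not use Arrow; it also assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes; this equals the character count only for ASCII data.
std::size_t ByteLength(const std::string& s) { return s.size(); }

// Length in Unicode code points, assuming `s` holds valid UTF-8:
// count every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t Utf8CodepointLength(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char byte : s) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For a string containing a two-byte character the two functions disagree, which is why specific names such as `ascii_length` and `utf8_length` leave less room for confusion.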
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======

Review comment:
       I think it would be nice to also discuss some of the helper methods in `codegen_internal.h`.

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class named after the compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific data types (e.g., negating an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+

Review comment:
       I think we should also discuss the guarantees provided by the `compute` infrastructure.
   For example, for scalar functions that receive multiple arrays, the compute infrastructure checks for nullity, guarantees that the arrays are of the same size, etc.







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776180954



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.

Review comment:
       ```suggestion
   Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776178425



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.

Review comment:
       ```suggestion
   * Compute functions (see `FunctionImpl and subclasses <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h>`_) contain `"kernels" <https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels>`_ which are implementations for specific argument signatures.
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776184318



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s template system to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains

Review comment:
       ```suggestion
   * `arrow/compute/api_scalar.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/api_scalar.h>`_  - contains
   ```
   Consistent abbreviated path names make readability easier.







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776205252



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s template system to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```

Review comment:
       Use restructured text code block







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776205536



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+```C++
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
+```

Review comment:
       ```suggestion
   .. code-block:: cpp
   
         DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
         DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
   
   ```
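
For readers less familiar with the register-then-look-up pattern behind the `AddFunction` calls quoted above, here is a minimal stand-alone sketch. It assumes nothing about Arrow's real classes: `ToyRegistry` and its two methods are hypothetical names invented for illustration, and the stored callable is a plain `std::function` rather than an Arrow `Function` object.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy registry; NOT arrow::compute::FunctionRegistry.
// It only illustrates "register under a unique name, then look up by name".
class ToyRegistry {
 public:
  using Fn = std::function<long long(long long)>;

  // Returns false when the name is already taken, so names stay unique.
  bool AddFunction(const std::string& name, Fn fn) {
    return functions_.emplace(name, std::move(fn)).second;
  }

  // Returns a pointer to the registered callable, or nullptr if unknown.
  const Fn* GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, Fn> functions_;
};
```

In this toy version a duplicate name is rejected rather than overwritten, which is one plausible way to enforce the "unique name" attribute the document describes.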







[GitHub] [arrow] github-actions[bot] commented on pull request #10296: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#issuecomment-839197294


   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   





[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776187613



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).

Review comment:
       [Underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow) may also be problematic.
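
To make the concern concrete, here is a minimal standalone sketch (not Arrow code) of the kind of check a `_checked` variant would need for signed addition. `AddWouldOverflow` is a hypothetical name chosen for illustration. It reads "underflow" loosely as a signed result falling below the minimum representable value; the floating-point sense described in the linked article is not covered.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper, not Arrow's API: returns true when a + b does not fit
// in the signed integer type T. Otherwise stores the sum in *out and returns
// false.
template <typename T>
bool AddWouldOverflow(T a, T b, T* out) {
  using Limits = std::numeric_limits<T>;
  static_assert(Limits::is_integer && Limits::is_signed,
                "this sketch handles signed integers only");
  if (b > 0 && a > Limits::max() - b) return true;  // sum would exceed max()
  if (b < 0 && a < Limits::min() - b) return true;  // sum would fall below min()
  *out = static_cast<T>(a + b);
  return false;
}
```

Both bounds are tested before the addition is performed, so the sketch never evaluates a signed sum that would be undefined behavior.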




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776179235



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**

Review comment:
       ```suggestion
   **Categories of Scalar Functions**
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776178898



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.

Review comment:
       ```suggestion
   of mixing array and scalar inputs) of the input.
   ```
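
For the paragraph under review, "broadcasted size" may be easier to follow with a small example. The sketch below is standalone and does not use Arrow types: `Input` and `AddElementWise` are hypothetical names, and a one-element vector flagged `is_scalar` stands in for a scalar input.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a compute input, not Arrow's Datum. A scalar is
// modeled as a single value with is_scalar set to true.
struct Input {
  std::vector<double> values;
  bool is_scalar;
};

// Element-wise addition. A scalar input is repeated ("broadcast") to the
// length of the array input, so the output has the array's length.
std::vector<double> AddElementWise(const Input& left, const Input& right) {
  if (left.is_scalar && left.values.size() != 1) {
    throw std::invalid_argument("a scalar must hold exactly one value");
  }
  if (right.is_scalar && right.values.size() != 1) {
    throw std::invalid_argument("a scalar must hold exactly one value");
  }
  if (!left.is_scalar && !right.is_scalar &&
      left.values.size() != right.values.size()) {
    throw std::invalid_argument("array inputs must have equal length");
  }
  const std::size_t length =
      left.is_scalar ? right.values.size() : left.values.size();
  std::vector<double> out(length);
  for (std::size_t i = 0; i < length; ++i) {
    const double l = left.values[left.is_scalar ? 0 : i];
    const double r = right.values[right.is_scalar ? 0 : i];
    out[i] = l + r;
  }
  return out;
}
```

With an array of three values and a scalar, the output has three values. With two scalars, the output has one.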







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776177694



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings

Review comment:
       ```suggestion
   * A unique :doc:`name <.compute>`  used for function invocation and language bindings
   ```
   Adding a section heading to the appropriate part of the compute documentation will give a better link. See the [reStructuredText documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#cross-referencing-arbitrary-locations).
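
Since the bullet under review says the name is what callers use to invoke a function, a toy model of name lookup followed by kernel dispatch may help readers of the terminology list. Everything below is a simplified, hypothetical sketch: `ToyKernel`, `ToyFunction`, `ToyRegistry`, and `TypeId` are illustration-only names, and Arrow's real `Function`, `Kernel`, and `FunctionRegistry` classes are considerably richer.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type tags for this sketch only.
enum class TypeId { kInt64, kDouble };

// A kernel: one implementation for one argument signature.
struct ToyKernel {
  TypeId input_type;
  std::function<double(double)> exec;
};

// A function: a general operation that owns several kernels.
struct ToyFunction {
  std::vector<ToyKernel> kernels;

  // "Dispatching": select the kernel whose signature matches the input type.
  const ToyKernel& Dispatch(TypeId type) const {
    for (const ToyKernel& kernel : kernels) {
      if (kernel.input_type == type) return kernel;
    }
    throw std::invalid_argument("no kernel for this input type");
  }
};

// A registry: functions keyed by their unique name.
class ToyRegistry {
 public:
  // Returns false if the name is already taken, since the name is the key.
  bool Add(const std::string& name, ToyFunction function) {
    return functions_.emplace(name, std::move(function)).second;
  }

  const ToyFunction& Get(const std::string& name) const {
    auto it = functions_.find(name);
    if (it == functions_.end()) {
      throw std::out_of_range("unknown function: " + name);
    }
    return it->second;
  }

 private:
  std::map<std::string, ToyFunction> functions_;
};
```

In this model a call resolves in two steps: `Get("negate")` finds the function by name, then `Dispatch(TypeId::kDouble)` picks the kernel for the input type. Registering a second function under an existing name is rejected.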







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776202733



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to use C++ templates to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` support "safe math" for arithmetic kernels. Helper macros are included to create the definitions, which invoke the corresponding operation in the [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined via a class named after the compute function, containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is null, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. The long form of names is preferred. Multi-word names are allowed: the string name joins words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and category.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array from a JSON string, e.g., `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add the compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro, which requires two function names (with and without the "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+      T result = 0;
+      if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+        *st = Status::Invalid("overflow");
+      }
+      return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```

Review comment:
       Continuing formatting to a restructured text code block.
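
       As a side note for readers of this thread, the overflow case that the two kernel variants above treat differently can be reproduced without Arrow. The following is a minimal standalone sketch, not the PR's code: `WrappingNegate`, `NegateReportingOverflow`, `AbsWrapping`, and `AbsChecked` are hypothetical names that only mimic the roles of `SafeSignedNegate` and `NegateWithOverflow`, and a plain `bool` stands in for `Status`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for a wrapping negate: go through the unsigned type so
// the minimum value wraps around instead of invoking undefined behaviour.
// (The final conversion is modular on all mainstream compilers and is
// guaranteed to be so since C++20.)
template <typename T>
constexpr T WrappingNegate(T value) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integers only");
  using U = typename std::make_unsigned<T>::type;
  return static_cast<T>(U{0} - static_cast<U>(value));
}

// Hypothetical stand-in for a "WithOverflow" helper: returns true when the
// negation cannot be represented, and leaves zero in *out in that case.
template <typename T>
bool NegateReportingOverflow(T value, T* out) {
  assert(out != nullptr);
  if (value == std::numeric_limits<T>::min()) {
    *out = 0;
    return true;
  }
  *out = static_cast<T>(-value);
  return false;
}

// Default variant: never fails, wraps around for the minimum value.
template <typename T>
constexpr T AbsWrapping(T value) {
  return value < 0 ? WrappingNegate(value) : value;
}

// Checked variant: reports overflow through a flag (a bool here, not a Status).
template <typename T>
T AbsChecked(T value, bool* overflow) {
  T result = value;
  *overflow = value < 0 && NegateReportingOverflow(value, &result);
  return result;
}
```

       With these definitions, `AbsWrapping<std::int8_t>(-128)` stays at `-128`, while `AbsChecked` returns zero and raises the flag for the same input.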







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776208010



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory structure and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined as a class that carries the literal name of the compute function and contains methods named `Call`, which are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is still provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: string names separate words with underscores, and C++ function names use camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and the input/output types and shapes identified in the previous step.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation that includes a summary, a description, and the argument names of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that compute functions that can overflow require a second `FunctionDoc` for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types of the new compute function. Kernel generators do not share a common form across kernel `kinds`. For example, for the "Scalar Arithmetic" kernels, the registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If no suitable generator exists, you will need to define one for your particular case.
+
+1. Create the kernels by invoking the kernel generators.

Review comment:
       Maybe indicate that example code links are provided in a later section.
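
       Until such links are in place, the idea of "invoking a kernel generator" can be illustrated with a toy that does not depend on Arrow. This is only a sketch under stated assumptions: `ToyType`, `ToyKernel`, `Negate`, `ExecUnary`, `MakeUnaryKernels`, and `Dispatch` are hypothetical names, and the real generators in `scalar_arithmetic.cc` have different signatures. It shows one operation struct with a templated `Call` being stamped out into one function-pointer kernel per input type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy type tag (hypothetical; Arrow uses DataType objects instead).
enum class ToyType { kInt32, kDouble };

// A kernel is a plain struct holding a function pointer plus metadata.
struct ToyKernel {
  ToyType input_type;
  // Type-erased exec: reads `length` values from `in` and writes them to `out`.
  void (*exec)(const void* in, void* out, std::size_t length);
};

// The operation itself, written once and parameterized on the value type.
struct Negate {
  template <typename T>
  static T Call(T arg) {
    return -arg;
  }
};

// Element-wise loop shared by every unary operation and value type.
template <typename Op, typename T>
void ExecUnary(const void* in, void* out, std::size_t length) {
  assert(in != nullptr && out != nullptr);
  const T* src = static_cast<const T*>(in);
  T* dst = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) {
    dst[i] = Op::template Call<T>(src[i]);
  }
}

// The "kernel generator": instantiates one kernel per supported type.
template <typename Op>
std::vector<ToyKernel> MakeUnaryKernels() {
  return {{ToyType::kInt32, &ExecUnary<Op, std::int32_t>},
          {ToyType::kDouble, &ExecUnary<Op, double>}};
}

// Dispatch: pick the kernel whose input type matches, or nullptr if none does.
const ToyKernel* Dispatch(const std::vector<ToyKernel>& kernels, ToyType type) {
  for (const ToyKernel& kernel : kernels) {
    if (kernel.input_type == type) return &kernel;
  }
  return nullptr;
}
```

       Supporting another input type in this toy means adding one line to `MakeUnaryKernels`; the operation struct itself does not change.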

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.

Review comment:
       Maybe indicate that example code links are provided in a later section.
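
The registration steps in the quoted section (check the generators, create the kernels, register them with a documentation object) can be approximated with a toy, self-contained sketch. It deliberately avoids Arrow's headers: `ToyFunctionDoc`, `ToyKernel`, `ToyFunction`, `ToyRegistry`, `MakeNegateKernel`, and `RegisterToyNegate` are hypothetical names invented for illustration. They are not Arrow's `FunctionDoc`, kernel generators, or function registry, and the real signatures may differ.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins, NOT Arrow's API: every name below is hypothetical.
struct ToyFunctionDoc {
  std::string summary;
  std::string description;
  std::vector<std::string> arg_names;
};

struct ToyKernel {
  std::string input_type;              // e.g. "int32" or "float64"
  std::function<double(double)> exec;  // unary operation on one value
};

struct ToyFunction {
  ToyFunctionDoc doc;
  std::vector<ToyKernel> kernels;
};

class ToyRegistry {
 public:
  // Returns false (and changes nothing) on a name collision.
  bool AddFunction(const std::string& name, ToyFunction function) {
    return functions_.emplace(name, std::move(function)).second;
  }

  // Returns nullptr when no function with that name is registered.
  const ToyFunction* GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, ToyFunction> functions_;
};

// A "kernel generator": produces one kernel for the given input type.
ToyKernel MakeNegateKernel(const std::string& input_type) {
  return ToyKernel{input_type, [](double value) { return -value; }};
}

// Creates the kernels and registers them together with the documentation.
bool RegisterToyNegate(ToyRegistry* registry) {
  assert(registry != nullptr);
  ToyFunction function;
  function.doc = ToyFunctionDoc{"Negate the argument element-wise",
                                "Toy example; overflow is not checked.",
                                {"x"}};
  for (const char* type : {"int32", "float64"}) {
    function.kernels.push_back(MakeNegateKernel(type));
  }
  return registry->AddFunction("toy_negate", std::move(function));
}
```

Calling `RegisterToyNegate` a second time on the same registry returns `false`, which mirrors the name-collision concern raised in the naming step.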







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776172015



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.

Review comment:
       ```suggestion
   * The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of input data types or function behavior options.
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776180229



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.

Review comment:
       ```suggestion
   A function with array input and output whose behavior depends on combinations of
   values at different locations in the input arrays, rather than the independent computations
   on scalar values at the same location in the input arrays.
   ```
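
The distinction drawn in this suggestion can be illustrated with a small sketch that does not depend on Arrow. `ElementwiseAdd` and `CumulativeSum` are hypothetical helper names chosen for this example, operating on `std::vector` rather than Arrow arrays. The first behaves like a "Scalar" function and the second like a "Vector" function in the sense described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar-style: out[i] depends only on the inputs at position i.
std::vector<std::int64_t> ElementwiseAdd(
    const std::vector<std::int64_t>& left,
    const std::vector<std::int64_t>& right) {
  assert(left.size() == right.size());
  std::vector<std::int64_t> out(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    out[i] = left[i] + right[i];
  }
  return out;
}

// Vector-style: out[i] combines values at several positions (here 0..i),
// so reordering the input changes the result.
std::vector<std::int64_t> CumulativeSum(
    const std::vector<std::int64_t>& values) {
  std::vector<std::int64_t> out;
  out.reserve(values.size());
  std::int64_t running = 0;
  for (std::int64_t value : values) {
    running += value;
    out.push_back(running);
  }
  return out;
}
```

Reordering the inputs of `ElementwiseAdd` only reorders its output, whereas `CumulativeSum` of `{1, 2, 3}` and of `{3, 2, 1}` produce different sequences, which is the order dependence the suggestion refers to.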







[GitHub] [arrow] pitrou commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r630822980



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.

Review comment:
       Can you wrap long lines?

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.

Review comment:
       Also, it seems you're using Markdown syntax. These documents should be authored using [reStructuredText](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html).







[GitHub] [arrow] 9prady9 commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
9prady9 commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r816467140



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.

Review comment:
       ```suggestion
   * A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must ensure a kernel corresponding to the value types of the inputs is selected.
   ```
   or
   ```suggestion
   * A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must select a kernel corresponding to the value types of the inputs.
   ```
   ?
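
Either wording describes the same mechanism: match the value types of the inputs against each kernel's signature and take the first viable one. A minimal sketch follows, assuming exact type matching only. `ToyType`, `ToyKernel`, `DispatchExact`, and `MakeToyAddKernels` are hypothetical names, not Arrow's dispatch API, and implicit casts or other matching rules are not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of dispatching; names are hypothetical, not Arrow's API.
enum class ToyType { kInt32, kFloat64, kUtf8 };

struct ToyKernel {
  std::vector<ToyType> input_types;  // the signature this kernel accepts
  std::string label;                 // stands in for the kernel's exec method
};

// Returns the first kernel whose signature exactly matches the value types
// of the inputs, or nullptr when no kernel is viable.
const ToyKernel* DispatchExact(const std::vector<ToyKernel>& kernels,
                               const std::vector<ToyType>& input_types) {
  for (const ToyKernel& kernel : kernels) {
    if (kernel.input_types == input_types) {
      return &kernel;
    }
  }
  return nullptr;
}

// Kernels of a toy binary "add" function: one per supported numeric type.
std::vector<ToyKernel> MakeToyAddKernels() {
  return {
      {{ToyType::kInt32, ToyType::kInt32}, "add_int32"},
      {{ToyType::kFloat64, ToyType::kFloat64}, "add_float64"},
  };
}
```

With `kernels` holding the result of `MakeToyAddKernels()`, dispatching on two `kInt32` inputs selects the `add_int32` kernel, while a mixed `kInt32`/`kUtf8` signature returns `nullptr`. The returned pointer refers into `kernels`, so that vector must outlive it.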

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments

Review comment:
       ```suggestion
   * An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) states the number of required arguments
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use

Review comment:
       ```suggestion
     indicates in what context it is valid for use
   ```

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the computer layer
+==========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.

Review comment:
       ```suggestion
   This section describes the general structure of files/directory and principal code structures of the compute layer using scalar function <pick something>.
   ```
   followed by bullet points on how this function exists in the code base. The raw information is already present in the current list of points, but it feels like a flow of information is missing.
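
To make the "function", "kernel", "dispatching" and "function registry" terminology from the quoted document concrete, here is a minimal self-contained sketch of that flow. It is a toy model under stated assumptions, not Arrow's implementation: `ToyType`, `ToyKernel`, `ToyFunction`, `ToyRegistry` and `MakeToyRegistry` are hypothetical names, and the kernels operate on plain `double` values rather than Arrow arrays.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model; none of these names are Arrow classes.
enum class ToyType { kInt32, kFloat64 };

// A "kernel": a plain struct holding a function pointer and the
// argument signature that the implementation accepts.
struct ToyKernel {
  std::vector<ToyType> signature;
  double (*exec)(double, double);
};

// A "function": a named operation that owns several kernels.
struct ToyFunction {
  std::string name;
  std::vector<ToyKernel> kernels;

  // "Dispatching": select the kernel whose signature matches the
  // input types exactly, or return nullptr if there is none.
  const ToyKernel* DispatchExact(const std::vector<ToyType>& types) const {
    for (const auto& kernel : kernels) {
      if (kernel.signature == types) return &kernel;
    }
    return nullptr;
  }
};

// A "function registry": look up functions by name.
class ToyRegistry {
 public:
  void AddFunction(ToyFunction fn) {
    std::string key = fn.name;
    functions_[key] = std::move(fn);
  }

  const ToyFunction* GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, ToyFunction> functions_;
};

// Two implementations of the same operation for different input types.
inline double AddInt32(double a, double b) {
  return static_cast<double>(static_cast<std::int32_t>(a) +
                             static_cast<std::int32_t>(b));
}
inline double AddFloat64(double a, double b) { return a + b; }

// Registers one "add" function with one kernel per supported signature.
inline ToyRegistry MakeToyRegistry() {
  ToyFunction add{"add", {}};
  add.kernels.push_back(
      ToyKernel{{ToyType::kInt32, ToyType::kInt32}, &AddInt32});
  add.kernels.push_back(
      ToyKernel{{ToyType::kFloat64, ToyType::kFloat64}, &AddFloat64});
  ToyRegistry registry;
  registry.AddFunction(std::move(add));
  return registry;
}
```

A caller would look up `"add"` in the registry, call `DispatchExact` with the input types, and invoke `exec` on the returned kernel. A mixed signature such as `{kInt32, kFloat64}` yields `nullptr` in this sketch because no implicit casting is modelled.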




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776174723



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:

Review comment:
       ```suggestion
   `Compute functions <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h>`_  have the following principal attributes:
   ```







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776182997



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.

Review comment:
       ```suggestion
   * `arrow/util/int_util_internal.h <https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h>`_  - defines utility functions
       * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in the `"portable_snippets" <https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h>`_ library.
   ```
   Links to the files are helpful.
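
For readers unfamiliar with the "safe math" idea behind the `WithOverflow` helpers mentioned above, the following is a minimal sketch of what such a helper can look like. It is an illustration only and not the code in `arrow/util/int_util_internal.h` or portable-snippets: `ToyAddWithOverflow` is a hypothetical name, and the convention assumed here is that the helper returns `true` when the addition would overflow and leaves the output untouched in that case.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a `WithOverflow`-style helper (assumption:
// returns true on overflow, false on success). Not Arrow's implementation.
inline bool ToyAddWithOverflow(std::int32_t a, std::int32_t b,
                               std::int32_t* out) {
  constexpr std::int32_t kMax = std::numeric_limits<std::int32_t>::max();
  constexpr std::int32_t kMin = std::numeric_limits<std::int32_t>::min();
  // Detect overflow before adding, because signed overflow is undefined
  // behavior in C++ and cannot be checked reliably after the fact.
  if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) {
    return true;  // overflow: *out is left unchanged
  }
  *out = a + b;
  return false;
}
```

A `_checked` arithmetic kernel could call a helper of this shape for each pair of values and report an error as soon as it returns `true`, while the default variant would skip the check.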







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776180707



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate

Review comment:
       ```suggestion
   **Categories of Vector Functions**
   
   * Associative Transforms
   * Selections
   * Sorts and Partitions
   * Structural Transforms
   
   
   Scalar Aggregate
   ~~~~~~~~~~~~~~~~
   
   A function that computes scalar summary statistics from array input.
   
   ### Hash Aggregate
   ```
   Consistent capitalization







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776179479



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms

Review comment:
       ```suggestion
       * Predicates
       * Transforms
       * Trimming
       * Splitting
       * Extraction
   * Containment Tests
   * Structural Transforms
   ```
   More consistent capitalization







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776186089



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+### Hash aggregate
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate the kernels' methods with C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).

Review comment:
       ```suggestion
       * Macros of the form `SCALAR_*` invoke `CallFunction` and require two function names: the default, which behaves like `SCALAR_EAGER_*`, and a `_checked` variant, which checks for overflow and underflow in the calculations.
   ```
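
To make the difference between the default and `_checked` behavior concrete, here is a minimal sketch in plain C++. It does not use Arrow's macros or headers: `AddWrapping` and `AddChecked` are hypothetical names chosen for illustration, and `std::optional` (C++17) is an assumption of the sketch rather than the way Arrow reports errors.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for the default variant: the result silently wraps
// around when it does not fit in 32 bits.
inline int32_t AddWrapping(int32_t a, int32_t b) {
  // Unsigned arithmetic wraps modulo 2^32, which avoids signed-overflow UB.
  // The conversion back to int32_t is modular in C++20 and on common
  // compilers before that.
  return static_cast<int32_t>(static_cast<uint32_t>(a) +
                              static_cast<uint32_t>(b));
}

// Hypothetical stand-in for the `_checked` variant: an empty optional signals
// that the true result is above the maximum or below the minimum of int32_t.
inline std::optional<int32_t> AddChecked(int32_t a, int32_t b) {
  const int64_t wide = static_cast<int64_t>(a) + static_cast<int64_t>(b);
  if (wide > std::numeric_limits<int32_t>::max() ||
      wide < std::numeric_limits<int32_t>::min()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(wide);
}
```

Both functions agree whenever the result fits; they differ only at the limits of the type, which is the case the `_checked` name is meant to cover.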




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776186089



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate the kernels' methods with C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).

Review comment:
       ```suggestion
       * Macros of the form `SCALAR_*` invoke `CallFunction` and require two function names: the default, which behaves like `SCALAR_EAGER_*`, and a `_checked` variant, which checks for overflow in the calculations.
   ```
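
The "two function names" part of the suggestion can be sketched as a wrapper that picks one of two registered names based on an option flag. This is a self-contained illustration, not Arrow code: `ArithmeticOptionsSketch`, `Registry`, `CallFunctionSketch` and `AddSketch` are hypothetical names, and throwing `std::overflow_error` is an assumption of the sketch.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical options struct; only the overflow flag is modeled.
struct ArithmeticOptionsSketch {
  bool check_overflow = false;
};

using BinaryFn = std::function<int32_t(int32_t, int32_t)>;

// Hypothetical miniature registry holding two names for one operation.
inline const std::map<std::string, BinaryFn>& Registry() {
  static const std::map<std::string, BinaryFn> registry = {
      {"add",
       [](int32_t a, int32_t b) -> int32_t {
         // Wraps on overflow (unsigned arithmetic avoids undefined behavior).
         return static_cast<int32_t>(static_cast<uint32_t>(a) +
                                     static_cast<uint32_t>(b));
       }},
      {"add_checked",
       [](int32_t a, int32_t b) -> int32_t {
         const int64_t wide = static_cast<int64_t>(a) + b;
         if (wide > std::numeric_limits<int32_t>::max() ||
             wide < std::numeric_limits<int32_t>::min()) {
           throw std::overflow_error("overflow");
         }
         return static_cast<int32_t>(wide);
       }},
  };
  return registry;
}

// Looks up a function by name and calls it, loosely mirroring a one-shot call.
inline int32_t CallFunctionSketch(const std::string& name, int32_t a, int32_t b) {
  return Registry().at(name)(a, b);
}

// The wrapper needs both names: the flag decides which one is dispatched.
inline int32_t AddSketch(int32_t a, int32_t b,
                         ArithmeticOptionsSketch options = {}) {
  const char* name = options.check_overflow ? "add_checked" : "add";
  return CallFunctionSketch(name, a, b);
}
```

A wrapper that needs only one name would skip the flag and always pass the same string to the lookup.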







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776171860



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.

Review comment:
       ```suggestion
   Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time, with the output having the same length as the input. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
   ```
   Maybe this is clearer
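
A small sketch of the two cases, using plain `std::vector` instead of Arrow arrays; `IsPositive` and `UniqueInOrder` are hypothetical helpers written only to illustrate the length and ordering point.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Element-wise: output[i] depends only on input[i], so the output always has
// the same length as the input.
inline std::vector<bool> IsPositive(const std::vector<int64_t>& values) {
  std::vector<bool> out;
  out.reserve(values.size());
  for (const int64_t v : values) {
    out.push_back(v > 0);
  }
  return out;
}

// Not element-wise: keeps the first occurrence of each value. The output may
// be shorter than the input, and its contents depend on the input order.
inline std::vector<int64_t> UniqueInOrder(const std::vector<int64_t>& values) {
  std::vector<int64_t> out;
  for (const int64_t v : values) {
    if (std::find(out.begin(), out.end(), v) == out.end()) {
      out.push_back(v);
    }
  }
  return out;
}
```

Reordering the input of `IsPositive` only reorders its output, whereas reordering the input of `UniqueInOrder` can change which positions survive.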







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776186962



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate the kernels' methods with C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).

Review comment:
       ```suggestion
   *  `arrow/compute/kernels/scalar_arithmetic.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc>`_  - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
   ```
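
The pattern this bullet describes (a class named after the compute function, with `Call` methods selected per input type) can be sketched roughly as follows. This is a simplified illustration that uses only the standard library: the real kernels take additional arguments and use Arrow's own type traits, and `ApplyBinary` is a hypothetical driver added so the sketch is usable on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// One struct per operation; each Call overload is enabled for one type family.
struct Add {
  // Floating-point inputs: plain addition.
  template <typename T>
  static std::enable_if_t<std::is_floating_point<T>::value, T> Call(T left,
                                                                    T right) {
    return left + right;
  }

  // Unsigned integers: wraps around on overflow.
  template <typename T>
  static std::enable_if_t<
      std::is_integral<T>::value && std::is_unsigned<T>::value, T>
  Call(T left, T right) {
    return static_cast<T>(left + right);
  }

  // Signed integers: add in the unsigned domain to avoid undefined behavior.
  template <typename T>
  static std::enable_if_t<
      std::is_integral<T>::value && std::is_signed<T>::value, T>
  Call(T left, T right) {
    using U = std::make_unsigned_t<T>;
    return static_cast<T>(
        static_cast<U>(static_cast<U>(left) + static_cast<U>(right)));
  }
};

// Hypothetical driver: applies Op::Call element by element over two inputs.
template <typename Op, typename T>
std::vector<T> ApplyBinary(const std::vector<T>& left,
                           const std::vector<T>& right) {
  std::vector<T> out;
  out.reserve(left.size());
  for (std::size_t i = 0; i < left.size() && i < right.size(); ++i) {
    out.push_back(Op::template Call<T>(left[i], right[i]));
  }
  return out;
}
```

Because the driver is templated on `Op`, a second operation only needs another struct with the same `Call` shape; the loop itself is written once.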







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776186962



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to generate the kernels' methods with C++ templates. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).

Review comment:
       ```suggestion
   *  `arrow/compute/kernels/scalar_arithmetic.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc>`_  - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
   ```
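To make the pattern in the suggested line concrete, here is a minimal standalone sketch of a class named after a compute function whose `Call` methods are parameterized for signed/unsigned integer and floating-point inputs. It is an illustration under assumptions, not the code in `scalar_arithmetic.cc`: the struct below has no Arrow dependencies, and the choice to wrap on integer overflow is only one possible behavior for an unchecked variant.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a per-operation kernel class (C++14).
// One `Call` overload is selected per input-type family via SFINAE.
struct Add {
  // Floating-point inputs: plain addition.
  template <typename T>
  static constexpr std::enable_if_t<std::is_floating_point<T>::value, T>
  Call(T left, T right) {
    return left + right;
  }

  // Unsigned integer inputs: the cast back to T wraps modulo 2^N.
  template <typename T>
  static constexpr std::enable_if_t<
      std::is_integral<T>::value && std::is_unsigned<T>::value, T>
  Call(T left, T right) {
    return static_cast<T>(left + right);
  }

  // Signed integer inputs: add in the unsigned domain so that the
  // addition itself never triggers signed-overflow undefined behavior.
  template <typename T>
  static constexpr std::enable_if_t<
      std::is_integral<T>::value && std::is_signed<T>::value, T>
  Call(T left, T right) {
    using U = std::make_unsigned_t<T>;
    return static_cast<T>(static_cast<U>(left) + static_cast<U>(right));
  }
};

// Compile-time check that the integer overload is usable in constant expressions.
static_assert(Add::Call<int32_t>(2, 3) == 5, "signed overload");
```

A kernel generator can then instantiate `Add::Call<T>` once per supported type instead of hand-writing one function per type.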







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776195216



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type_id, format string)` -  `ArrayFromJSON(float32, "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays

Review comment:
       ```suggestion
   1. Input types: Numerical (signed and unsigned, integral and floating-point)
   1. Input shapes: Operate on scalars or element-wise for arrays
   1. Output types: Numerical, same as input
   1. Output shapes: Same as input shape
   ```
   The [documentation](https://arrow.apache.org/docs/cpp/compute.html) does not combine Input and Output,
   which improves clarity because dependence of output on input can be made clear.
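As a rough illustration of the "output type same as input" point, the following standalone sketch shows an absolute-value operation whose return type is tied to its argument type. It is an assumption-laden example, not Arrow's kernel: the `AbsoluteValue` struct and its `Call` methods are hypothetical names that follow the PR's description, and the signed overload is unchecked (the minimum value of a signed type maps to itself).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <type_traits>

// Hypothetical unchecked absolute-value operation (C++14).
// Every overload returns T, so the output type always equals the input type.
struct AbsoluteValue {
  // Floating-point inputs: defer to the standard library.
  template <typename T>
  static std::enable_if_t<std::is_floating_point<T>::value, T> Call(T arg) {
    return std::fabs(arg);
  }

  // Unsigned integer inputs: already non-negative, return unchanged.
  template <typename T>
  static constexpr std::enable_if_t<
      std::is_integral<T>::value && std::is_unsigned<T>::value, T>
  Call(T arg) {
    return arg;
  }

  // Signed integer inputs: negate in the unsigned domain to avoid
  // undefined behavior; the minimum value wraps back to itself.
  template <typename T>
  static constexpr std::enable_if_t<
      std::is_integral<T>::value && std::is_signed<T>::value, T>
  Call(T arg) {
    using U = std::make_unsigned_t<T>;
    return arg < 0 ? static_cast<T>(U(0) - static_cast<U>(arg)) : arg;
  }
};

// The output type matches the input type for a narrow integer as well.
static_assert(
    std::is_same<decltype(AbsoluteValue::Call<int8_t>(int8_t{-3})), int8_t>::value,
    "output type should equal input type");
```

Listing input and output attributes separately, as suggested above, makes this dependence explicit: the shape and type of the result are stated in terms of the argument.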







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776202539



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred, and multi-word names are allowed due to the fact that string versions use an underscore instead of whitespace and C++ function names use camel case convention.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with the function's `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array of the given data type from a JSON string, e.g., `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+      T result = 0;
+      if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+        *st = Status::Invalid("overflow");
+      }
+      return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```

Review comment:
       ```suggestion
         template <typename T, typename Arg>
         static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
           static_assert(std::is_same<T, Arg>::value, "");
           return arg;
         }
   
         template <typename T, typename Arg>
         static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
           static_assert(std::is_same<T, Arg>::value, "");
           return (arg < static_cast<T>(0)) ? -arg : arg;
         }
       };
   
   ```
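
For readers without the Arrow headers at hand, the overflow handling in `AbsoluteValueChecked` can be tried in isolation. The following is a minimal sketch, not Arrow code: `SketchStatus`, `NegateWithOverflowSketch`, and `AbsoluteValueCheckedSketch` are hypothetical stand-ins for `Status`, `NegateWithOverflow`, and the kernel's `Call` methods, and the standard `std::enable_if` replaces Arrow's `enable_if_*` helpers. It only assumes the behavior described in the quoted document: on overflow the kernel returns zero and flags an error.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for arrow::Status: records only success or failure.
struct SketchStatus {
  bool ok = true;
  const char* message = "";
};

// Hypothetical helper playing the role of NegateWithOverflow: returns true
// when negating `value` would overflow, otherwise stores -value in *out.
template <typename T>
bool NegateWithOverflowSketch(T value, T* out) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integers only");
  // The minimum value is the only signed integer without a positive counterpart.
  if (value == std::numeric_limits<T>::min()) {
    return true;
  }
  *out = static_cast<T>(-value);
  return false;
}

// Signed integers: negate with an overflow check.
template <typename T>
typename std::enable_if<std::is_integral<T>::value && std::is_signed<T>::value, T>::type
AbsoluteValueCheckedSketch(T arg, SketchStatus* st) {
  if (arg < 0) {
    T result = 0;
    if (NegateWithOverflowSketch(arg, &result)) {
      st->ok = false;
      st->message = "overflow";
    }
    return result;  // stays zero when overflow was detected
  }
  return arg;
}

// Unsigned integers: the value is already its own absolute value.
template <typename T>
typename std::enable_if<std::is_integral<T>::value && std::is_unsigned<T>::value, T>::type
AbsoluteValueCheckedSketch(T arg, SketchStatus*) {
  return arg;
}

// Floating point: negation cannot overflow.
template <typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type
AbsoluteValueCheckedSketch(T arg, SketchStatus*) {
  return (arg < static_cast<T>(0)) ? -arg : arg;
}
```

Calling `AbsoluteValueCheckedSketch<std::int8_t>(-128, &st)` is the interesting case: +128 does not fit in eight bits, so the status is flagged and zero is returned instead of a wrapped value.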







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776206224



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, a suitable kernel corresponding to the value types of the inputs must first be selected.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than the value of each scalar value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps helps guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: the string version separates words with underscores, and the C++ function name uses camel case with a leading capital (e.g., `absolute_value` and `AbsoluteValue`).
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example, the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with the function's `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array of the given data type from a JSON string, e.g., `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+      T result = 0;
+      if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+        *st = Status::Invalid("overflow");
+      }
+      return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```
+
+1. Register the kernels in the corresponding registry along with the function's `FunctionDoc`.
+```C++
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
+```
+
+Example of Unary String Kernel: ASCII Reverse
+=============================================
+
+1. Name
+    * String literal: "ascii_reverse"
+    * C++ function names: `AsciiReverse`
+1. Input/output types: String-like (Printable ASCII)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: String transform
+1. Arity: Unary
+
+
+Example of Binary Arithmetic Kernel: Hypotenuse of Right-angled Triangle
+========================================================================
+
+1. Name
+    * String literal: "hypotenuse"
+    * C++ function names: `Hypotenuse`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Binary (length of each leg)
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> Hypotenuse(const Datum& left, const Datum& right, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```

Review comment:
       ```suggestion
   .. code-block:: cpp
   
       ARROW_EXPORT
       Result<Datum> Hypotenuse(const Datum& left, const Datum& right,
                                ArithmeticOptions options = ArithmeticOptions(),
                                ExecContext* ctx = NULLPTR);
   
   ```
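
Since the quoted document lists this function with binary arity (the length of each leg), it may help to see the element-wise operation on its own. The following is a minimal standalone sketch, not Arrow code: `HypotenuseCall` and `HypotenuseSketch` are hypothetical names, plain `std::vector<double>` stands in for Arrow arrays, and nulls, scalar broadcasting, and overflow-checked integer variants are deliberately left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical per-element operation: the length of the hypotenuse for legs x and y.
inline double HypotenuseCall(double x, double y) { return std::hypot(x, y); }

// Hypothetical array-array driver: applies the operation to each pair of inputs,
// producing an output of the same length as the inputs.
inline std::vector<double> HypotenuseSketch(const std::vector<double>& left,
                                            const std::vector<double>& right) {
  if (left.size() != right.size()) {
    throw std::invalid_argument("inputs must have the same length");
  }
  std::vector<double> out;
  out.reserve(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    out.push_back(HypotenuseCall(left[i], right[i]));
  }
  return out;
}
```

Using `std::hypot` rather than `std::sqrt(x * x + y * y)` avoids intermediate overflow when the legs are large, which is the floating-point counterpart of the "safe math" concern raised for integer kernels.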







[GitHub] [arrow] bkmgit commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
bkmgit commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r776205653



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on the value of each element alone.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it is often useful to use C++ templates to generate the kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the general structure of files/directory and principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry, if reference is NULL then the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Note that the long form of names is preferred. Multi-word names are allowed: the string version of the name separates words with underscores, and the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and #2.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======
+
+* `MakeArray` - convert a `Datum` to an ...
+* `ArrayFromJSON(type, JSON string)` - build an array from a JSON string, e.g. `ArrayFromJSON(float32(), "[1.3, 10.80, NaN, Inf, null]")`
+
+
+Benchmarking
+------------
+
+
+Example of Unary Arithmetic Function: Absolute Value
+====================================================
+
+Identify the principal attributes.
+
+1. Name
+    * String literal: "absolute_value"
+    * C++ function names: `AbsoluteValue`
+1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
+1. Input/output shapes: operate on scalars or element-wise for arrays
+1. Kind: Scalar
+    * Category: Arithmetic
+1. Arity: Unary
+
+
+Define compute function
+-----------------------
+
+Add compute function's prototype to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h
+
+```C++
+ARROW_EXPORT
+Result<Datum> AbsoluteValue(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
+```
+
+Add compute function's definition to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc
+Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).
+
+```C++
+SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
+```
+
+Define kernels of compute function
+----------------------------------
+
+The absolute value operation can overflow for signed integral inputs, so we need to define "safe" functions using the `portable_snippets` library.
+
+```C++
+SIGNED_UNARY_OPS_WITH_OVERFLOW(AbsoluteValueWithOverflow, abs)
+```
+
+Given that this is a "Scalar Arithmetic" function, its kernels will be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+```C++
+struct AbsoluteValue {
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
+  }
+};
+
+struct AbsoluteValueChecked {
+  template <typename T, typename Arg>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    if (arg < static_cast<T>(0)) {
+        T result = 0;
+        if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
+          *st = Status::Invalid("overflow");
+        }
+        return result;
+    }
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return arg;
+  }
+
+  template <typename T, typename Arg>
+  static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
+    static_assert(std::is_same<T, Arg>::value, "");
+    return (arg < static_cast<T>(0)) ? -arg : arg;
+  }
+};
+```
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+```C++
+const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
+                             ("Results will wrap around on integer overflow.\n"
+                              "Use function \"absolute_value_checked\" if you want overflow\n"
+                              "to return an error."),
+                             {"x"}};
+
+const FunctionDoc absolute_value_checked_doc{
+    "Calculate the absolute value of the argument element-wise",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"absolute_value\"."),
+    {"x"}};
+```
+
+Register kernels of compute function
+------------------------------------
+
+1. For the case of absolute value, the kernel generator `MakeUnaryArithmeticFunctionNotNull` was not available so it was added.
+
+
+1. Create the kernels by invoking the kernel generators.
+```C++
+  auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
+  auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
+```
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+```C++
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
+  DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
+```

Review comment:
       Use a reStructuredText code block.
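
For readers following the quoted `AbsoluteValue` / `AbsoluteValueChecked` kernels, the difference between the wrapping and the checked variant can be sketched without Arrow. This is only an illustration under stated assumptions: `AbsWrapping`, `AbsChecked`, and `AbsExample` are hypothetical names, not Arrow APIs, and the bool return value stands in for the `Status*` argument used in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for the default (unchecked) kernel: negate through
// the unsigned type so the most negative value wraps around instead of
// triggering undefined behavior. The conversion back to the signed type is
// modular on mainstream compilers (and guaranteed since C++20).
template <typename T>
T AbsWrapping(T value) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integer expected");
  using U = typename std::make_unsigned<T>::type;
  if (value >= 0) return value;
  return static_cast<T>(static_cast<U>(U{0} - static_cast<U>(value)));
}

// Hypothetical stand-in for the "_checked" kernel: report overflow instead
// of wrapping. Returns false (and leaves *out untouched) when |value| is
// not representable in T.
template <typename T>
bool AbsChecked(T value, T* out) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "signed integer expected");
  if (value == std::numeric_limits<T>::min()) return false;
  *out = value < 0 ? static_cast<T>(-value) : value;
  return true;
}

// Small usage example exercising both variants.
inline void AbsExample() {
  const std::int32_t lowest = std::numeric_limits<std::int32_t>::min();
  std::int32_t out = 0;

  assert(AbsWrapping<std::int32_t>(-5) == 5);
  assert(AbsWrapping<std::int32_t>(lowest) == lowest);  // wraps around

  assert(AbsChecked<std::int32_t>(-5, &out) && out == 5);
  assert(!AbsChecked<std::int32_t>(lowest, &out));      // overflow reported
}
```

The only input on which the two variants disagree is the minimum value of a signed type, which is why the quoted documentation needs a separate `FunctionDoc` for each variant.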







[GitHub] [arrow] nirandaperera commented on a change in pull request #10296: ARROW-12724: [C++] Add documentation for authoring compute kernels

Posted by GitBox <gi...@apache.org>.
nirandaperera commented on a change in pull request #10296:
URL: https://github.com/apache/arrow/pull/10296#discussion_r645107090



##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+

Review comment:
       Shall we add the comment by @pitrou in Zulip here?
   ```
   Simple questions for whether a function is a scalar function:
   - Do all inputs have the same (broadcasted) length?
   - Does the Nth element in the output only depend on the Nth element of each input?
   ```
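
A rough, Arrow-free sketch of how those two questions separate a scalar function from a vector function. The names `AddScalar`, `CumulativeSum`, and `ScalarVsVectorExample` are hypothetical and chosen only for illustration; a cumulative sum is used here simply as a familiar operation that fails the second question.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar-style: out[i] depends only on values[i] and the broadcast scalar,
// and the output has the same length as the array input.
std::vector<int> AddScalar(const std::vector<int>& values, int scalar) {
  std::vector<int> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    out[i] = values[i] + scalar;
  }
  return out;
}

// Vector-style: out[i] depends on values[0..i], so the Nth output is not a
// function of the Nth input alone.
std::vector<int> CumulativeSum(const std::vector<int>& values) {
  std::vector<int> out(values.size());
  std::partial_sum(values.begin(), values.end(), out.begin());
  return out;
}

// Changing only element 0 of the input shows the difference.
inline void ScalarVsVectorExample() {
  const std::vector<int> a = {1, 2, 3};
  const std::vector<int> b = {5, 2, 3};  // differs from a at index 0 only

  // Scalar-style: only output 0 changes.
  assert((AddScalar(a, 10) == std::vector<int>{11, 12, 13}));
  assert((AddScalar(b, 10) == std::vector<int>{15, 12, 13}));

  // Vector-style: every output changes.
  assert((CumulativeSum(a) == std::vector<int>{1, 3, 6}));
  assert((CumulativeSum(b) == std::vector<int>{5, 7, 10}));
}
```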

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,421 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on the value of each element alone.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are defined as `structs` with the same name as the compute function's API. These `structs` contain static *Call* methods representing the unique implementation for each argument signature. Apache Arrow uses SFINAE and alias-template conditionals to generalize kernel implementations for different argument types. Kernel implementations can also have the *constexpr* specifier where applicable.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
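
As a rough illustration of what a `WithOverflow` helper does, the sketch below checks a signed addition for overflow before performing it. It is not the Arrow or portable-snippets implementation: the name `ToyAddWithOverflow` is hypothetical, and both the explicit range check and the "return true on overflow" convention are assumptions made for this example.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Arrow): returns true if a + b would
// overflow T. Otherwise stores the sum in *out and returns false.
template <typename T>
bool ToyAddWithOverflow(T a, T b, T* out) {
  static_assert(std::numeric_limits<T>::is_integer &&
                    std::numeric_limits<T>::is_signed,
                "this sketch only handles signed integer types");
  // Test the range before adding, so the addition itself never overflows.
  if ((b > 0 && a > std::numeric_limits<T>::max() - b) ||
      (b < 0 && a < std::numeric_limits<T>::min() - b)) {
    return true;
  }
  *out = static_cast<T>(a + b);
  return false;
}
```

A "checked" kernel variant would call such a helper and report an error when it returns true, while the default variant would simply perform the operation.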
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred, and multi-word names are allowed: the string form of a name separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and the input/output types and shapes identified in the previous step.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+

Review comment:
       Shall we also add untyped test fixtures (`TEST_F`s)?

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
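
To make the terminology concrete, here is a minimal, self-contained sketch of a function registry with kernel dispatch. It is not Arrow's implementation: `ToyKernel`, `ToyFunction`, `ToyRegistry`, and the use of strings as type tags are all hypothetical names and simplifications chosen for illustration.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these types exist in Arrow.
struct ToyKernel {
  std::vector<std::string> signature;           // input types, e.g. {"double", "double"}
  std::function<double(double, double)> exec;   // the implementation for that signature
};

struct ToyFunction {
  std::string name;
  std::vector<ToyKernel> kernels;

  // "Dispatching": select the kernel whose signature matches the input types.
  const ToyKernel& Dispatch(const std::vector<std::string>& types) const {
    for (const ToyKernel& kernel : kernels) {
      if (kernel.signature == types) return kernel;
    }
    throw std::invalid_argument("no kernel of '" + name + "' matches the input types");
  }
};

// The "function registry": look up a function by its unique name.
class ToyRegistry {
 public:
  void Add(ToyFunction function) {
    std::string key = function.name;
    functions_.emplace(std::move(key), std::move(function));
  }

  // Throws std::out_of_range if no function with this name is registered.
  const ToyFunction& Get(const std::string& name) const { return functions_.at(name); }

 private:
  std::map<std::string, ToyFunction> functions_;
};
```

Looking up `"add"` in such a registry and dispatching on `{"double", "double"}` yields the one kernel registered for that signature. Arrow's real dispatch is richer than this exact match; `DispatchBest`, for example, can also consider implicit casts.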
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+  for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each individual value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
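
The idea of a kernel as a plain struct of function pointers, with its methods generated from one template, can be sketched as follows. This is a simplified illustration and not Arrow's `ScalarKernel`: `ToyScalarKernel`, `AddExec`, and `MakeToyAddKernels` are hypothetical names, and the type-erased `void*` signature is an assumption made to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified kernel: a type tag plus one function pointer.
struct ToyScalarKernel {
  std::string input_type;
  // Type-erased "exec" method: reads `length` values from each input
  // buffer and writes `length` results to `out`.
  void (*exec)(const void* left, const void* right, void* out, std::size_t length);
};

// A single function template supplies the exec method for every numeric type.
template <typename T>
void AddExec(const void* left, const void* right, void* out, std::size_t length) {
  const T* l = static_cast<const T*>(left);
  const T* r = static_cast<const T*>(right);
  T* o = static_cast<T*>(out);
  for (std::size_t i = 0; i < length; ++i) {
    o[i] = l[i] + r[i];
  }
}

// Generate the kernels by instantiating the template once per supported type.
inline std::vector<ToyScalarKernel> MakeToyAddKernels() {
  return {
      {"int32", &AddExec<std::int32_t>},
      {"int64", &AddExec<std::int64_t>},
      {"double", &AddExec<double>},
  };
}
```

Each entry differs only in its template argument, which is the property the paragraph above describes for the "add" function's kernels.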
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that makes use of assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negating an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+
+
+Kernel dispatcher
+-----------------
+
+* compute/exec.h
+    * Defines variants of `CallFunction` which are the one-shot functions for invoking compute functions. A compute function should invoke `CallFunction` in its definition.
+    * Defines `ExecContext` class
+    * ScalarExecutor applies scalar function to batch
+    * ExecBatchIterator::Make
+
+* `DispatchBest`
+
+* `FunctionRegistry` is the class representing a function registry. By default there is a single global registry where all kernels reside. `ExecContext` maintains a reference to the registry; if the reference is NULL, the default registry is used.
+
+* aggregate_basic.cc, aggregate_basic_internal.h - example of passing options to kernel
+    * scalaraggregator
+
+
+Portable snippets for safe (integer) math
+-----------------------------------------
+
+Arithmetic functions which can trigger integral overflow use the vendored library `portable_snippets` to perform "safe math" operations (e.g., arithmetic, logical shifts, casts).
+Kernel implementations suffixed with `WithOverflow` need to be defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h for each primitive datatype supported. Use the helper macros of the form `*OPS_WITH_OVERFLOW` to automatically generate the definitions. This file also contains helper functions for performing safe integral arithmetic for the kernels' default variant.
+
+The short-hand name maps to the predefined operation names in https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h#L1028-L1033. For example, `OPS_WITH_OVERFLOW(AddWithOverflow, add)` uses short-hand name `add`.
+
+
+Adding a new compute function
+=============================
+
+This section describes the process for adding a new compute function and associated kernel implementations.
+
+First, you should identify the principal attributes of the new compute function.
+The following series of steps help guide the design process.
+
+1. Decide on a unique name that fully represents the function's operation
+
+   Browse the [available compute functions](https://arrow.apache.org/docs/cpp/compute.html#available-functions) to prevent a name collision. Long-form names are preferred, and multi-word names are allowed: the string form of a name separates words with underscores, while the C++ function name uses camel case.
+     * What is a representative and unambiguous name for the operation performed by the compute function?
+     * If a related or variant form of a compute function is to be added in the future, is the current name extensible or specific enough to allow room for clear differentiation? For example, `str_length` is not a good name because there are different types of strings, so in this case it is preferable to be specific with `ascii_length` and `utf8_length`.
+
+1. Identify the input/output types/shapes
+    * What are the input types/shapes supported?
+    * If multiple inputs are expected, are they the same type/shape?
+
+1. Identify the compute function "kind" based on its operation and the input/output types and shapes identified in the previous step.
+    * Does the codebase of the "kind" provide full support for the new compute function?
+        * If not, is it straightforward to add the missing parts or can the new compute function be supported by another "kind"?
+
+
+Define compute function
+-----------------------
+
+Add the compute function prototype and definition to the corresponding source files based on its "kind". For example the API of a "Scalar" function is found in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h and its definition in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.cc.
+
+
+
+Define kernels of compute function
+----------------------------------
+
+Define the kernel implementations in the corresponding source file based on the compute function's "kind" and category. For example, a "Scalar" arithmetic function has kernels defined in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc.
+
+Create compute function documentation (`FunctionDoc` object)
+------------------------------------------------------------
+
+Each compute function has documentation which includes a summary, description, and argument types of its operation. A `FunctionDoc` object is instantiated and used in the registration step. Note that for compute functions that can overflow, another `FunctionDoc` is required for the `_checked` variant.
+
+Register kernels of compute function
+------------------------------------
+
+1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
+
+1. Create the kernels by invoking the kernel generators.
+
+1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
+
+
+Testing
+-------
+
+Arrow uses the GoogleTest framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid inputs. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.
+
+`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.
+
+Helpers
+=======

Review comment:
       I think it'd be nicer if we could discuss some of the helper methods in `codegen_internal.h`.

##########
File path: docs/source/cpp/authoring_compute_functions.rst
##########
@@ -0,0 +1,423 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+===========================
+Authoring Compute Functions
+===========================
+
+Compute Functions
+=================
+
+An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
+
+The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
+
+Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
+
+Terminology:
+* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
+* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs.
+* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.
+
+[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
+* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
+* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
+  which indicates in what context it is valid for use
+    * Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
+    * Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
+* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
+* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
+  for its core operation. Functions are commonly nullary, unary, binary, or ternary, but can also be variadic.
+* ["Documentation"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE) describing the function's functionality and behavior
+* ["Options"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE) specifying configuration of the function's behavior.
+
+Compute functions are grouped in source files based on their "kind" in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
+Kernels of compute functions are grouped in source files based on their "kind" and category, see https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels.
+
+
+Kinds of compute functions
+--------------------------
+
+Arrow uses an enumerated type to identify the kind of a compute function, refer to
+https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/function.h
+
+Scalar
+~~~~~~
+
+A function that performs scalar data operations on whole arrays of
+data. Can generally process Array or Scalar values. The size of the
+output will be the same as the size (or broadcasted size, in the case
+of mixing Array and Scalar inputs) of the input.
+
+https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions
+
+**Categories of Scalar functions**
+
+* Arithmetic
+* Comparisons
+* Logical
+* String
+    * predicates
+    * transforms
+    * trimming
+    * splitting
+    * extraction
+* Containment tests
+* Structural transforms
+* Conversions
+
+
+Vector
+~~~~~~
+
+A function with array input and output whose behavior depends on the
+values of the entire arrays passed, rather than on each individual value.
+
+**Categories of Vector functions**
+
+* Associative transforms
+* Selections
+* Sorts and partitions
+* Structural transforms
+
+
+Scalar aggregate
+~~~~~~~~~~~~~~~~
+
+A function that computes scalar summary statistics from array input.
+
+Hash aggregate
+~~~~~~~~~~~~~~
+
+A function that computes grouped summary statistics from array input
+and an array of group identifiers.
+
+Meta
+~~~~
+
+A function that dispatches to other functions and does not contain its own kernels.
+
+
+
+Kernels
+-------
+
+Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
+
+Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
+
+Function options
+----------------
+
+[FunctionOptions](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute15FunctionOptionsE)
+
+
+Function documentation
+----------------------
+
+[FunctionDoc](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute11FunctionDocE)
+
+
+Files and structures of the compute layer
+=========================================
+
+This section describes the file and directory layout and the principal code structures of the compute layer.
+
+* arrow/util/int_util_internal.h - defines utility functions
+    * Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
+
+* compute/api_scalar.h - contains
+    * Subclasses of `FunctionOptions` for specific categories of compute functions
+    * API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
+* compute/api_scalar.cc - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
+    * Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
+    * Macros of the form `SCALAR_*` select between the default function and its `_checked` variant, depending on whether overflow checking is requested in the options, and then invoke `CallFunction`. They therefore require two function names.
+
+* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Each kernel is defined by a class named after its compute function, with methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
+    * For compute functions that may trigger overflow, the "checked" variant is a class suffixed with `Checked` that uses assertions and overflow checks. If overflow occurs, the kernel returns zero and sets the `Status*` error flag.
+        * For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is still provided but should trigger an error with `DCHECK(false) << "This is included only for the purposes of instantiability from the arithmetic kernel generator"` and return zero.
+

Review comment:
       I think we should discuss the guarantees provided by the `compute` infrastructure as well.
   For example, for scalar functions, if multiple arrays are passed, the compute infrastructure checks for nullity, guarantees that they are of the same size, etc.
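
A rough, standalone sketch of the pattern the hunk above describes may help here: a class named after the function with type-parameterized `Call` methods, a "checked" variant that returns zero and flags an error on overflow, and a generator that checks input sizes before calling the op. This is not Arrow's actual code. `SketchStatus`, `AddCheckedSketch` and `ApplyBinarySketch` are hypothetical names made up for illustration, and null handling is left out entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a status object; it records only the first error.
struct SketchStatus {
  bool ok = true;
  std::string message;
  void SetError(const std::string& msg) {
    if (ok) {
      ok = false;
      message = msg;
    }
  }
};

// "Checked" op: one class per function, with `Call` overloads chosen by type.
struct AddCheckedSketch {
  // Signed integers: test for overflow before adding (signed overflow is UB).
  template <typename T>
  static typename std::enable_if<
      std::is_integral<T>::value && std::is_signed<T>::value, T>::type
  Call(T left, T right, SketchStatus* st) {
    const T max = std::numeric_limits<T>::max();
    const T min = std::numeric_limits<T>::min();
    if ((right > 0 && left > max - right) || (right < 0 && left < min - right)) {
      st->SetError("overflow");
      return 0;  // on overflow: return zero and set the error flag
    }
    return static_cast<T>(left + right);
  }

  // Unsigned integers: the sum overflows if it would exceed the maximum.
  template <typename T>
  static typename std::enable_if<std::is_unsigned<T>::value, T>::type
  Call(T left, T right, SketchStatus* st) {
    if (left > std::numeric_limits<T>::max() - right) {
      st->SetError("overflow");
      return 0;
    }
    return static_cast<T>(left + right);
  }

  // Floating point: no overflow check in this sketch.
  template <typename T>
  static typename std::enable_if<std::is_floating_point<T>::value, T>::type
  Call(T left, T right, SketchStatus*) {
    return left + right;
  }
};

// Hypothetical "kernel generator": one function template serves every input
// type and verifies that both inputs have the same length before calling Op.
template <typename Op, typename T>
std::vector<T> ApplyBinarySketch(const std::vector<T>& left,
                                 const std::vector<T>& right,
                                 SketchStatus* st) {
  std::vector<T> out;
  if (left.size() != right.size()) {
    st->SetError("length mismatch");
    return out;
  }
  out.reserve(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    out.push_back(Op::Call(left[i], right[i], st));
  }
  return out;
}
```

Because the length check lives in the generator rather than in each `Call`, the op classes can assume well-formed inputs, which is the kind of guarantee that would be worth spelling out in the docs.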



