You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/15 14:09:43 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8457: ARROW-9164: [C++] Add embedded documentation to compute functions

jorisvandenbossche commented on a change in pull request #8457:
URL: https://github.com/apache/arrow/pull/8457#discussion_r505527509



##########
File path: cpp/src/arrow/compute/kernels/aggregate_var_std.cc
##########
@@ -180,20 +180,38 @@ void AddVarStdKernels(KernelInit init,
   }
 }
 
+FunctionDoc stddev_doc{
+    "Calculate the standard deviation of a numeric array",
+    ("The number of degrees of freedom can be controlled using VarianceOptions.\n"
+     "By default (`ddof` = 0), the population standard deviation is calculated.\n"
+     "Nulls are ignored.  If there are not enough non-null values in the array\n"
+     "to satisfy `ddof`, an error is returned."),

Review comment:
       At least on the Python side, it seems that null is returned instead of an error?
   
   ```
   In [19]: pc.stddev(pa.array([], pa.float64()))
   Out[19]: <pyarrow.DoubleScalar: None>
   ```
   ```

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##########
@@ -277,29 +278,80 @@ std::shared_ptr<ScalarFunction> MakeArithmeticFunction(std::string name) {
 // Like MakeArithmeticFunction, but for arithmetic ops that need to run
 // only on non-null output.
 template <typename Op>
-std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name) {
-  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary());
+std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name,
+                                                              FunctionDoc doc) {
+  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary(), std::move(doc));
   for (const auto& ty : NumericTypes()) {
     auto exec = NumericEqualTypesBinary<ScalarBinaryNotNullEqualTypes, Op>(ty);
     DCHECK_OK(func->AddKernel({ty, ty}, ty, exec));
   }
   return func;
 }
 
+FunctionDoc add_doc{"Add the elements of two arrays",

Review comment:
       Not necessarily two "arrays" .. Can also be scalar(s), etc. Now, not sure how to clearly rephrase this while still being concise but also not overly generic (like "objects"), so maybe keeping as arrays is fine. 

##########
File path: cpp/src/arrow/compute/kernels/scalar_boolean.cc
##########
@@ -171,21 +171,67 @@ void MakeFunction(std::string name, int arity, ArrayKernelExec exec,
   DCHECK_OK(registry->AddFunction(std::move(func)));
 }
 
+FunctionDoc invert_doc{"Invert boolean values", "", {"values"}};
+
+FunctionDoc and_doc{"Logical 'and' boolean values",
+                    ("When a null is encountered is either input, a null is output.\n"
+                     "For a different null behavior, see function \"and_kleene\"."),
+                    {"x", "y"}};
+
+FunctionDoc or_doc{"Logical 'or' boolean values",
+                   ("When a null is encountered is either input, a null is output.\n"
+                    "For a different null behavior, see function \"or_kleene\"."),
+                   {"x", "y"}};
+
+FunctionDoc xor_doc{"Logical 'xor' boolean values",
+                    ("When a null is encountered is either input, a null is output."),
+                    {"x", "y"}};
+
+FunctionDoc and_kleene_doc{
+    "Logical 'and' boolean values (Kleene logic)",
+    ("This function behaves as follows with nulls:\n\n"
+     "- true and null = null\n"
+     "- null and true = null\n"
+     "- false and null = false\n"
+     "- null and false = false\n"
+     "- null and null = null\n"
+     "\n"
+     "In other words, in this context a null value really means \"unknown\",\n"
+     "and an unknown value 'and' false is always false.\n"
+     "For a different null behavior, see function \"and\"."),
+    {"x", "y"}};
+
+FunctionDoc or_kleene_doc{
+    "Logical 'or' boolean values (Kleene logic)",
+    ("This function behaves as follows with nulls:\n\n"
+     "- true or null = true\n"
+     "- null and true = true\n"
+     "- false and null = null\n"
+     "- null and false = null\n"
+     "- null and null = null\n"
+     "\n"
+     "In other words, in this context a null value really means \"unknown\",\n"
+     "and an unknown value 'or' true is always true.\n"
+     "For a different null behavior, see function \"and\"."),

Review comment:
       ```suggestion
        "For a different null behavior, see function \"or\"."),
   ```

##########
File path: cpp/src/arrow/compute/kernels/vector_nested.cc
##########
@@ -65,18 +65,33 @@ Result<ValueDescr> ValuesType(KernelContext*, const std::vector<ValueDescr>& arg
   return ValueDescr::Array(list_type.value_type());
 }
 
+FunctionDoc list_flatten_doc(
+    "Flatten list values",
+    ("`lists` must have a list-like type.\n"
+     "Return an array with the top list level flattened.\n"
+     "Top-level null values in `lists` do not emit anything in the input."),
+    {"lists"});
+
+FunctionDoc list_parent_indices_doc(
+    "Compute parent indices of nested list values",
+    ("`lists` must have a list-like type.\n"
+     "For each value in each list of `lists`, the top-level list index\n"

Review comment:
       Not for this PR, but I think that for conceptually more complex functions (like this one, for me), it might be useful to include a small example. But that might get verbose to write in this form ..
   
   For example, for this case: `[[1, 2], [3, 4, 5], [6]]` returns `[0, 0, 1, 1, 1, 2]`

##########
File path: cpp/src/arrow/compute/kernels/vector_sort.cc
##########
@@ -302,19 +299,38 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) {
   }
 }
 
+FunctionDoc sort_indices_doc(
+    "Return the indices that would sort an array",
+    ("Compute an array of indices that define a non-stable sort\n"
+     "of the input array.  Null values are considered greater than any\n"
+     "other value and are therefore sorted at the end of the array."),
+    {"array"});
+
+FunctionDoc partition_nth_indices_doc(
+    "Return the indices that would partition an array around a pivot",
+    ("Compute an array of indices that define a non-stable partition\n"
+     "around the `N`th smallest element of the input array.\n"

Review comment:
       I think what you wrote in the compute.rst guide is clearer:
   
   > * \(1) The output is an array of indices into the input array, that define
     a partial sort such that the *N*'th index points to the *N*'th element
     in sorted order, and all indices before the *N*'th point to elements
     less or equal to elements at or after the *N*'th (similar to
     :func:`std::nth_element`).  *N* is given in
     :member:`PartitionNthOptions::pivot`.

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##########
@@ -277,29 +278,80 @@ std::shared_ptr<ScalarFunction> MakeArithmeticFunction(std::string name) {
 // Like MakeArithmeticFunction, but for arithmetic ops that need to run
 // only on non-null output.
 template <typename Op>
-std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name) {
-  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary());
+std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name,
+                                                              FunctionDoc doc) {
+  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary(), std::move(doc));
   for (const auto& ty : NumericTypes()) {
     auto exec = NumericEqualTypesBinary<ScalarBinaryNotNullEqualTypes, Op>(ty);
     DCHECK_OK(func->AddKernel({ty, ty}, ty, exec));
   }
   return func;
 }
 
+FunctionDoc add_doc{"Add the elements of two arrays",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"add_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc add_checked_doc{
+    "Add the elements of two arrays",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"add\"."),
+    {"x", "y"}};
+
+FunctionDoc sub_doc{"Substract the elements of an array from another",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"subtract_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc sub_checked_doc{
+    "Substract the elements of an array from another",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"subtract\"."),
+    {"x", "y"}};
+
+FunctionDoc mul_doc{"Multiply the elements of two arrays",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"multiply_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc mul_checked_doc{
+    "Multiply the elements of two arrays",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"multiply\"."),
+    {"x", "y"}};
+
+FunctionDoc div_doc{"Divide the elements of an array by another",
+                    ("An error is returned when trying to divide by zero.\n"

Review comment:
       This seems the case when dividing integers:
   
   ```
   In [29]: pc.divide([1, 2], 0)
   ...
   ArrowInvalid: divide by zero
   ```
   
   but not when dividing floats:
   
   ```
   In [28]: pc.divide([1., 2], 0.)
   Out[28]: 
   <pyarrow.lib.DoubleArray object at 0x7fab75767588>
   [
     inf,
     inf
   ]
   ```
   
   

##########
File path: cpp/src/arrow/compute/kernels/scalar_boolean.cc
##########
@@ -171,21 +171,67 @@ void MakeFunction(std::string name, int arity, ArrayKernelExec exec,
   DCHECK_OK(registry->AddFunction(std::move(func)));
 }
 
+FunctionDoc invert_doc{"Invert boolean values", "", {"values"}};
+
+FunctionDoc and_doc{"Logical 'and' boolean values",
+                    ("When a null is encountered is either input, a null is output.\n"

Review comment:
       ```suggestion
                       ("When a null is encountered in either input, a null is output.\n"
   ```
   
   and same for the 2 docstrings below

##########
File path: cpp/src/arrow/compute/kernels/scalar_validity.cc
##########
@@ -123,13 +123,21 @@ void IsNullExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
   }
 }
 
+FunctionDoc is_valid_doc(
+    "Return true if non-null",
+    ("For each input value, emit true iff the value is valid (non-null)."), {"input"});

Review comment:
       Regarding the argument name, for other unary functions above (where the input values were not of a specific type, like strings or lists), also "values" was used. Do we want to try to be somewhat consistent here? 

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##########
@@ -277,29 +278,80 @@ std::shared_ptr<ScalarFunction> MakeArithmeticFunction(std::string name) {
 // Like MakeArithmeticFunction, but for arithmetic ops that need to run
 // only on non-null output.
 template <typename Op>
-std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name) {
-  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary());
+std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name,
+                                                              FunctionDoc doc) {
+  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary(), std::move(doc));
   for (const auto& ty : NumericTypes()) {
     auto exec = NumericEqualTypesBinary<ScalarBinaryNotNullEqualTypes, Op>(ty);
     DCHECK_OK(func->AddKernel({ty, ty}, ty, exec));
   }
   return func;
 }
 
+FunctionDoc add_doc{"Add the elements of two arrays",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"add_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc add_checked_doc{
+    "Add the elements of two arrays",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"add\"."),
+    {"x", "y"}};
+
+FunctionDoc sub_doc{"Substract the elements of an array from another",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"subtract_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc sub_checked_doc{
+    "Substract the elements of an array from another",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"subtract\"."),
+    {"x", "y"}};
+
+FunctionDoc mul_doc{"Multiply the elements of two arrays",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"multiply_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc mul_checked_doc{
+    "Multiply the elements of two arrays",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"multiply\"."),
+    {"x", "y"}};
+
+FunctionDoc div_doc{"Divide the elements of an array by another",
+                    ("An error is returned when trying to divide by zero.\n"
+                     "However, integer overflow will silently wrap around.\n"
+                     "Use function \"divide_checked\" if you want integer\n"
+                     "overflow to also return an error."),
+                    {"dividend", "divisor"}};
+
+FunctionDoc div_checked_doc{
+    "Divide the elements of an array by another",
+    ("An error is returned when trying to divide by zero, or when\n"
+     "integer overflow is encountered.\n"),

Review comment:
       ```suggestion
        "integer overflow is encountered."),
   ```
   
   (the others don't end with a newline, and in python this results in two blank lines)

##########
File path: cpp/src/arrow/compute/kernels/vector_sort.cc
##########
@@ -302,19 +299,38 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) {
   }
 }
 
+FunctionDoc sort_indices_doc(
+    "Return the indices that would sort an array",
+    ("Compute an array of indices that define a non-stable sort\n"
+     "of the input array.  Null values are considered greater than any\n"
+     "other value and are therefore sorted at the end of the array."),

Review comment:
       I wanted to comment to maybe mention NaN as well (also sorted last, but before null). But apparently my expectation is not true, we don't really have well-defined sort behavior with NaNs present. Will open a JIRA about that
   
   For example
   
   ```
   In [39]: arr = pa.array([2, 1, np.nan, 2, 1])
   
   In [40]: pc.take(arr, pc.sort_indices(arr))
   Out[40]: 
   <pyarrow.lib.DoubleArray object at 0x7fc693d07c48>
   [
     1,
     2,
     nan,
     1,
     2
   ]
   
   In [41]: arr = pa.array([2, 1, np.nan, 2, np.nan, 1])
   
   In [42]: pc.take(arr, pc.sort_indices(arr))
   Out[42]: 
   <pyarrow.lib.DoubleArray object at 0x7fc693d10348>
   [
     1,
     1,
     2,
     nan,
     2,
     nan
   ]
   ```
   
   (for which I can't really infer the logic)

##########
File path: cpp/src/arrow/compute/kernels/vector_sort.cc
##########
@@ -302,19 +299,38 @@ void AddSortingKernels(VectorKernel base, VectorFunction* func) {
   }
 }
 
+FunctionDoc sort_indices_doc(
+    "Return the indices that would sort an array",
+    ("Compute an array of indices that define a non-stable sort\n"
+     "of the input array.  Null values are considered greater than any\n"
+     "other value and are therefore sorted at the end of the array."),
+    {"array"});
+
+FunctionDoc partition_nth_indices_doc(
+    "Return the indices that would partition an array around a pivot",
+    ("Compute an array of indices that define a non-stable partition\n"
+     "around the `N`th smallest element of the input array.\n"

Review comment:
       For me it's not fully clear what this "partition around the Nth smallest element" means.

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##########
@@ -277,29 +278,80 @@ std::shared_ptr<ScalarFunction> MakeArithmeticFunction(std::string name) {
 // Like MakeArithmeticFunction, but for arithmetic ops that need to run
 // only on non-null output.
 template <typename Op>
-std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name) {
-  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary());
+std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name,
+                                                              FunctionDoc doc) {
+  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary(), std::move(doc));
   for (const auto& ty : NumericTypes()) {
     auto exec = NumericEqualTypesBinary<ScalarBinaryNotNullEqualTypes, Op>(ty);
     DCHECK_OK(func->AddKernel({ty, ty}, ty, exec));
   }
   return func;
 }
 
+FunctionDoc add_doc{"Add the elements of two arrays",

Review comment:
       Just checked, for np.add ufunc, they use "Add arguments element-wise"

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##########
@@ -277,29 +278,80 @@ std::shared_ptr<ScalarFunction> MakeArithmeticFunction(std::string name) {
 // Like MakeArithmeticFunction, but for arithmetic ops that need to run
 // only on non-null output.
 template <typename Op>
-std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name) {
-  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary());
+std::shared_ptr<ScalarFunction> MakeArithmeticFunctionNotNull(std::string name,
+                                                              FunctionDoc doc) {
+  auto func = std::make_shared<ScalarFunction>(name, Arity::Binary(), std::move(doc));
   for (const auto& ty : NumericTypes()) {
     auto exec = NumericEqualTypesBinary<ScalarBinaryNotNullEqualTypes, Op>(ty);
     DCHECK_OK(func->AddKernel({ty, ty}, ty, exec));
   }
   return func;
 }
 
+FunctionDoc add_doc{"Add the elements of two arrays",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"add_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc add_checked_doc{
+    "Add the elements of two arrays",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"add\"."),
+    {"x", "y"}};
+
+FunctionDoc sub_doc{"Substract the elements of an array from another",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"subtract_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc sub_checked_doc{
+    "Substract the elements of an array from another",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"subtract\"."),
+    {"x", "y"}};
+
+FunctionDoc mul_doc{"Multiply the elements of two arrays",
+                    ("Results will wrap around on integer overflow.\n"
+                     "Use function \"multiply_checked\" if you want overflow\n"
+                     "to return an error."),
+                    {"x", "y"}};
+
+FunctionDoc mul_checked_doc{
+    "Multiply the elements of two arrays",
+    ("This function returns an error on overflow.  For a variant that\n"
+     "doesn't fail on overflow, use function \"multiply\"."),
+    {"x", "y"}};
+
+FunctionDoc div_doc{"Divide the elements of an array by another",
+                    ("An error is returned when trying to divide by zero.\n"

Review comment:
       It might also be worth to mention that integer division doesn't return floats, but is basically a "rounded" division (also not a floor division it seems)
   
   (if this is intentional, of course)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org