Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/27 21:25:45 UTC

[GitHub] [arrow] westonpace commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

westonpace commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r931518702


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.

Review Comment:
   ```suggestion
   To register a UDF, a function name, function docs, input types and output type need to be defined.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.

Review Comment:
   ```suggestion
   ```
   
   First, we probably don't need to call this out.  Second, we don't do a great job of defining "scalar" anywhere.
   
   A "compute function" doesn't have to be a scalar function, and neither does a custom UDF.  Nothing prevents you today from calling `register_scalar_udf` and providing a non-scalar function.
   
   However, if that compute function is going to be used in a project node, then it must be a scalar function.  We won't detect that it is non-scalar (I'm not sure this is possible).  Instead, if it is non-scalar, the user will just get unexpected output (or an error, hopefully, if they are returning a different-sized output).
   
   When we add aggregate functions, do we envision a new `register_aggregate_function`?  I suppose it would be different in that it would take multiple functions rather than a single one.
   
   Maybe we could add a section at the end defining what a scalar function is when we talk about using these functions in the datasets API?
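   To make the distinction concrete, such a section could open with a plain-Python sketch like the following (no pyarrow needed; the function names here are made up for illustration).  A function is "scalar" in this sense when it is elementwise: output row `i` depends only on input row `i`, and the output has the same length as the input.

   ```python
   def scalar_double(xs):
       # Scalar: elementwise, and the output length matches the input length.
       return [2 * x for x in xs]

   def non_scalar_cumsum(xs):
       # Non-scalar: output row i depends on all rows <= i.
       out, total = [], 0
       for x in xs:
           total += x
           out.append(total)
       return out

   xs = [1, 2, 3]
   # The elementwise property holds for the scalar function...
   assert all(scalar_double([x]) == [scalar_double(xs)[i]] for i, x in enumerate(xs))
   # ...but not for the cumulative sum (row 2 depends on rows 0 and 1).
   assert non_scalar_cumsum([3]) != [non_scalar_cumsum(xs)[2]]
   ```

   A project node evaluates each batch independently, which is why only functions with this property behave correctly there.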



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,

Review Comment:
   ```suggestion
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+In case of all scalar inputs, make sure to return the final output as an array.
+
+More generally, UDFs can be used with tabular data by using `dataset` API and apply a UDF function on a
+dataset.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds

Review Comment:
   This example is "ok", but it seems a little nonsensical that the user would want `5.2 * total_amount + 2.1`.
   
   We should also discuss that this is equivalent to what you can do without defining a UDF here:
   
   ```
   'total_amount_projected($)': ds.field('')._call("add", [2.1, ds.field('')._call("multiply", [5.2, ds.field("total_amount($)")])]),
   ```
   
   Plus, we should probably address the `ds.field('')._call` issue before we worry too much about extensive documentation.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   If we are going to keep this example then we should:
   
    * Change the above `..note::` so that it isn't a note but a dedicated subsection.
    * Move the paragraph starting with "More generally," to come after the example.
    * Add a paragraph motivating the example.
   
   However, I think we should expand this subsection so the message isn't "Here is a weird case that happens when all inputs are scalar" but rather "Your UDF should generally be capable of handling different combinations of scalar and array shaped inputs".
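   The shape-agnostic pattern could be motivated with a plain-Python sketch like this one (lists stand in for Arrow arrays and bare values stand in for Arrow scalars; `broadcast` and `affine_any_shape` are hypothetical names): normalize every scalar argument to the batch length, then compute elementwise.

   ```python
   def broadcast(value, length):
       # Array-shaped arguments (lists here) pass through unchanged;
       # a lone scalar is repeated to the batch length.
       return value if isinstance(value, list) else [value] * length

   def affine_any_shape(m, x, c):
       args = [m, x, c]
       lengths = [len(a) for a in args if isinstance(a, list)]
       n = lengths[0] if lengths else 1  # all-scalar call: batch of size 1
       m, x, c = (broadcast(a, n) for a in args)
       return [mi * xi + ci for mi, xi, ci in zip(m, x, c)]

   # Mixed scalar/array inputs and all-scalar inputs both work:
   assert affine_any_shape(2.0, [1.0, 2.0], 0.5) == [2.5, 4.5]
   assert affine_any_shape(2.0, 10.0, 0.5) == [20.5]
   ```

   In the actual UDF, delegating to compute kernels such as `pc.multiply`/`pc.add` (as the first `affine` example in the diff does) achieves the same effect, since those kernels already broadcast scalars against arrays.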



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+In case of all scalar inputs, make sure to return the final output as an array.
+
+More generally, UDFs can be used with tabular data by using `dataset` API and apply a UDF function on a
+dataset.

Review Comment:
   This paragraph should be replaced by the 'More generally" paragraph above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org