You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/22 08:10:13 UTC

[GitHub] [arrow] vibhatha opened a new pull request, #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

vibhatha opened a new pull request, #13687:
URL: https://github.com/apache/arrow/pull/13687

   This PR contains Python Scalar UDF documentation as an experimental version of docs. 
   At the moment we only support Scalar UDFs and the code snippets include how to use
   UDFs with PyArrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192589609

   @pitrou could you please help me with adding API docs? or explain how to do that? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927597597


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()

Review Comment:
   Yeah we should probably get floats to this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928100005


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   Well, the all-scalar scenario works with the original UDF? no?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1262100163

   Thanks everyone for the reviews 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240409035

   >  I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data. For example, numpy. Here is an example that exposes numpy's gcd function (greatest common divisor) as an Arrow function
   
   I think this would indeed be a more compelling example. 
   
   Another example could be a specific python functionality (eg something from `ipaddress`, to check or extract some information from strings that are supposed to be ipaddresses), although this will typically only work on scalars, and thus will be slow (but it's still an example how you can use this within arrow). Or another example could be a custom function implemented in numba. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240682470

   Please kind in mind the example should remain simple enough, and not involve unusual dependencies. So I would simply keep the gcd example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1256826439

   @jorisvandenbossche I updated the PR. Could you please take a look at the newly added section on scalar functions? It is the second paragraph of this added section. 
   
   cc @westonpace @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1193024571

   > @vibhatha Can I ask you to re-read your changes before pushing them? This would help minimize the review cycles. There are several redundancies and oddities here.
   
   Of course, sorry about the issues. Will make sure to avoid this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192719472

   > @vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)
   
   And also reference the given symbols in https://github.com/apache/arrow/blob/master/docs/source/python/api/compute.rst


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928048727


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+When all the inputs are scalar, the input is a size=1 array and the values has to be properly
+treated within the UDF. And also make sure to include the final output as a size=1 array.

Review Comment:
   Mmm… yes I will trim this. 



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+When all the inputs are scalar, the input is a size=1 array and the values has to be properly
+treated within the UDF. And also make sure to include the final output as a size=1 array.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.

Review Comment:
   👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1225176371

   > I think part of the challenge with this documentation is that implementing `affine` in pure-python is not a very compelling use case. I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data. For example, numpy. Here is an example that exposes numpy's `gcd` function (greatest common divisor) as an Arrow function:
   > 
   > ```
   > import numpy as np
   > 
   > import pyarrow as pa
   > import pyarrow.compute as pc
   > 
   > function_name = "numpy_gcd"
   > function_docs = {
   >        "summary": "Calculates the greatest common divisor",
   >        "description":
   >            "Given 'x' and 'y' find the greatest number that divides\n"
   >            "evenly into both x and y."
   > }
   > 
   > input_types = {
   >    "x" : pa.int64(),
   >    "y" : pa.int64()
   > }
   > 
   > output_type = pa.int64()
   > 
   > def to_np(val):
   >     if isinstance(val, pa.Scalar):
   >         return val.as_py()
   >     else:
   >         return np.array(val)
   > 
   > def gcd_numpy(ctx, x, y):
   >     np_x = to_np(x)
   >     np_y = to_np(y)
   >     return pa.array(np.gcd(np_x, np_y))
   > 
   > pc.register_scalar_function(gcd_numpy,
   >                             function_name,
   >                             function_docs,
   >                             input_types,
   >                             output_type)
   > 
   > print('gcd(27, 63) should be 9')
   > print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.scalar(63)])}')
   > print()
   > print('gcd([27, 18], [54, 63]) should be [27, 9]')
   > print(f'Answer={pc.call_function(function_name, [pa.array([27, 18]), pa.array([54, 63])])}')
   > print()
   > print('gcd(27, [54, 18]) should be [27, 9]')
   > print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.array([54, 18])])}')
   > ```
   > 
   > Notice the use of the helper function `to_np` to convert from inputs of different shapes to ensure that we get something that numpy can work with.
   
   I see your point. I will update the example to use this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1261119849

   @jorisvandenbossche WDYT about this: https://github.com/apache/arrow/pull/13687#discussion_r982584769


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r982972286


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()

Review Comment:
   I understand your point. That's why the description included that idea. We didn't change the UDF kernels since it was merged. So for the moment I will document what's missing from the current docs. cc @westonpace @pitrou 
   What's the best? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r978199590


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values 
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
+the latter produces an expression). 
+This expression is evaluated when the projection operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (``gcd_value``)

Review Comment:
   Ah not exactly. But it was added as a sub section to give clarity on what projection expression does and how it can be used with UDFs. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r977459728


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 

Review Comment:
   This paragraph doesn't look useful to me, just remove it?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.

Review Comment:
   ```suggestion
   PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
   and there will always be at least one input of type :class:`~pyarrow.Array`.
   The output should always be a :class:`~pyarrow.Array`.
   ```



##########
python/pyarrow/src/udf.h:
##########
@@ -41,6 +41,8 @@ struct ARROW_PYTHON_EXPORT ScalarUdfOptions {
   std::shared_ptr<DataType> output_type;
 };
 
+/// \brief A context defined to hold meta-data required in
+/// scalar UDF execution.

Review Comment:
   ```suggestion
   /// \brief A context passed as the first argument of scalar UDF functions.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 
+of scalar and array inputs to the UDF. Also, the final output is returned
+as a scalr or an array depending on the inputs. Based on the usage of any
+libraries inside the UDF, make sure it is generalized to support the passed
+input values and return suitable values.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function(""numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   9
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:

Review Comment:
   ```suggestion
   column using :meth:`Expression._call`.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 
+of scalar and array inputs to the UDF. Also, the final output is returned
+as a scalr or an array depending on the inputs. Based on the usage of any
+libraries inside the UDF, make sure it is generalized to support the passed
+input values and return suitable values.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function(""numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   9
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+
+Consider an instance where the data is in a table and you need to create a new 
+column using existing values in another column by using a mathematical formula.
+For instance, let's consider applying `gcd` math operation.
+Here, we will be re-using the registered `numpy_gcd` function.

Review Comment:
   ```suggestion
   Consider an instance where the data is in a table and we want to compute
   the GCD of one column with the scalar value 30.  We will be re-using the
   "numpy_gcd" user-defined function that was created above:
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 
+of scalar and array inputs to the UDF. Also, the final output is returned
+as a scalr or an array depending on the inputs. Based on the usage of any
+libraries inside the UDF, make sure it is generalized to support the passed
+input values and return suitable values.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function(""numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   9
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+
+Consider an instance where the data is in a table and you need to create a new 
+column using existing values in another column by using a mathematical formula.
+For instance, let's consider applying `gcd` math operation.
+Here, we will be re-using the registered `numpy_gcd` function.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that the `ds.field('')_call()` returns an expression. The passed arguments
+to this function call are expressions not scalar values 
+(i.e `pc.scalar(30), ds.field("value")`, notice the difference 
+of `pa.scalar` vs `pc.scalar`, the latter produces an expression). 
+This expression is evaluated when the project operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (`gcd_value`)

Review Comment:
   ```suggestion
   In the above example we used an expression to add a new column (``gcd_value``)
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 
+of scalar and array inputs to the UDF. Also, the final output is returned
+as a scalr or an array depending on the inputs. Based on the usage of any
+libraries inside the UDF, make sure it is generalized to support the passed
+input values and return suitable values.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function(""numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   9
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+
+Consider an instance where the data is in a table and you need to create a new 
+column using existing values in another column by using a mathematical formula.
+For instance, let's consider applying `gcd` math operation.
+Here, we will be re-using the registered `numpy_gcd` function.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that the `ds.field('')_call()` returns an expression. The passed arguments
+to this function call are expressions not scalar values 
+(i.e `pc.scalar(30), ds.field("value")`, notice the difference 
+of `pa.scalar` vs `pc.scalar`, the latter produces an expression). 
+This expression is evaluated when the project operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (`gcd_value`)
+to our table.  Adding new, dynamically computed, columns to a table is known as "projection"
+and there are limitations on what kinds of functions can be used in projection expressions.
+A projection function must emit a single output value for each input row.  That output value
+should be calculated entirely from the input row and should not depend on any other row.
+For example, the "numpy_gcd" function that we've been using as an example above is a valid
+function to use in a projection.  A "cumulative sum" function would not be a valid function
+since the result of each input rows depends on the rows that came before.  A "drop nulls"

Review Comment:
   ```suggestion
   since the result of each input row depends on the rows that came before.  A "drop nulls"
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 
+of scalar and array inputs to the UDF. Also, the final output is returned
+as a scalr or an array depending on the inputs. Based on the usage of any
+libraries inside the UDF, make sure it is generalized to support the passed
+input values and return suitable values.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function(""numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   9
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+
+Consider an instance where the data is in a table and you need to create a new 
+column using existing values in another column by using a mathematical formula.
+For instance, let's consider applying `gcd` math operation.
+Here, we will be re-using the registered `numpy_gcd` function.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that the `ds.field('')_call()` returns an expression. The passed arguments
+to this function call are expressions not scalar values 
+(i.e `pc.scalar(30), ds.field("value")`, notice the difference 
+of `pa.scalar` vs `pc.scalar`, the latter produces an expression). 
+This expression is evaluated when the project operator executes it.

Review Comment:
   ```suggestion
   Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
   The arguments passed to this function call are expressions, not scalar values 
   (notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
   the latter produces an expression). 
   This expression is evaluated when the projection operator executes it.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 
+of scalar and array inputs to the UDF. Also, the final output is returned
+as a scalr or an array depending on the inputs. Based on the usage of any
+libraries inside the UDF, make sure it is generalized to support the passed
+input values and return suitable values.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function(""numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   9

Review Comment:
   ```suggestion
      >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
      <pyarrow.Int64Scalar: 9>
      >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
      <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
      [
        27,
        3,
        1
      ]
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.

Review Comment:
   Perhaps apply the suggestion as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240633209

   > This is a better addition for the cookbook. Or should we add something like that here too?
   
   I was just brainstorming a bit. I think it is certainly fine to limit to a single example here (and indeed the others could be nice for a cookbook). But I personally do think the numpy `gcd` is a more compelling example than the affine one, since that last one can be easily expressed with composing existing arrow kernels.
   
   > > > Plus, we should probably address the `ds.field('')._call` issue before we worry too much about extensive documentation.
   > 
   > I understand your point. Should we hold this PR until we resove this issue?
   
   This only comes up in the part about working with datasets, I think? So the rest of the PR is certainly already useful. It _would_ be good to not document the private `_call`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r977730323


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion 

Review Comment:
   I am removing the section after `Note that there is ...`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r981230342


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name. 
+
+UDF support is limited to scalar functions. A scalar function is a function which
+executes elementwise operations on arrays or scalars. In general, the output of a
+scalar function do not depend on the order of values in the arguments. Note that 
+such functions have a rough correspondence to the functions used in SQL expressions.

Review Comment:
   ```suggestion
   UDF support is limited to scalar functions. A scalar function is a function which
   executes elementwise operations on arrays or scalars. In general, the output of a
   scalar function does not depend on the order of values in the arguments. Note that 
   such functions have a rough correspondence to the functions used in SQL expressions,
   or to NumPy `universal functions <https://numpy.org/doc/stable/reference/ufuncs.html>`_.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name. 
+
+UDF support is limited to scalar functions. A scalar function is a function which
+executes elementwise operations on arrays or scalars. In general, the output of a
+scalar function do not depend on the order of values in the arguments. Note that 
+such functions have a rough correspondence to the functions used in SQL expressions.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+       if isinstance(val, pa.Scalar):
+          return val.as_py()
+       else:
+          return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+       np_x = to_np(x)
+       np_y = to_np(y)
+       return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.

Review Comment:
   ```suggestion
   The implementation of a user-defined function always takes a first *context*
   parameter (named ``ctx`` in the example above) which is an instance of
   :class:`pyarrow.compute.ScalarUdfContext`.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927614538


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   Or `y=mx+b` simply.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192588674

   > Also, can you add API docs for `register_scalar_function` and `ScalarUdfContext`?
   
   Do you mean include it in docs for Python? I am not sure how to link it properly. But the text is there for both `ScalarUdfContext` and `register_scalar_function` where they are defined. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928045854


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   I assumed you asked me to show the all scalar scenario in an example rather than wording. That’s why I added one to showcase that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928048079


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa

Review Comment:
   I was merely thinking about a user who is going to just copy the codeblock and run. They are separate code blocks and if a user picks one in the middle it won’t run. Of course we can remove it. But this what I thought. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r932036702


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.

Review Comment:
   I added a section for this too. Could you please check it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1201280509

   Re: https://github.com/apache/arrow/pull/13687#discussion_r934591177
   
   @pitrou we can use the other `affine` function
   But it beats the purpose right?
   I thought we decided to elaborate the a single sentence from an example. May be I misunderstood your point. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r931761946


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+In case of all scalar inputs, make sure to return the final output as an array.
+
+More generally, UDFs can be used with tabular data by using `dataset` API and apply a UDF function on a
+dataset.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds

Review Comment:
   @westonpace I will modify the dataset and fit it with a proper scenario. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r929466232


##########
cpp/src/arrow/python/udf.h:
##########
@@ -41,6 +41,8 @@ struct ARROW_PYTHON_EXPORT ScalarUdfOptions {
   std::shared_ptr<DataType> output_type;
 };
 
+/// \brief A context defined to hold meta-data required in
+/// scalar UDF execution.

Review Comment:
   Any suggestions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1254824040

   @vibhatha Please be careful to update the submodules to match git master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1255134315

   @pitrou I updated the PR based on the new suggestions. Apologies about lengthy descriptions and the typos in the docs. Better to have another look at this 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r932036400


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+In case of all scalar inputs, make sure to return the final output as an array.
+
+More generally, UDFs can be used with tabular data by using `dataset` API and apply a UDF function on a
+dataset.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds

Review Comment:
   I added a section about this. Could you please check if it is accurate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1201304345

   > The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer.
   
   Yes @pitrou that's the most important part. But shouldn't we point out how it could be used with a third-party library function. Let's say the user has already written a custom function which is an old python script, it doesn't use PyArrow compute API. I thought about that angel, that's why added the note. Is this something we don't need to discuss now and may be leave it for a different venue to discuss? Please correct me if I am wrong. 
   
   cc @westonpace Any thoughts? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1262371723

   Benchmark runs are scheduled for baseline = 60c938333995f2ac7399085d2ae90d2f5f3e33cd and contender = 902781d1f3a41563a23d6755433a8e40ce82de7b. 902781d1f3a41563a23d6755433a8e40ce82de7b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/77d316b7c7ef450d8762f9a11660ab8e...fcf8b5dfb04d4abea9bd7e01a14c6bf0/)
   [Failed :arrow_down:0.82% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/fd0911beffec48969d984384bfb3a487...81295bb2b48942be9d585a617f0c9bd4/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/c3328d56546942c6990d3801d04e973a...f9c4bf7487ed40988eebac252c8129e7/)
   [Finished :arrow_down:0.18% :arrow_up:0.07%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/46b73d562ea349f6b9721c309f970638...4b6a0326ce66445288f6dd6dce54d6d3/)
   Buildkite builds:
   [Finished] [`902781d1` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1596)
   [Finished] [`902781d1` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1614)
   [Failed] [`902781d1` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1597)
   [Finished] [`902781d1` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1609)
   [Finished] [`60c93833` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1595)
   [Failed] [`60c93833` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1613)
   [Failed] [`60c93833` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1596)
   [Finished] [`60c93833` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1608)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r982662806


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()

Review Comment:
   OK, so it seems that we are _still_ passing scalars to the UDF in case of mixed scalar / array arguments, like you do in `pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])`. 
   
   So my understanding was wrong based on testing the case of two scalars: `pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(81)])`. That actually works by passing two length-1 arrays to the UDF. 
   (and for that reason I assumed we always pass scalars as length-1 arrays to the UDF implementation)
   
   Personally, I find this a bit confusing (but I am not familiar enough with the internals of the kernels to know if it is easy to always pass arrays to the UDF). And at least I would document that the arguments can be scalars, but never all of them will be scalar at the same time. 
   (since as I mentioned above, the current example UDF implementation doesn't work if it would be passed only scalars)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240634280

   > But I personally do think the numpy `gcd` is a more compelling example than the affine one, since that last one can be easily expressed with composing existing arrow kernels.
   
   I agree with that. We shouldn't encourage people to convert to Numpy if they can use native Arrow computations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927617142


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   "affine" perhaps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927806875


##########
cpp/src/arrow/python/udf.h:
##########
@@ -41,6 +41,8 @@ struct ARROW_PYTHON_EXPORT ScalarUdfOptions {
   std::shared_ptr<DataType> output_type;
 };
 
+/// \brief A context defined to hold meta-data required in
+/// scalar UDF execution.

Review Comment:
   It's fine to add this docstring, but by API documentation here we mean *Python* API (i.e. functions and classes exposed in the `pyarrow.compute` namespace).



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa

Review Comment:
   Why is this repeating the entire example? I'm afraid I don't understand the point of this.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+When all the inputs are scalar, the input is a size=1 array and the values has to be properly
+treated within the UDF. And also make sure to include the final output as a size=1 array.

Review Comment:
   Isn't this repeating what you already said in the note above?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+When all the inputs are scalar, the input is a size=1 array and the values has to be properly
+treated within the UDF. And also make sure to include the final output as a size=1 array.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.

Review Comment:
   Here the "More generally ..." sentence should come.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   Why is this taking `m[0]`? I don't understand what this example is supposed to show...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192313146

   https://issues.apache.org/jira/browse/ARROW-17181


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r978543900


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values 
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
+the latter produces an expression). 
+This expression is evaluated when the projection operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (``gcd_value``)

Review Comment:
   Yeah that make sense. Will update.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r978442938


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)

Review Comment:
   Small nitpick: can you use 4-space indentation in the python snippets?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()

Review Comment:
   Is this conversion to Scalar still needed? (with the latest refactor of Wes, scalars might now be handled by len-1 arrays?) 
   In any case, the `gcd_numpy` function itself won't work with scalars because of the `pa.array(..)` call in it:
   
   ```
   In [32]: gcd_numpy(None, pa.scalar(27), pa.scalar(63))
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   <ipython-input-32-5dc8dd5d05b1> in <module>
   ----> 1 gcd_numpy(None, pa.scalar(27), pa.scalar(63))
   
   <ipython-input-26-1579a8ef575a> in gcd_numpy(ctx, x, y)
        22    np_x = to_np(x)
        23    np_y = to_np(y)
   ---> 24    return pa.array(np.gcd(np_x, np_y))
        25 pc.register_scalar_function(gcd_numpy,
        26                            function_name,
   
   ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
   
   ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
   
   ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
   
   TypeError: 'numpy.int64' object is not iterable
   ```
   
   
   
   



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values 
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
+the latter produces an expression). 
+This expression is evaluated when the projection operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (``gcd_value``)

Review Comment:
   I think we should also mention somewhere (more to the beginning of the new section, I think), that currently the UDFs are limited to scalar functions (and then also explain what a scalar function is). 
   That will also make it easier to refer to that concept here to say that projections currently only support scalar functions.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.

Review Comment:
   This is probably related to my comment above, and could indeed explain why the function would work with scalars (if one of both is not an array, numpy will also return an array, and the `pa.array(..)` call will not fail)
   
   However, adding a print statement for the type of argument in `to_np` and running this example again, I see:
   
   ```
   In [3]: pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
   <class 'pyarrow.lib.Int64Array'>
   <class 'pyarrow.lib.Int64Array'>
   Out[3]: <pyarrow.Int64Scalar: 9>
   ```
   
   So it seems that both arguments are converted to an array. If that is guaranteed to always be the case now, the above paragraph is outdated.
   
   



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)

Review Comment:
   ```suggestion
      >>> data_table = pa.table({'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]})
   ```
   
   (a bit simpler to create the same table)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1256015089

   Related to the discussion above about callling UDFs with expressions (from https://github.com/apache/arrow/pull/13687#issuecomment-1240399112 above), I opened two JIRAs:
   
   * [[ARROW-17826] [Python] Allow scalars when creating expression from compute kernels](https://issues.apache.org/jira/browse/ARROW-17826 "[ARROW-17826] [Python] Allow scalars when creating expression from compute kernels - ASF JIRA")
   * [[ARROW-17827] [Python] Allow calling UDF kernels with field/scalar expressions](https://issues.apache.org/jira/browse/ARROW-17827 "[ARROW-17827] [Python] Allow calling UDF kernels with field/scalar expressions - ASF JIRA")
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240681191

   > I understand, but the affine function example isn't realistic either.
   
   I agree. Let me think of a better one. How about using `scikit-learn` to do a regression analysis on this dataset. May be to predict the prices give some factors. The current dataset is not sound, but it can be replaced. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r965650420


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,174 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar/array inputs 
+`m`, `x` and `c` using Numpy arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import numpy as np
+   >>> function_name = "affine_with_numpy"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Numpy",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def to_numpy(val):
+   ...     if isinstance(val, pa.Scalar):
+   ...         return val.as_py()
+   ...     else:
+   ...         return np.array(val)
+   ... 
+   >>> def affine_with_numpy(ctx, m, x, c):
+   ...     m = to_numpy(m)
+   ...     x = to_numpy(x)
+   ...     c = to_numpy(c)
+   ...     return pa.array(m * x + c)
+   ... 
+   >>> pc.register_scalar_function(affine_with_numpy,
+   ...                             function_name,
+   ...                             function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.array([10.2, 20.2]), pa.scalar(20.2)])
+   <pyarrow.lib.DoubleArray object at 0x10e38eb20>
+   [
+      123.22,
+      224.21999999999997
+   ]
+
+Note that there is a helper function `to_numpy` to handle the conversion of scalar an array inputs
+to the UDf. Also, the final output is returned as a scalr or an array depending on the inputs.

Review Comment:
   ```suggestion
   to the UDF. Also, the final output is returned as a scalar or an array depending on the inputs.
   ```



##########
docs/source/python/api/compute.rst:
##########
@@ -555,3 +555,12 @@ Compute Options
    TrimOptions
    VarianceOptions
    WeekOptions
+
+Custom Functions

Review Comment:
   ```suggestion
   User-Defined Functions
   ```
   
   (to keep the titles consistent)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240399112

   > Plus, we should probably address the `ds.field('')._call` issue before we worry too much about extensive documentation.
   
   The reason that this `_call` is currently private with a leading underscore, is because for the built-in compute functions, you can actually use the compute function itself and pass it a field expression instead of actual array:
   
   ```
   >>> import pyarrow.compute as pc
   
   # you can do
   >>> pc.field('a')._call("add", [pc.field("b")])
   <pyarrow.compute.Expression add(b)>
   # instead of
   >>> pc.Expression._call("add", [pc.field("a"), pc.field("b")])
   <pyarrow.compute.Expression add(a, b)>
   ```
   
   which was sufficient for the initial examples for dataset projections. 
   Now, this might have some limitations. It already seems this is currently limited to only expressions as arguments, so you can't mix with a scalar right now (as the current example would do):
   
   ```
   >>> pc.add(pc.field('a'), 1)
   ...
   TypeError: only other expressions allowed as arguments
   ```
   
   Now, that might be something we can fix (didn't again look into it at the moment, I suppose I added this limitation in the initial PR for simplicity)
   
   For UDFs, there is of course the additional limitation that this isn't available as a `pc.` function. For this use case, we should maybe allow `pc.call_function` to accept expressions as well? 
   So that you can do `pc.call_function("my_udf", [pc.field("a")])` instead of `pc.Expression.call("my_udf", [pc.field("a")])`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240740941

   @vibhatha I think you can also change the dataset that is used in that example (currently with "category" and "value" columns) to something for which gcd makes more sense (or just some dummy column names, that might be simple enough).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928101158


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   Yes, no issue there. But the user need to know that part and how to handle it if they are using other libraries within the UDFs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192435920

   cc @amol- @jorisvandenbossche 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927597609


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   Why cost_update? What does it have to do with costs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928048383


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa

Review Comment:
   I might want to add it to the one below too. But for non repetition, we can just import it once. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928061443


##########
cpp/src/arrow/python/udf.h:
##########
@@ -41,6 +41,8 @@ struct ARROW_PYTHON_EXPORT ScalarUdfOptions {
   std::shared_ptr<DataType> output_type;
 };
 
+/// \brief A context defined to hold meta-data required in
+/// scalar UDF execution.

Review Comment:
   Yes, I think all the docs are within the `_compute.pyx` file but not in the `compute.py` file because we just referencing it in the imports. Would that avoid the API docs being generated? This part I am not quite sure. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1201300100

   We can also remove this entire example (the one showing execution on all scalars), because I'm not sure how useful it actually is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

westonpace commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r951530500


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 
+
+For instance, when the passed values to a function are all scalars, internally
+each scalar is passed as an array of size 1.
+
+To elaborate on this, let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar inputs 
+`m`, `x` and `c` using python arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+Note that here the the final output is returned as an array. Depending the usage of vivid libraries
+inside the UDF, make sure it is generalized to support the passed input values and return suitable
+values. 
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+
+Consider an instance where the data is in a table and you need to create a new 
+column using existing values in a column by using a mathematical formula.
+For instance, let's consider a simple affine operation on values using the
+mathematical expression, `y = mx + c`. We will be re-using the registered `affine`
+function.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [10.21, 20.12, 45.32, 15.12]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(5.2), ds.field("value"), pc.scalar(2.1)]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'projected_value': ds.field('')._call("affine", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   total_amount_projected($): int64
+   total_amount($): int64
+   trip_name: string
+   ----
+   total_amount_projected($): [[52,102,227,77]]
+   total_amount($): [[10,20,45,15]]
+   trip_name: [["A","B","C","D"]]
+
+Here note that the `ds.field('')_call()` returns an expression. The passed arguments
+to this function call are expressions not scalar values 
+(i.e `pc.scalar(5.2), ds.field("value"), pc.scalar(2.1)`). This expression is evaluated
+when the project operator uses this expression.
+
+Support
+-------
+
+It is defined that the current support is only for scalar functions. 
+A scalar function (:class:`arrow::compute::ScalarFunction`) executes elementwise operations
+on arrays or scalars. Generally, the result of such an execution doesn't
+depend on the order of values.
+
+There is a limitation in the support to UDFs in the current API.
+For instance, with project node, if a UDF is used as the compute function,
+it expects the function to be a scalar function. Although, this doesn't stop the user
+registering a non-scalar function and using it in a programme. 
+But it could lead to unexpected behaviors or errors when it is applied in such occasions. 
+The current UDF support could enhance with the addition of more settings to the API (i.e aggregate UDFs).

Review Comment:
   ```suggestion
   Projection Expressions
   ^^^^^^^^^^^^^^^^^^^^^^
   
   In the above example we used an expression to add a new column (`total_amount_projected`)
   to our table.  Adding new, dynamically computed, columns to a table is known as "projection"
   and there are limitations on what kinds of functions can be used in projection expressions.
   
   A projection function must emit a single output value for each input row.  That output value
   should be calculated entirely from the input row and should not depend on any other row.
   For example, the "affine" function that we've been using as an example above is a valid
   function to use in a projection.  A "cumulative sum" function would not be a valid function
   since the result of each input rows depends on the rows that came before.  A "drop nulls"
   function would also be invalid because it doesn't emit a value for some rows.
   ```
   



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 

Review Comment:
   ```suggestion
   any combination of these types. It is important that the UDF author ensures
   the UDF can handle such combinations correctly. 
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 
+
+For instance, when the passed values to a function are all scalars, internally
+each scalar is passed as an array of size 1.
+
+To elaborate on this, let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar inputs 
+`m`, `x` and `c` using python arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+Note that here the the final output is returned as an array. Depending the usage of vivid libraries

Review Comment:
   I'm not sure what you are trying to say with the sentence that starts "Depending the..."



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   It is interesting, I think, that a person can define a UDF without using the Arrow compute functions at all, that is the most compelling point of the UDF feature in my mind since compositions of Arrow compute functions could already be done using expressions.
   
   However, it is not clear from the description that this is the purpose of this example (is it?).  It's also perhaps not the most motivating example since it can be expressed as an Arrow expression.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 
+
+For instance, when the passed values to a function are all scalars, internally
+each scalar is passed as an array of size 1.
+
+To elaborate on this, let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar inputs 
+`m`, `x` and `c` using python arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+Note that here the the final output is returned as an array. Depending the usage of vivid libraries
+inside the UDF, make sure it is generalized to support the passed input values and return suitable
+values. 
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+
+Consider an instance where the data is in a table and you need to create a new 
+column using existing values in a column by using a mathematical formula.
+For instance, let's consider a simple affine operation on values using the
+mathematical expression, `y = mx + c`. We will be re-using the registered `affine`
+function.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [10.21, 20.12, 45.32, 15.12]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(5.2), ds.field("value"), pc.scalar(2.1)]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'projected_value': ds.field('')._call("affine", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   total_amount_projected($): int64
+   total_amount($): int64
+   trip_name: string
+   ----
+   total_amount_projected($): [[52,102,227,77]]
+   total_amount($): [[10,20,45,15]]
+   trip_name: [["A","B","C","D"]]
+
+Here note that the `ds.field('')_call()` returns an expression. The passed arguments
+to this function call are expressions not scalar values 
+(i.e `pc.scalar(5.2), ds.field("value"), pc.scalar(2.1)`). This expression is evaluated
+when the project operator uses this expression.

Review Comment:
   You say "The passed arguments to this function call are expressions not scalar values".
   
   However, `pc.scalar(5.2)` and `pc.scalar(2.1)` look like scalar values.  I'm not sure a user will recognize the subtle difference between `pc.scalar(5.2)` and `pa.scalar(5.2)` without further explanation.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,

Review Comment:
   I think `any` would be better.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 
+
+For instance, when the passed values to a function are all scalars, internally
+each scalar is passed as an array of size 1.
+
+To elaborate on this, let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar inputs 
+`m`, `x` and `c` using python arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,

Review Comment:
   ```suggestion
      ...                             function_docs,
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 
+
+For instance, when the passed values to a function are all scalars, internally
+each scalar is passed as an array of size 1.
+
+To elaborate on this, let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar inputs 
+`m`, `x` and `c` using python arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+Note that here the the final output is returned as an array. Depending the usage of vivid libraries

Review Comment:
   Is it an array?  I see:
   
   ```
   <pyarrow.DoubleScalar: 123.22>
   ```
   
   It probably should be an array.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,
+the UDF is defined such that it can handle such combinations well. 
+
+For instance, when the passed values to a function are all scalars, internally
+each scalar is passed as an array of size 1.
+
+To elaborate on this, let's consider a scenario where we have a function
+which computes a scalar `y` value based on scalar inputs 
+`m`, `x` and `c` using python arithmetic operations.
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>

Review Comment:
   The function correctly handles the all-scalar case but it does not handle other cases.  Ideally, an example should demonstrate how to write a UDF that can handle all possible cases.  For example:
   
   ```
   print('10.0 * 5.0 + 1.0 should be 51.0')
   print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.scalar(5.0), pa.scalar(1.0)])}')
   
   print('[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]')
   print(f'Answer={pc.call_function(function_name, [pa.array([10.0, 10.0]), pa.array([5.0, 6.0]), pa.array([1.0, 1.0])])}')
   
   print('10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]')
   print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')
   ```
   
   Right now, the function as it is designed, gives me this output:
   
   ```
   10.0 * 5.0 + 1.0 should be 51.0
   Answer=51.0
   [10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]
   Answer=[
     51
   ]
   10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]
   Traceback (most recent call last):
     File "/home/pace/experiments/arrow-17181/repr.py", line 39, in <module>
       print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')
     File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
     File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/_compute.pyx", line 2506, in pyarrow._compute._scalar_udf_callback
     File "/home/pace/experiments/arrow-17181/repr.py", line 21, in affine_with_python
       m = m[0].as_py()
   TypeError: 'pyarrow.lib.DoubleScalar' object is not subscriptable
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240575947

   > > Plus, we should probably address the `ds.field('')._call` issue before we worry too much about extensive documentation.
   
   I understand your point. Should we hold this PR until we resove this issue?
   
   > 
   > The reason that this `_call` is currently private with a leading underscore, is because for the built-in compute functions, you can actually use the compute function itself and pass it a field expression instead of actual array:
   > 
   > ```
   > >>> import pyarrow.compute as pc
   > 
   > # you can do
   > >>> pc.field('a')._call("add", [pc.field("b")])
   > <pyarrow.compute.Expression add(b)>
   > # instead of
   > >>> pc.Expression._call("add", [pc.field("a"), pc.field("b")])
   > <pyarrow.compute.Expression add(a, b)>
   > ```
   > 
   > which was sufficient for the initial examples for dataset projections. Now, this might have some limitations. It already seems this is currently limited to only expressions as arguments, so you can't mix with a scalar right now (as the current example would do):
   > 
   > ```
   > >>> pc.add(pc.field('a'), 1)
   > ...
   > TypeError: only other expressions allowed as arguments
   > ```
   > 
   > Now, that might be something we can fix (didn't again look into it at the moment, I suppose I added this limitation in the initial PR for simplicity)
   > 
   > For UDFs, there is of course the additional limitation that this isn't available as a `pc.` function. For this use case, we should maybe allow `pc.call_function` to accept expressions as well? So that you can do `pc.call_function("my_udf", [pc.field("a")])` instead of `pc.Expression.call("my_udf", [pc.field("a")])`?
   
   
   
   > > I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data. For example, numpy. Here is an example that exposes numpy's gcd function (greatest common divisor) as an Arrow function
   > 
   > I think this would indeed be a more compelling example.
   > 
   > Another example could be a specific python functionality (eg something from `ipaddress`, to check or extract some information from strings that are supposed to be ipaddresses), although this will typically only work on scalars, and thus will be slow (but it's still an example how you can use this within arrow). Or another example could be a custom function implemented in numba.
   
   This is a better addition for the cookbook. Or should we add something like that here too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1248822506

   @pitrou @jorisvandenbossche @westonpace I updated the PR with `gcd` function and simplified the docs. Please take another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r977456827


##########
docs/source/python/api/compute.rst:
##########
@@ -555,3 +555,12 @@ Compute Options
    TrimOptions
    VarianceOptions
    WeekOptions
+
+Custom Functions

Review Comment:
   @vibhatha Looks like you forgot to apply the suggestion before resolving?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r979134031


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.

Review Comment:
   Ah nice catch on the description. Removing it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r977727773


##########
docs/source/python/api/compute.rst:
##########
@@ -555,3 +555,12 @@ Compute Options
    TrimOptions
    VarianceOptions
    WeekOptions
+
+Custom Functions

Review Comment:
   Oh sorry I missed this one. 



##########
docs/source/python/api/compute.rst:
##########
@@ -555,3 +555,12 @@ Compute Options
    TrimOptions
    VarianceOptions
    WeekOptions
+
+Custom Functions

Review Comment:
   Added the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240637976

   @pitrou and @jorisvandenbossche I propose the following. 
   
   1. Use `gcd` as the example for a basic demonstration of UDF's capability. 
   2. To show case the usage with a dataset, use the `affine` function.
   3. Avoid documenting `_call` in UDF docs
   4. Regarding exploring the `_call` start a JIRA thread and possibly work on a PR to improve things. 
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240662634

   > > * Use `gcd` as the example for a basic demonstration of UDF's capability.
   > > * To show case the usage with a dataset, use the `affine` function.
   > 
   > Why not use the same example for both?
   
   With the current dataset, does it make a meaningful output? Why do we need `gcd` for that dataset? This is the doubt. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240679391

   I understand, but the affine function example isn't realistic either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927617427


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   Sure, that's much better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

lidavidm commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192717089

   @vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1193009915

   > @vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)
   
   @lidavidm I have already included the docs in the Cython in the original PR, but that function is in compute.pyx and only referred as an import in the compute.py. I was not sure whether to mark the method in Cython underscore function and call it in the compute.py and add the docs there. That was the confusing part. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192313166

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927597039


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   Ah I think I started thinking about writing a regression and ended up with a simple equation. Let's rename it to something like `cost_update`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1259499030

   > Thanks for the update. This looks mostly good to me now.
   > 
   > @vibhatha Can you ensure the `testing` git submodule is not pushed back by this PR? See the Github diff.
   
   Oh that's a mistake. I will fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r934589966


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,165 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+Generalizing Usage
+------------------
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+vivid combinations of these types. It is important that the UDF author must make sure,

Review Comment:
   Is "vivid" the right word here? @westonpace What do you think?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   I'm sorry, but the example still makes no sense to me. Why not use the regular affine function here, instead of this scalar-specific definition?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927576724


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",

Review Comment:
   ```suggestion
         "summary": "Calculate y = mx + c",
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"

Review Comment:
   ```suggestion
         "description":
             "Compute the affine function y = mx + c.\n"
             "This function takes three inputs, m, x and c, in order."
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.

Review Comment:
   ```suggestion
   To register a UDF, a function name, function docs and input types and output type need to be defined.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   Why this name?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 

Review Comment:
   ```suggestion
   PyArrow allows defining and registering custom compute functions in Python.
   Those functions can then be called from Python as well as C++ (and potentially
   any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
   using their registered function name.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()

Review Comment:
   Hmm, really? A realistic example would take float64, not int64.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.

Review Comment:
   ```suggestion
   .. warning::
      This API is **experimental**.
      Also, only scalar functions can currently be user-defined.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.

Review Comment:
   ```suggestion
   The implementation of a user-defined function always takes a first *context*
   parameter (named ``ctx`` in the example above) which is an instance of
   :class:`pyarrow.compute.ScalarUdfContext`.
   This context exposes several useful attributes, particularly a
   :attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
   allocations in the context of the user-defined function.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,
+
+.. code-block:: python
+
+   >>> res = pc.call_function("regression", [pa.scalar(2), pa.scalar(10), pa.scalar(5)])
+   25
+
+.. warning::

Review Comment:
   Can you remove the warning and turn this into an example instead?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,
+
+.. code-block:: python
+
+   >>> res = pc.call_function("regression", [pa.scalar(2), pa.scalar(10), pa.scalar(5)])
+   25
+
+.. warning::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.
+
+.. code-block:: python
+
+   >>> sample_data = {'trip_name': ['A', 'B', 'C', 'D'], 'total_amount($)': [10, 20, 45, 15]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> import pyarrow.dataset as ds

Review Comment:
   Please move the import at the beginning of this snippet.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,

Review Comment:
   ```suggestion
   You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,
+
+.. code-block:: python
+
+   >>> res = pc.call_function("regression", [pa.scalar(2), pa.scalar(10), pa.scalar(5)])
+   25
+
+.. warning::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.
+
+.. code-block:: python
+
+   >>> sample_data = {'trip_name': ['A', 'B', 'C', 'D'], 'total_amount($)': [10, 20, 45, 15]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> import pyarrow.dataset as ds
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(5), ds.field("total_amount($)"), pc.scalar(2)]
+   >>> result_table = dataset.to_table(
+   ...             columns={
+   ...                 'total_amount_projected($)': ds.field('')._call(function_name, func_args),

Review Comment:
   What is `function_name` here? Perhaps it would be better to spell it explicitly.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 

Review Comment:
   It's not linear but affine, so should fix the name.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,
+
+.. code-block:: python
+
+   >>> res = pc.call_function("regression", [pa.scalar(2), pa.scalar(10), pa.scalar(5)])
+   25
+
+.. warning::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.

Review Comment:
   Perhaps rephrase it to make it less exceptional:
   ```suggestion
   More generally, user-defined functions are usable everywhere a compute function
   can be referred to by its name. For example, they can be called on a dataset's
   column using :meth:`Expression._call`:
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,
+
+.. code-block:: python
+
+   >>> res = pc.call_function("regression", [pa.scalar(2), pa.scalar(10), pa.scalar(5)])
+   25
+
+.. warning::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.
+
+.. code-block:: python
+
+   >>> sample_data = {'trip_name': ['A', 'B', 'C', 'D'], 'total_amount($)': [10, 20, 45, 15]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> import pyarrow.dataset as ds
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(5), ds.field("total_amount($)"), pc.scalar(2)]
+   >>> result_table = dataset.to_table(
+   ...             columns={
+   ...                 'total_amount_projected($)': ds.field('')._call(function_name, func_args),

Review Comment:
   Also @jorisvandenbossche do you remember why `Expression._call` has a leading underscore?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)

Review Comment:
   Can we perhaps use the adequate memory pool here? For example:
   ```suggestion
      def affine_calculation(ctx, m, x, c):
          temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
          return pc.add(temp, c, memory_pool=ctx.memory_pool)
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928046280


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   I referred to this: https://github.com/apache/arrow/pull/13687#discussion_r927582170



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r929466056


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   @pitrou should we keep this example or just keep a note/warning? I am inclined towards the example, though. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927614022


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   may be `project_price`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

westonpace commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1222730414

   I think part of the challenge with this documentation is that implementing `affine` in pure-python is not a very compelling use case.  I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data.  For example, numpy.  Here is an example that exposes numpy's `gcd` function (greatest common divisor) as an Arrow function:
   
   ```
   import numpy as np
   
   import pyarrow as pa
   import pyarrow.compute as pc
   
   function_name = "numpy_gcd"
   function_docs = {
          "summary": "Calculates the greatest common divisor",
          "description":
              "Given 'x' and 'y' find the greatest number that divides\n"
              "evenly into both x and y."
   }
   
   input_types = {
      "x" : pa.int64(),
      "y" : pa.int64()
   }
   
   output_type = pa.int64()
   
   def to_np(val):
       if isinstance(val, pa.Scalar):
           return val.as_py()
       else:
           return np.array(val)
   
   def gcd_numpy(ctx, x, y):
       np_x = to_np(x)
       np_y = to_np(y)
       return pa.array(np.gcd(np_x, np_y))
   
   pc.register_scalar_function(gcd_numpy,
                               function_name,
                               function_docs,
                               input_types,
                               output_type)
   
   print('gcd(27, 63) should be 9')
   print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.scalar(63)])}')
   print()
   print('gcd([27, 18], [54, 63]) should be [27, 9]')
   print(f'Answer={pc.call_function(function_name, [pa.array([27, 18]), pa.array([54, 63])])}')
   print()
   print('gcd(27, [54, 18]) should be [27, 9]')
   print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.array([54, 18])])}')
   ```
   
   Notice the use of the helper function `to_np` to convert from inputs of different shapes to ensure that we get something that numpy can work with.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240659913

   > * Use `gcd` as the example for a basic demonstration of UDF's capability.
   > 
   > * To show case the usage with a dataset, use the `affine` function.
   
   Why not use the same example for both?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1197607634

   @westonpace thanks for the detailed review. I will address these. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

westonpace commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r931518702


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.

Review Comment:
   ```suggestion
   To register a UDF, a function name, function docs, input types and output type need to be defined.
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.

Review Comment:
   ```suggestion
   ```
   
   First, we probably don't need to call this out.  Second, we don't do a great job of defining "scalar" anywhere.
   
   A "compute function" doesn't have to be a scalar function, not even a custom UDF.  Nothing prevents you today from calling `register_scalar_udf` and providing a non-scalar function.
   
   However, if that compute function is going to be used in an a project node, then it must be a scalar function.  We won't detect that it is non-scalar (I'm not sure this is possible).  Instead, if it is non-scalar, the user will just get unexpected output (or an error, hopefully, if they are returning a different sized output).
   
   When we add aggregate functions do we envision a new `register_aggregate_function`?  I suppose it would be different in that it would take multiple functions and not just a single function.
   
   Maybe we could add a section at the end defining what a scalar function is when we talk about using these functions in the datasets API?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,

Review Comment:
   ```suggestion
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+In case of all scalar inputs, make sure to return the final output as an array.
+
+More generally, UDFs can be used with tabular data by using `dataset` API and apply a UDF function on a
+dataset.
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds

Review Comment:
   This iexample is "ok" but it seems a little nonsensical that the user would want 5.2*total_amount + 2.1.
   
   We should also discuss that this is equivalent to what you can do without defining a UDF here:
   
   ```
   'total_amount_projected($)': ds.field('')._call("add", [2.1, ds.field('')._call("multiply", [5.2, ds.field("total_amount($)"]))),
   ```
   
   Plus, we should probably address the `ds.field('')._call` issue before we worry too much about extensive documentation.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   If we are going to keep this example then we should:
   
    * Change the above `..note::` so that it isn't a note but a dedicated subsection.
    * Move the paragraph starting with "More generally," to come after the example.
    * Add a paragraph motivating the example.
   
   However, I think we should expand this subsection so it isn't "Here is a wierd case that happens when all inputs are scalar" to "Your UDF should generally be capable of handling different combinations of scalar and array shaped inputs"



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,127 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])
+   ... 
+   >>> pc.register_scalar_function(affine_with_python,
+   ...                             function_name,
+   ...                         function_docs,
+   ...                             input_types,
+   ...                             output_type)
+   >>> 
+   >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
+   <pyarrow.DoubleScalar: 123.22>
+
+In case of all scalar inputs, make sure to return the final output as an array.
+
+More generally, UDFs can be used with tabular data by using `dataset` API and apply a UDF function on a
+dataset.

Review Comment:
   This paragraph should be replaced by the 'More generally" paragraph above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r979132814


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values 
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
+the latter produces an expression). 
+This expression is evaluated when the projection operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (``gcd_value``)

Review Comment:
   I added a section the limitations and scalar function definition. I merely rephrased what is the docs to be precise. Didn't want introduce anything extra.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1254824241

   @wjones127 Would you like to take a look at the wording here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1201299255

   The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1240745901

   @jorisvandenbossche and @pitrou thank you for the suggestions. I will update the PR accordingly. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1192306076

   cc @westonpace @lidavidm @pitrou


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927618739


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"
+   function_docs = {
+      "summary": "Calculate y based on m, x and c values",
+      "description": "Obtaining output of a linear scalar function"
+   }
+   input_types = {
+      "m" : pa.int64(),
+      "x" : pa.int64(),
+      "c" : pa.int64(),
+   }
+   output_type = pa.int64()
+
+   def linear_calculation(ctx, m, x, c):
+      return pc.add(pc.multiply(m, x), c)
+
+   pc.register_scalar_function(linear_calculation, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+.. note::
+   There is a default parameter, `ctx` which is refers to a context object and it should be the
+   first parameter of any user-defined-function. The idea is to make available passing required
+   meta-data across an application which would be important for UDFs.
+
+Calling a UDF directly using :func:`pyarrow.compute.call_function`,
+
+.. code-block:: python
+
+   >>> res = pc.call_function("regression", [pa.scalar(2), pa.scalar(10), pa.scalar(5)])
+   25
+
+.. warning::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+UDFs can be used with tabular data by using `dataset` API and apply a UDF function on the
+dataset.
+
+.. code-block:: python
+
+   >>> sample_data = {'trip_name': ['A', 'B', 'C', 'D'], 'total_amount($)': [10, 20, 45, 15]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> import pyarrow.dataset as ds
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(5), ds.field("total_amount($)"), pc.scalar(2)]
+   >>> result_table = dataset.to_table(
+   ...             columns={
+   ...                 'total_amount_projected($)': ds.field('')._call(function_name, func_args),

Review Comment:
   `function_name` replaced with "affine"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r927614538


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,80 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   User-defined functions only supports scalar functions and the current version is experimental.
+
+To use a user-defined-function (UDF), either the experimental `dataset` API options can be used or the
+function can be directly called using :func:`pyarrow.compute.call_function`. 
+
+To register a UDF, a function name, function docs and input types and output type needs to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "regression"

Review Comment:
   Or `y=mx+c` simply.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r928062590


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   DId I misinterpret your statement? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r932036997


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,129 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions in Python.
+Those functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package`)
+using their registered function name.
+
+To register a UDF, a function name, function docs and input types and output type need to be defined.
+
+.. code-block:: python
+
+   import pyarrow.compute as pc
+   function_name = "affine"
+   function_docs = {
+      "summary": "Calculate y = mx + c",
+      "description":
+          "Compute the affine function y = mx + c.\n"
+          "This function takes three inputs, m, x and c, in order."
+   }
+   input_types = {
+      "m" : pa.float64(),
+      "x" : pa.float64(),
+      "c" : pa.float64(),
+   }
+   output_type = pa.float64()
+
+   def affine(ctx, m, x, c):
+       temp = pc.multiply(m, x, memory_pool=ctx.memory_pool)
+       return pc.add(temp, c, memory_pool=ctx.memory_pool)
+
+   pc.register_scalar_function(affine, 
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("affine", [pa.scalar(2.5), pa.scalar(10.5), pa.scalar(5.5)])
+   <pyarrow.DoubleScalar: 31.75>
+
+.. note::
+   Note that when the passed values to a function are all scalars, internally each scalar 
+   is passed as an array of size 1.
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`:
+Considering a series of scalar inputs,
+
+.. code-block:: python
+
+   >>> import pyarrow as pa
+   >>> import pyarrow.compute as pc
+   >>> function_name = "affine_with_python"
+   >>> function_docs = {
+   ...        "summary": "Calculate y = mx + c with Python",
+   ...        "description":
+   ...            "Compute the affine function y = mx + c.\n"
+   ...            "This function takes three inputs, m, x and c, in order."
+   ... }
+   >>> input_types = {
+   ...    "m" : pa.float64(),
+   ...    "x" : pa.float64(),
+   ...    "c" : pa.float64(),
+   ... }
+   >>> output_type = pa.float64()
+   >>> 
+   >>> def affine_with_python(ctx, m, x, c):
+   ...     m = m[0].as_py()
+   ...     x = x[0].as_py()
+   ...     c = c[0].as_py()
+   ...     return pa.array([m * x + c])

Review Comment:
   I re-organized and included content. Please check it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1227941851

   @westonpace I updated the PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on PR #13687:
URL: https://github.com/apache/arrow/pull/13687#issuecomment-1255707177

   cc @jorisvandenbossche @pitrou another look may be 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou merged pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

pitrou merged PR #13687:
URL: https://github.com/apache/arrow/pull/13687


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r981332172


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()

Review Comment:
   @vibhatha I don't think this one is already resolved. I think you can actually remove the `if isinstance(val, pa.Scalar): return val.as_py()` part, since the input will never be a scalar?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vibhatha commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r982584769


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()

Review Comment:
   @jorisvandenbossche if we do that, we get the following error
   
   ```bash
   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
     File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/_compute.pyx", line 2506, in pyarrow._compute._scalar_udf_callback
     File "<stdin>", line 4, in gcd_numpy
     File "/Users/vibhatha/venv/pyarrow_dev/lib/python3.10/site-packages/numpy/core/_internal.py", line 790, in _gcd
       a, b = b, a % b
   TypeError: unsupported operand type(s) for %: 'pyarrow.lib.Int64Scalar' and 'int'
   ```
   
   I think the reason is, Numpy cannot identify the passed in Arrow scalar value. We need to take the python value of it or convert it to numpy. 
   
   The following is what would happen 
   
   ```bash
   >>> np.gcd(np.array(pa.scalar(27)), np.array([81, 12, 5]))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/vibhatha/venv/pyarrow_dev/lib/python3.10/site-packages/numpy/core/_internal.py", line 790, in _gcd
       a, b = b, a % b
   TypeError: unsupported operand type(s) for %: 'pyarrow.lib.Int64Scalar' and 'int'
   ```
   
   But it would work for 
   
   ```bash
   np.gcd(np.array(pa.array([27])), np.array([81, 12, 5]))
   array([27,  3,  1])
   ```
   
   but not for 
   
   ```bash
   >>> np.gcd(np.array(pa.scalar(27)), np.array([81, 12, 5]))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/vibhatha/venv/pyarrow_dev/lib/python3.10/site-packages/numpy/core/_internal.py", line 790, in _gcd
       a, b = b, a % b
   TypeError: unsupported operand type(s) for %: 'pyarrow.lib.Int64Scalar' and 'int'
   ```
   
   And again works for
   
   
   ```bash
   >>> np.gcd(np.array(27), np.array([81, 12, 5]))
   array([27,  3,  1])
   ```
   
   Am I wrong here or missing something?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wjones127 commented on a diff in pull request #13687: ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r977763363


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values 
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
+the latter produces an expression). 
+This expression is evaluated when the projection operator executes it.

Review Comment:
   I don't think this sentence says anything useful, at least in this example. There's no ambiguity on the Python side of when this would execute.
   
   ```suggestion
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's

Review Comment:
   ```suggestion
   can be referred to by its name. For example, they can be called on a dataset's
   ```



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
+   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')_call(...)`` returns a :func:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values 
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`,
+the latter produces an expression). 
+This expression is evaluated when the projection operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (``gcd_value``)

Review Comment:
   Is this paragraph describing the definition of a **scalar** function? Or is that different. Either way, I think we want to say that explicitly ("This is what it means to be a scalar function" / "These requirements are distinct from the definition of a scalar function".).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org