Posted to jira@arrow.apache.org by "Vibhatha Lakmal Abeykoon (Jira)" <ji...@apache.org> on 2022/10/25 07:09:00 UTC

[jira] [Assigned] (ARROW-15635) [C++][Python] UDF Integration

     [ https://issues.apache.org/jira/browse/ARROW-15635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vibhatha Lakmal Abeykoon reassigned ARROW-15635:
------------------------------------------------

    Assignee: Vibhatha Lakmal Abeykoon

> [C++][Python] UDF Integration 
> ------------------------------
>
>                 Key: ARROW-15635
>                 URL: https://issues.apache.org/jira/browse/ARROW-15635
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++, Python
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The objective is to list the tasks required to provide UDF support for the Apache Arrow streaming execution engine. The first iteration will focus on Python-based UDFs, that is, UDFs implemented as Python functions.
> The UDF integration will proceed as a series of sub-tasks covering development and PoCs. Note that this is the first iteration of UDF integration and has a limited scope. This ticket covers the following topics:
>  # POC for UDF integration: The objective is to evaluate the existing components in the source and identify the modifications and new building blocks required to integrate UDFs.
>  # The language scope is limited to C++/Python: users can register a Python function as a UDF and use it with an `apply` method on Arrow Tables, or expose it as a computation API endpoint via the arrow::compute API. Note that the C++ API already provides a way to register custom functions via the function registry API. At the moment this is not exposed to Python.
>  # Planned features for this ticket are:
>  ## Scalar UDFs : UDFs executed per value (per row)
>  ## Vector UDFs : UDFs executed per batch (a full array or partial array)
>  ## Aggregate UDFs : UDFs associated with an aggregation operation
>  # Integration limitations
>  ## Custom data types that are not supported by NumPy or Pandas are not supported
>  ## Complex processing with parallelism within UDFs is not supported
>  ## Parallel UDFs are not supported in the initial version of UDFs. However, we are documenting what is required, along with a rough sketch for the next phase.
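
To make the three planned UDF kinds concrete, here is a minimal pure-Python sketch of how a function registry might dispatch them. It is not the Arrow API: `register_udf`, `call_udf`, and the `_REGISTRY` dict are hypothetical names invented for illustration, and a plain Python list stands in for an Arrow array or batch.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry (not part of Arrow): UDF name -> (kind, function).
_REGISTRY: Dict[str, Tuple[str, Callable]] = {}


def register_udf(name: str, kind: str, func: Callable) -> None:
    """Register a Python callable under a name, tagged with its UDF kind."""
    if kind not in ("scalar", "vector", "aggregate"):
        raise ValueError(f"unknown UDF kind: {kind}")
    if name in _REGISTRY:
        raise KeyError(f"UDF already registered: {name}")
    _REGISTRY[name] = (kind, func)


def call_udf(name: str, batch: List[float]):
    """Invoke a registered UDF on one batch (a list standing in for an array)."""
    kind, func = _REGISTRY[name]
    if kind == "scalar":
        # Scalar UDF: executed once per value (per row).
        return [func(value) for value in batch]
    # Vector and aggregate UDFs both receive the whole batch at once.
    result = func(batch)
    if kind == "vector" and len(result) != len(batch):
        raise ValueError("a vector UDF must return one output per input value")
    return result


# Scalar: the function sees a single value.
register_udf("add_one", "scalar", lambda value: value + 1)
# Vector: the function sees the full batch and returns a same-length batch.
register_udf("center", "vector",
             lambda batch: [v - sum(batch) / len(batch) for v in batch])
# Aggregate: the function reduces the batch to a single value.
register_udf("total", "aggregate", sum)
```

In this sketch, `call_udf("add_one", [1, 2, 3])` applies the function row by row, `call_udf("center", [1, 2, 3])` needs the whole batch to compute the mean, and `call_udf("total", [1, 2, 3])` collapses the batch to one value. A real integration would operate on Arrow arrays and go through the C++ function registry mentioned above rather than a Python dict.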



--
This message was sent by Atlassian Jira
(v8.20.10#820010)