Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/02/03 16:12:04 UTC

[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

    [ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029070#comment-17029070 ] 

Joris Van den Bossche commented on ARROW-555:
---------------------------------------------

Do we already have a clear idea of how we want to approach this?
I think there has been some discussion about implementing custom C++ kernels (similar to the existing kernels in the compute module) versus finding a way to reuse the scalar kernels that are already implemented for Gandiva.

For reference, Gandiva already implements several string functions. Here is an illustration using the Python interface for the "upper" function:

{code:python}
import pyarrow as pa
from pyarrow import gandiva

table = pa.table({'a': ['a', 'b', 'c']})

# Build the expression tree for upper(a) -> res
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_upper = builder.make_function("upper", [node_a], pa.string())
field_result = pa.field('res', pa.string())
expr = builder.make_expression(node_upper, field_result)

# Compile the expression into a projector for this schema
projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())

>>> projector.evaluate(table.to_batches()[0])
[<pyarrow.lib.StringArray object at 0x7fc324f71580>
 [
   "A",
   "B",
   "C"
 ]]
{code}
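
For contrast with the Gandiva route above, a custom kernel would operate directly on the Arrow string layout (an int32 offsets buffer plus a contiguous UTF-8 data buffer). The following is only a rough pure-Python sketch of that idea, not how Arrow or Gandiva implement it: the helper names (encode_string_array, ascii_upper_kernel, decode_string_array) are hypothetical, the validity bitmap is ignored, and only ASCII letters are upper-cased so that byte lengths, and therefore the offsets, stay unchanged.

```python
import struct


def encode_string_array(values):
    """Pack Python strings into (offsets, data) buffers, Arrow-style.

    Hypothetical helper for illustration; nulls are not handled.
    """
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    # n + 1 little-endian int32 offsets for n strings
    return struct.pack(f"<{len(offsets)}i", *offsets), bytes(data)


def ascii_upper_kernel(offsets_buf, data_buf):
    """Upper-case ASCII letters in one pass over the data buffer.

    bytes.upper() only touches ASCII a-z, so multi-byte UTF-8 sequences
    are left as they are and the offsets buffer can be reused unchanged.
    """
    return offsets_buf, data_buf.upper()


def decode_string_array(offsets_buf, data_buf):
    """Unpack (offsets, data) buffers back into a list of Python strings."""
    count = len(offsets_buf) // 4
    offsets = struct.unpack(f"<{count}i", offsets_buf)
    return [
        data_buf[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]


offsets, data = encode_string_array(["a", "b", "c"])
result = decode_string_array(*ascii_upper_kernel(offsets, data))
print(result)
```

The point of the sketch is that such a kernel can work on the whole data buffer at once rather than value by value; a real implementation would also need to handle nulls and full Unicode case mapping, where the output length can differ from the input.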

> [C++] String algorithm library for StringArray/BinaryArray
> ----------------------------------------------------------
>
>                 Key: ARROW-555
>                 URL: https://issues.apache.org/jira/browse/ARROW-555
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory arranged in Arrow format. This will include using the re2 C++ regular expression library and other standard string manipulations (such as those found on Python's string objects)


