Posted to issues@arrow.apache.org by "Jacob Baumbach (Jira)" <ji...@apache.org> on 2021/04/08 05:22:00 UTC

[jira] [Created] (ARROW-12293) [Rust][DataFusion] Word Count

Jacob Baumbach created ARROW-12293:
--------------------------------------

             Summary: [Rust][DataFusion] Word Count
                 Key: ARROW-12293
                 URL: https://issues.apache.org/jira/browse/ARROW-12293
             Project: Apache Arrow
          Issue Type: Wish
          Components: Rust - DataFusion
            Reporter: Jacob Baumbach


I am learning DataFusion and tried to implement word count, the canonical big-data version of hello world, with it.  I have been unsuccessful, and I am wondering whether word count is even possible with DataFusion at the moment.

Typically, word count involves a flat_map that splits each string on the whitespace it contains.
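
To make the operation concrete, here is a minimal sketch in plain Rust using only the standard library. It does not use DataFusion or Arrow at all, and `word_count` is a name made up for illustration:

```rust
use std::collections::HashMap;

/// Counts whitespace-separated words across all input lines.
/// (Hypothetical helper, standard library only.)
fn word_count(lines: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    // flat_map turns each line into zero or more words and
    // concatenates the results into a single stream of words.
    for word in lines.iter().flat_map(|line| line.split_whitespace()) {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let lines = ["hello world", "hello data fusion"];
    let counts = word_count(&lines);
    println!("hello -> {}", counts["hello"]);
}
```

The open question is how to express the same split-then-flatten step over a DataFusion column rather than a Rust iterator.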

There are two issues I am running into:

1) Creating a UDF that goes from &str -> Vec<&str>.  I cannot find an `arrow::array` type that maps to a collection of strings, which prevents me from creating a UDF that performs the split.

2) Assuming I could get `1` to work, I am not aware of a method similar to flat_map that can be applied to a column.  In SQL, I believe this is called `explode`. I can't find it in the codebase, which makes me think flat_map-style operations aren't possible.
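
For clarity, the `explode`-style behavior I have in mind for item 2 can be sketched in plain Rust. This is only an illustration of the semantics, under the assumption that each row holds a list of strings. It uses no DataFusion or Arrow API, and the `explode` function here is hypothetical:

```rust
/// Flattens a column of lists into (source_row_index, value) pairs,
/// producing one output row per list element.
/// (Hypothetical helper; not a DataFusion or Arrow function.)
fn explode(column: &[Vec<String>]) -> Vec<(usize, String)> {
    column
        .iter()
        .enumerate()
        // Each input row expands to as many output rows as it has items;
        // rows with an empty list produce no output rows.
        .flat_map(|(row, items)| items.iter().map(move |item| (row, item.clone())))
        .collect()
}

fn main() {
    let column = vec![
        vec!["hello".to_string(), "world".to_string()],
        vec!["datafusion".to_string()],
    ];
    for (row, word) in explode(&column) {
        println!("row {} -> {}", row, word);
    }
}
```

Keeping the source row index is a design choice in this sketch, so that other columns of the original row could be joined back to each exploded value.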

My questions are:

Is word count currently possible in DataFusion?  If so, how can you perform the split, and how can you perform a flat_map?  If word count cannot be done, what would need to be implemented to make it possible?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)