Posted to dev@arrow.apache.org by "Anthony Abate (Jira)" <ji...@apache.org> on 2019/10/09 14:14:00 UTC

[jira] [Created] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

Anthony Abate created ARROW-6830:
------------------------------------

             Summary: Question / Feature Request- Select Subset of Columns in read_arrow
                 Key: ARROW-6830
                 URL: https://issues.apache.org/jira/browse/ARROW-6830
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, R
            Reporter: Anthony Abate


*Note:*  Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 GB Arrow file with almost 1,000 columns. It contains 12,000 record batches with varying row counts.

1. Is it possible to use *read_arrow* to read only a subset of columns, similar to how *read_feather* has (col_select =... )?

2. Or is it possible to select columns using *RecordBatchFileReader*?

 

The only thing I seem to be able to do (please confirm whether this is my only option) is to loop over all record batches, select a single column at a time, and manually construct the data I need to pull out, i.e. something like the following:

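data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

# FOREACH BATCH (this assumes get_batch() takes a 0-based index):
for (i in seq_len(data_rbfr$num_record_batches) - 1L) {
  batch <- data_rbfr$get_batch(i)
  col4 <- batch$column(4)
  col5 <- batch$column(7)
}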
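
For reference, the batch-by-batch approach described above can be sketched end to end as follows. This is only a minimal illustration under assumptions, not a confirmed answer to the question: it assumes a recent release of the arrow R package in which `ReadableFile$create()`, `RecordBatchFileReader$create()`, and `[` subsetting on a RecordBatch are available, which may differ from the version used in this report. The temporary file, the column names (a, b, c), and the `wanted` vector are hypothetical and exist only to make the example self-contained.

```r
library(arrow)

# Write a small two-batch Arrow file so the sketch can run on its own.
# (Hypothetical data; a real file would already exist on disk.)
path <- tempfile(fileext = ".arrow")
b1 <- record_batch(a = 1:3, b = c("x", "y", "z"), c = c(0.1, 0.2, 0.3))
b2 <- record_batch(a = 4:5, b = c("u", "v"), c = c(0.4, 0.5))
sink <- FileOutputStream$create(path)
writer <- RecordBatchFileWriter$create(sink, b1$schema)
writer$write(b1)
writer$write(b2)
writer$close()
sink$close()

# Read the file back one record batch at a time, keeping only some columns.
wanted <- c("a", "c")  # hypothetical subset of column names
input <- ReadableFile$create(path)
reader <- RecordBatchFileReader$create(input)

pieces <- vector("list", reader$num_record_batches)
for (i in seq_len(reader$num_record_batches)) {
  batch <- reader$get_batch(i - 1L)  # get_batch() is assumed to be 0-based
  # Subset the batch to the wanted columns before converting to R,
  # so the unwanted columns are never converted to R vectors.
  pieces[[i]] <- as.data.frame(batch[, wanted])
}
input$close()

# Stack the per-batch pieces into a single data frame.
result <- do.call(rbind, pieces)
```

Subsetting each batch before calling `as.data.frame()` avoids converting the unwanted columns to R vectors. Whether it also avoids reading them from disk depends on the reader implementation, which is the open question in this issue.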

 


