Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/30 11:33:07 UTC

[GitHub] [arrow-julia] bkamins opened a new issue, #353: Add an indexable variant of Arrow.Stream

bkamins opened a new issue, #353:
URL: https://github.com/apache/arrow-julia/issues/353

   In a distributed computing context it would be nice to have a vector variant of the `Arrow.Stream` iterator. The idea is to be able to split the processing of a single large Arrow file with multiple record batches across multiple worker processes. Looking at the source code, this should be possible to do in a relatively efficient way.
   
   @quinnj - what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-julia] baumgold commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by GitBox <gi...@apache.org>.

baumgold commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1298960939

   I don't think this is possible.  The Arrow file format is a series of FlatBuffer messages that are not indexed and therefore have to be iterated over.  More concretely, the `BatchIterator` doesn't support random access.



[GitHub] [arrow-julia] quinnj commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by GitBox <gi...@apache.org>.

quinnj commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1301188594

   Yeah, we could probably add support for this. Maybe with a `lazy::Bool=true` keyword argument; `lazy=false` would eagerly iterate messages and store the positions so they could be randomly accessed while `lazy=true` gives the current behavior where each iteration only consumes one message.
   
   I'm curious, though, because a fairly simple workflow that already works is:
   ```julia
   using Arrow, Distributed

   for record_batch in Arrow.Stream(...)
       # `@spawnat :any` is the recommended replacement for `Distributed.@spawn`
       Distributed.@spawnat :any begin
           # do stuff with record_batch
       end
   end
   ```
   What are the alternative workflows where that doesn't work for you?
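
A minimal sketch of the "eagerly iterate messages and store the positions" idea. It assumes a toy framing (a 4-byte little-endian length prefix followed by the payload), not the real Arrow IPC layout, and `write_frames` and `scan_positions` are hypothetical names rather than Arrow.jl functions. The point is only that one cheap pass over the length prefixes is enough to allow random access later.

```julia
# Toy framing (NOT the Arrow IPC layout): each frame is a 4-byte
# little-endian UInt32 payload length followed by that many payload bytes.

# Build a toy buffer from a list of string payloads.
function write_frames(payloads)
    io = IOBuffer()
    for p in payloads
        write(io, htol(UInt32(ncodeunits(p))))
        write(io, p)
    end
    return take!(io)
end

# Eager pass: record the byte range of every payload without interpreting it.
function scan_positions(buf::Vector{UInt8})
    positions = UnitRange{Int}[]
    pos = 1
    while pos + 3 <= length(buf)
        len = Int(ltoh(reinterpret(UInt32, buf[pos:pos+3])[1]))
        start = pos + 4
        stop = start + len - 1
        stop <= length(buf) || error("truncated frame at byte $pos")
        push!(positions, start:stop)
        pos = stop + 1
    end
    return positions
end

buf = write_frames(["alpha", "be", "gamma!"])
positions = scan_positions(buf)

# Random access: decode only the third frame.
third = String(buf[positions[3]])
```

With the positions stored, any frame can be handed to a worker by index, and the payload is only interpreted where it is needed.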



[GitHub] [arrow-julia] bkamins commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by GitBox <gi...@apache.org>.

bkamins commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1301194710

   What you propose works, but I thought that with this approach parallelism would not be achieved (i.e. that `Arrow.Stream` would parse the values before moving on to the next record batch). If it just skips ahead, then the issue can be closed.
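
The concern can be illustrated with a small sketch that does not use Arrow.jl at all. `decode` is a hypothetical stand-in for the per-batch parsing step, and threads are used instead of worker processes so that the example is self-contained. In the first pattern the decode step runs inside iteration, so only the follow-up work is spawned. In the second pattern iteration yields cheap handles (indices) and each task decodes its own batch.

```julia
# Hypothetical stand-in for an expensive per-batch decode step.
decode(raw::Vector{UInt8}) = sum(Int, raw)

raw_batches = [fill(UInt8(i), 1_000) for i in 1:4]

# Pattern 1: decoding happens while iterating, one batch at a time.
function process_after_decode(raw_batches)
    tasks = Task[]
    for batch in (decode(raw) for raw in raw_batches)  # decode runs here, serially
        t = Threads.@spawn batch * 2                   # only follow-up work is spawned
        push!(tasks, t)
    end
    return fetch.(tasks)
end

# Pattern 2: iterate over indices and decode inside each task.
function decode_in_tasks(raw_batches)
    tasks = map(eachindex(raw_batches)) do i
        Threads.@spawn decode(raw_batches[i]) * 2      # decode runs in the task
    end
    return fetch.(tasks)
end

serial_results = process_after_decode(raw_batches)
parallel_results = decode_in_tasks(raw_batches)
```

Both functions return the same values; they differ only in where the decode work runs, which is what an indexable variant would let the caller control.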



[GitHub] [arrow-julia] JoaoAparicio commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by "JoaoAparicio (via GitHub)" <gi...@apache.org>.

JoaoAparicio commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1493484757

   > My idea was that the constructor of such an indexable object could do the indexing you mention. I assume that the whole file would have to be scanned, but maybe it could be done in a cheap way, i.e. without having to read/interpret all the data stored in the file.
   
   I implemented this minus the indexing. Thoughts?



[GitHub] [arrow-julia] baumgold commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by GitBox <gi...@apache.org>.

baumgold commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1309333061

   > Ah, you're correct; we do all the message processing in the `Arrow.Stream` iterate method. Ok, yeah, we should provide an alternative here.
   
   This would be a great improvement, as it would also allow predicate pushdown at the RecordBatch level based on Message-level or Column-level metadata, thus opening up the ability to operate on a single RecordBatch without decompressing all RecordBatches in a file.  This is an important feature for me, so I'll try to spend some time building this without breaking too much.
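
A minimal sketch of predicate pushdown at the batch level, assuming each batch carries a small summary that can be read without decoding the batch itself. `BatchInfo`, `decode_batch`, and the `min_ts`/`max_ts` fields are hypothetical and chosen for illustration; they are not Arrow.jl types or metadata keys.

```julia
# Hypothetical per-batch summary; the field names are illustrative only.
struct BatchInfo
    index::Int
    min_ts::Int
    max_ts::Int
end

# Stand-in for the expensive step (decompressing and decoding one batch).
# `decoded` records which batches were actually touched.
const decoded = Int[]
decode_batch(i) = (push!(decoded, i); "batch $i")

# A batch can be skipped when its [min_ts, max_ts] range cannot overlap the query.
overlaps(info::BatchInfo, lo, hi) = info.max_ts >= lo && info.min_ts <= hi

infos = [BatchInfo(1, 0, 99), BatchInfo(2, 100, 199), BatchInfo(3, 200, 299)]

# Query for timestamps in 150:220; only batches 2 and 3 are decoded.
selected = [decode_batch(info.index) for info in infos if overlaps(info, 150, 220)]
```

The filter runs on the summaries alone, so batch 1 is never decoded.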



[GitHub] [arrow-julia] quinnj commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by GitBox <gi...@apache.org>.

quinnj commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1301196433

   Ah, you're correct; we do all the message processing in the `Arrow.Stream` iterate method. Ok, yeah, we should provide an alternative here.



[GitHub] [arrow-julia] bkamins commented on issue #353: Add an indexable variant of Arrow.Stream

Posted by GitBox <gi...@apache.org>.

bkamins commented on issue #353:
URL: https://github.com/apache/arrow-julia/issues/353#issuecomment-1298977063

   My idea was that the constructor of such an indexable object could do the indexing you mention. I assume that the whole file would have to be scanned, but maybe it could be done in a cheap way, i.e. without having to read/interpret all the data stored in the file.

