
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/12/11 21:59:31 UTC

[I] Parallel Arrow file format reading [arrow-datafusion]

alamb opened a new issue, #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503

   ### Is your feature request related to a problem or challenge?
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #8503: Parallel Arrow file format reading
URL: https://github.com/apache/arrow-datafusion/issues/8503



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1864089351

   I'd like to have a try.



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1869606136

   > Perhaps we could do the same for arrow files which could use the first byte of the RecordBatches 🤔
   > 
   
   There may be several RecordBatches (blocks in arrow-rs) in an Arrow file (I didn't notice this before). We can handle them like row groups in Parquet. 
   
   I will check whether DICTIONARY can be handled correctly, since there may be a Delta DICTIONARY.
   
   Thanks.



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1872525360

   I will complete this after the next release of arrow-rs.



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1850970641

   See also https://github.com/apache/arrow-datafusion/issues/8504



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1871261259

   > > Delta DICTIONARY.
   > 
   > Delta and replacement dictionaries are only supported by IPC streams, not files
   
   Got it! Thanks.



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1872949667

   The next release is tracked by https://github.com/apache/arrow-rs/issues/5234



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1868508443

   > But I am wondering whether we can split the scan process into several parts and rebuild the whole batch, since there may be more than one array in the file.
   
   This sounds like a good idea to me in theory -- I am not sure how easy/hard it would be to do with the existing arrow IPC reader
   
   In general, the strategy for parallelizing Parquet and CSV is to split up the file by byte ranges, and then have each of the `ArrowFileReader` partitions read the row groups (or CSV lines) that have their first byte within their assigned range
   
   Perhaps we could do the same for arrow files which could use the first byte of the RecordBatches 🤔 
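   
   A minimal sketch of that range-based assignment, in Rust. It assumes the byte offset of each RecordBatch block is already known (for example, from the file footer). `byte_ranges`, `assign_blocks`, and the offsets below are hypothetical names and values for illustration, not part of arrow-rs or DataFusion:
   
   ```rust
   use std::ops::Range;
   
   /// Split `[0, file_len)` into `n` contiguous byte ranges of near-equal size.
   fn byte_ranges(file_len: u64, n: u64) -> Vec<Range<u64>> {
       assert!(n > 0, "need at least one partition");
       (0..n)
           .map(|i| (i * file_len / n)..((i + 1) * file_len / n))
           .collect()
   }
   
   /// For each partition range, collect the indices of the blocks whose
   /// first byte falls inside that range.
   fn assign_blocks(block_offsets: &[u64], ranges: &[Range<u64>]) -> Vec<Vec<usize>> {
       ranges
           .iter()
           .map(|range| {
               block_offsets
                   .iter()
                   .enumerate()
                   .filter(|(_, offset)| range.contains(*offset))
                   .map(|(idx, _)| idx)
                   .collect()
           })
           .collect()
   }
   
   fn main() {
       // Hypothetical offsets of four RecordBatch blocks in a 1000-byte file.
       let offsets: [u64; 4] = [8, 260, 510, 900];
       let ranges = byte_ranges(1000, 3);
       for (range, blocks) in ranges.iter().zip(assign_blocks(&offsets, &ranges)) {
           println!("bytes {:?} -> blocks {:?}", range, blocks);
       }
   }
   ```
   
   Since the ranges are contiguous and non-overlapping, each block lands in exactly one partition, so no RecordBatch would be read twice or skipped. How the offsets are obtained from the IPC reader is left out of this sketch.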
   
   



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1868475220

   I read the related PRs about Parquet and CSV.
   The Parquet parallel scan is based on row groups and the CSV one is based on lines. Both can be split by row and then output RecordBatches using a certain method.
   I don't think Arrow can be handled like that, since an Arrow file is purely column-based. 
   But I am wondering whether we can split the scan process into several parts and rebuild the whole batch, since there may be more than one array in the file.
   ![image](https://github.com/apache/arrow-datafusion/assets/48236141/8e7a8b19-f302-4678-96a8-d2d3af5f4c56)
   
   Merry Christmas!



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1871141572

   https://github.com/apache/arrow-rs/pull/5249 adds a lower-level reader that should enable this and other use-cases



Re: [I] Parallel Arrow file format reading [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #8503:
URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1871046625

   > I will check whether DICTIONARY can be handled correctly, since there may be a Delta DICTIONARY.
   
   It seems that delta dictionary batches are not supported yet.
   
   And I think a public function that provides the block offsets is needed upstream, like:
   
   ```rust
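   // Proposed addition to arrow-rs (sketch); relies on the reader's private `blocks` field.
   impl<R: Read + Seek> FileReader<R> {
       /// Expose the record batch blocks recorded in the file footer.
       pub fn blocks(&self) -> &[Block] {
           &self.blocks
       }
       // OR (alternative to the above; only one `blocks` would be defined)
       /// Expose only the byte offset of each block.
       pub fn blocks(&self) -> Vec<i64> {
           self.blocks.iter().map(Block::offset).collect()
       }
   }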
   ```

