Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/28 07:55:16 UTC

[GitHub] [arrow-rs] Ted-Jiang opened a new issue, #1955: Support multi diskRanges for ChunkReader

Ted-Jiang opened a new issue, #1955:
URL: https://github.com/apache/arrow-rs/issues/1955

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Related to #1775.
   While implementing page index skipping in #1792, I found:
   ```
   /// The ChunkReader trait generates readers of chunks of a source.
   /// For a file system reader, each chunk might contain a clone of File bounded on a given range.
   /// For an object store reader, each read can be mapped to a range request.
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// Get a serially readable slice of the current reader.
       /// This should fail if the slice exceeds the current bounds.
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```
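   
   For reference, a minimal sketch of what implementing this trait over an in-memory buffer could look like (the module paths and error variant are assumptions and may differ between arrow-rs versions):
   
   ```
   use std::io::Cursor;
   
   use parquet::errors::{ParquetError, Result};
   use parquet::file::reader::{ChunkReader, Length};
   
   /// Sketch: an in-memory ChunkReader, to illustrate the contract.
   struct InMemory(Vec<u8>);
   
   impl Length for InMemory {
       fn len(&self) -> u64 {
           self.0.len() as u64
       }
   }
   
   impl ChunkReader for InMemory {
       type T = Cursor<Vec<u8>>;
   
       /// Each call returns an independent reader over one contiguous range.
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
           let end = start as usize + length;
           if end > self.0.len() {
               return Err(ParquetError::EOF("slice exceeds bounds".into()));
           }
           Ok(Cursor::new(self.0[start as usize..end].to_vec()))
       }
   }
   ```
   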
   The trait assumes reading the whole column chunk as one contiguous byte range, but consider a page layout like this:
   
   ```
        * rows   col1   col2   col3
        *      ┌──────┬──────┬──────┐
        *   0  │  p0  │      │      │
        *      ╞══════╡  p0  │  p0  │
        *  20  │ p1(X)│------│------│
        *      ╞══════╪══════╡      │
        *  40  │ p2   │      │------│
        *      ╞══════╡ p1(X)╞══════╡
        *  60  │ p3(X)│      │------│
        *      ╞══════╪══════╡      │
        *  80  │  p4  │      │  p1  │
        *      ╞══════╡  p2  │      │
        * 100  │  p5  │      │      │
        *      └──────┴──────┴──────┘
   ```
   
   To read `col1` pages p1 and p3 we need to skip the other pages, so we would have to pass two (start, length) offsets.
   
   **Describe the solution you'd like**
   Pass multiple starts and lengths:
   ```
 fn get_read(&self, start: Vec<u64>, length: Vec<usize>) -> Result<Self::T>;
   ```
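   
   For illustration, a sketch of how such a multi-range read could be served by a single `Read` that walks the requested ranges in order and skips the bytes in between (`MultiRangeReader` is a hypothetical name, not part of the crate):
   
   ```
   use std::fs::File;
   use std::io::{self, Read, Seek, SeekFrom};
   
   /// Hypothetical reader that serves several (start, length) ranges of a
   /// file as one contiguous stream, skipping the bytes in between.
   struct MultiRangeReader {
       file: File,
       /// Remaining ranges, stored reversed so `pop` yields ascending order.
       ranges: Vec<(u64, usize)>,
       /// Bytes left in the range currently being read.
       remaining: usize,
   }
   
   impl MultiRangeReader {
       fn new(file: File, mut ranges: Vec<(u64, usize)>) -> Self {
           ranges.reverse();
           Self { file, ranges, remaining: 0 }
       }
   }
   
   impl Read for MultiRangeReader {
       fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
           // Seek to the next range once the current one is exhausted.
           while self.remaining == 0 {
               match self.ranges.pop() {
                   Some((start, length)) => {
                       self.file.seek(SeekFrom::Start(start))?;
                       self.remaining = length;
                   }
                   None => return Ok(0), // all ranges consumed
               }
           }
           let n = buf.len().min(self.remaining);
           let read = self.file.read(&mut buf[..n])?;
           self.remaining -= read;
           Ok(read)
       }
   }
   ```
   
   A multi-range `get_read` could then hand back such a reader instead of one bounded on a single range.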
   

[GitHub] [arrow-rs] Ted-Jiang commented on issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1955:
URL: https://github.com/apache/arrow-rs/issues/1955#issuecomment-1168368741

   @tustvold @alamb PTAL (please take a look)


[GitHub] [arrow-rs] Ted-Jiang closed issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
Ted-Jiang closed issue #1955: Support multi diskRanges for ChunkReader
URL: https://github.com/apache/arrow-rs/issues/1955


[GitHub] [arrow-rs] Ted-Jiang commented on issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1955:
URL: https://github.com/apache/arrow-rs/issues/1955#issuecomment-1168478294

   Wow, wonderful work! Things are changing with each passing day 😂 I will catch up 😊


[GitHub] [arrow-rs] Ted-Jiang commented on issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1955:
URL: https://github.com/apache/arrow-rs/issues/1955#issuecomment-1168459574

   It seems that using the IOx ObjectStore will only support an async reader?
   
   Could you show me a code example of how IOx integrates with arrow-rs?


[GitHub] [arrow-rs] tustvold commented on issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1955:
URL: https://github.com/apache/arrow-rs/issues/1955#issuecomment-1168412426

   Why not just call `get_read` for each page instead of for the entire column chunk? There is no requirement for `get_read` to delimit column chunks; after all, the same trait is used to read the footer, etc.
   
   Somewhat related, but something to keep in mind is how this will all work with `ParquetRecordBatchStream`. This does not make use of `ChunkReader`, and is instead push-based, needing to know the ranges to fetch up-front. It should just be a case of making `InMemoryColumnChunk` sparse and teaching `InMemoryColumnChunkReader` to read it correctly, but it is probably worth thinking about how this will work.
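   
   For example, with the page locations from the page index in hand, that amounts to one `get_read` call per selected page (a sketch; the `(offset, size)` pairs stand in for whatever the page index provides):
   
   ```
   use parquet::errors::Result;
   use parquet::file::reader::ChunkReader;
   
   /// Sketch: read only the selected pages of a column chunk by issuing
   /// one `get_read` per page rather than one for the whole chunk.
   /// `pages` holds (offset, compressed_size) pairs from the page index.
   fn read_selected_pages<R: ChunkReader>(
       reader: &R,
       pages: &[(u64, usize)],
   ) -> Result<Vec<R::T>> {
       pages
           .iter()
           .map(|&(offset, size)| reader.get_read(offset, size))
           .collect()
   }
   ```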


[GitHub] [arrow-rs] tustvold commented on issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1955:
URL: https://github.com/apache/arrow-rs/issues/1955#issuecomment-1168467248

   > Is there any need to support this in `ParquetRecordBatchReader` as well, or do they reuse a lot of logic between them (supporting one almost supports both)?
   
   They reuse a lot of logic; however, the logic that differs concerns the IO for fetching pages, so support for this would need to be added explicitly.
   
   > Could you show me a code example of how IOx integrates with arrow-rs?
   
   Currently IOx fetches the entire file into memory and does not perform IO against object storage directly. This was partly driven by the limited support for more sophisticated predicate pushdown, and by the fact that IO was not a dominating factor for our query workloads.
   
   That being said, https://github.com/apache/arrow-datafusion/pull/2677 switches DataFusion to using the async interface directly, and https://github.com/apache/arrow-datafusion/issues/2504 has more about how I envisage this fitting with the rayon-based scheduler longer-term. 
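   
   For reference, a minimal sketch of that fetch-everything approach, assuming the whole object has already been downloaded into `Bytes` (which implements `ChunkReader` in recent arrow-rs releases, though treat that as an assumption for older versions):
   
   ```
   use bytes::Bytes;
   use parquet::errors::Result;
   use parquet::file::reader::{FileReader, SerializedFileReader};
   
   /// Sketch of the fetch-everything approach: the caller has already
   /// downloaded the whole parquet object into `data`, so all subsequent
   /// page reads are served from memory rather than from object storage.
   fn read_all_in_memory(data: Bytes) -> Result<()> {
       let reader = SerializedFileReader::new(data)?;
       println!("row groups: {}", reader.metadata().num_row_groups());
       Ok(())
   }
   ```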


[GitHub] [arrow-rs] Ted-Jiang commented on issue #1955: Support multi diskRanges for ChunkReader

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1955:
URL: https://github.com/apache/arrow-rs/issues/1955#issuecomment-1168447364

   > Why not just call `get_read` for each page instead of for the entire column chunk? There is no requirement for `get_read` to delimit column chunks; after all, the same trait is used to read the footer, etc.
   
   Makes sense.
   
   > how this will all work with `ParquetRecordBatchStream`
   
   😂 For now I have only checked the page filter in `ParquetRecordBatchReader`. For `AsyncFileReader` I haven't checked the code yet (because I found DataFusion currently uses the iterator mode).
   
   Is there any need to support this in `ParquetRecordBatchReader` as well, or do they reuse a lot of logic between them (supporting one almost supports both)?
   
   @tustvold What does the expert think? 😊


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org