Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/28 12:17:41 UTC

[GitHub] [arrow-rs] Cheappie opened a new issue, #1626: Expose ArrowWriter row group flush in public API

Cheappie opened a new issue, #1626:
URL: https://github.com/apache/arrow-rs/issues/1626

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   From what I have read, predicate pushdown filtering in Parquet works at the row-group level, so in my case I should be able to optimize reads by manually closing row groups.
   
   **Describe the solution you'd like**
   Simply expose a flush_row_group method in the ArrowWriter API that flushes all buffered rows.
   
   **Describe alternatives you've considered**
   I have considered using SerializedFileWriter; however, because I do not fully understand definition and repetition levels, I would prefer to use a high-level API like ArrowWriter.
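
To make the request concrete, here is a minimal, std-only sketch of the behaviour being asked for. It is a toy model, not the parquet crate's ArrowWriter: `ToyWriter`, its fields, and `close` are hypothetical names invented for illustration, and `flush_row_group` is simply the method name proposed above. Rows are buffered until either a size limit is reached or the caller flushes explicitly.

```rust
/// Toy stand-in for a writer that buffers rows into row groups.
/// This is NOT the parquet crate's ArrowWriter; all names are hypothetical.
struct ToyWriter {
    /// Automatic flush threshold, in rows.
    max_row_group_size: usize,
    /// Rows written since the last flush.
    buffered: Vec<i64>,
    /// Row groups that have already been closed.
    row_groups: Vec<Vec<i64>>,
}

impl ToyWriter {
    fn new(max_row_group_size: usize) -> Self {
        assert!(max_row_group_size > 0, "row group size must be positive");
        ToyWriter {
            max_row_group_size,
            buffered: Vec::new(),
            row_groups: Vec::new(),
        }
    }

    /// Buffer rows, closing a row group automatically whenever the
    /// buffer reaches `max_row_group_size`.
    fn write(&mut self, rows: &[i64]) {
        for &row in rows {
            self.buffered.push(row);
            if self.buffered.len() >= self.max_row_group_size {
                self.flush_row_group();
            }
        }
    }

    /// Manual flush: close the current row group, even if it is
    /// smaller than the limit. Does nothing if no rows are buffered.
    fn flush_row_group(&mut self) {
        if !self.buffered.is_empty() {
            let group = std::mem::take(&mut self.buffered);
            self.row_groups.push(group);
        }
    }

    /// Flush any remaining rows and return all row groups.
    fn close(mut self) -> Vec<Vec<i64>> {
        self.flush_row_group();
        self.row_groups
    }
}
```

With a manual flush, the caller decides where row-group boundaries fall, so each batch of related rows can land in its own row group regardless of the size limit.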
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Cheappie commented on issue #1626: Expose ArrowWriter row group flush in public API

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #1626:
URL: https://github.com/apache/arrow-rs/issues/1626#issuecomment-1113984430

   Thank you very much for sharing your thoughts with me; I will keep them in mind.
   
   In my case I have a data structure similar to a row group, and within each such row group I keep **related** data. By manually sizing row groups, I would be able to fetch exactly the data I am interested in on the read path. Maybe I could achieve a similar outcome with PageIndex, but it's not ready yet and I am a bit short on time.




[GitHub] [arrow-rs] tustvold commented on issue #1626: Expose ArrowWriter row group flush in public API

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1626:
URL: https://github.com/apache/arrow-rs/issues/1626#issuecomment-1113803611

   I don't see any issue with exposing this (more power to the user), but some thoughts:
   
   - I wonder if you could just set the max row group size smaller if you want greater row group granularity.
   - For compressible data, more row groups will likely lead to larger files, which might actually be slower to read.
   - Similar to the above, the reader is designed to amortise per-row-group costs over many rows. This works less well with smaller row groups.
   - It is possible to prune at a more granular level; it just hasn't been implemented yet: https://github.com/apache/arrow-rs/issues/1191
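
The granularity side of this trade-off can be illustrated with a toy model of row-group-level pruning on min/max statistics. This is a sketch under stated assumptions, not the parquet crate's reader: `RowGroupStats`, `prune_row_groups`, and `rows_to_scan` are hypothetical names, and the model covers only an equality predicate on a single integer column. It deliberately ignores the costs listed above (poorer compression and per-row-group overhead).

```rust
/// Min/max statistics for one column of one row group.
/// Toy model; these are not the parquet crate's metadata types.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupStats {
    min: i64,
    max: i64,
    num_rows: usize,
}

/// Indices of the row groups that may contain `value`, judged by
/// min/max alone. Groups whose range excludes `value` are skipped.
fn prune_row_groups(stats: &[RowGroupStats], value: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= value && value <= s.max)
        .map(|(i, _)| i)
        .collect()
}

/// Number of rows that still have to be decoded after pruning.
fn rows_to_scan(stats: &[RowGroupStats], value: i64) -> usize {
    prune_row_groups(stats, value)
        .iter()
        .map(|&i| stats[i].num_rows)
        .sum()
}
```

In this model, one row group covering values 0 to 999 forces a scan of all of its rows for any matching lookup, while four row groups of 250 rows with disjoint ranges let the same lookup skip three of them. Whether that saving outweighs the extra per-row-group cost depends on the data and the reader.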




[GitHub] [arrow-rs] tustvold closed issue #1626: Expose ArrowWriter row group flush in public API

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1626: Expose ArrowWriter row group flush in public API
URL: https://github.com/apache/arrow-rs/issues/1626




[GitHub] [arrow-rs] Cheappie commented on issue #1626: Expose ArrowWriter row group flush in public API

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #1626:
URL: https://github.com/apache/arrow-rs/issues/1626#issuecomment-1114147020

   Hi @tustvold, I wonder whether we need a test case for the new function, or whether it can go without one, since it has a single-line body that delegates all logic to other functions?

