Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/19 19:08:02 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request #1617: Async ParquetExec

tustvold opened a new pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617


   **I've not thoroughly tested this yet; it may represent a performance regression, and it will likely need more memory.**
   
   # Which issue does this PR close?
   
   Closes #TBD.
   
   # Rationale for this change
   
   Avoids needing to use a blocking threadpool for reading Parquet data, and opens the door for fetching data from object storage.
   
   # What changes are included in this PR?
   
   Updates ParquetExec to use the POC async ParquetRecordBatchStream in https://github.com/apache/arrow-rs/pull/1154 
   
   # Are there any user-facing changes?
   
   Yes, I had to alter the ObjectReader trait.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1617: Async ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617#issuecomment-1038059315


   This can probably be refreshed after https://github.com/apache/arrow-datafusion/pull/1775 is merged.





[GitHub] [arrow-datafusion] tustvold edited a comment on pull request #1617: Async ParquetExec

Posted by GitBox <gi...@apache.org>.
tustvold edited a comment on pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617#issuecomment-1040101814


   I'm going to wait until after #1738 is merged and I have time to do some comparative investigation. Async is certainly not free, as the block structure of Parquet is not particularly amenable to streaming. Prior to merge I'd like a better handle on what the trade-offs are: I'd expect memory usage to be higher, but performance may be better, not sure :smile:
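
For readers unfamiliar with the file layout, one reason Parquet resists streaming is that the metadata locating every row group and column chunk sits in a footer at the end of the file, so a reader has to seek to the end before it can fetch any data. The sketch below is illustrative only and is not the arrow-rs or DataFusion API: `footer_range` and `fake_parquet` are hypothetical names, the "file" is an in-memory buffer with placeholder metadata bytes, and the Thrift decoding a real reader would do next is left out. It relies only on the documented trailer layout: a 4-byte little-endian metadata length followed by the `PAR1` magic.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Magic bytes that open and close an unencrypted Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: returns `(start_offset, length)` of the footer metadata.
/// Only the last 8 bytes are read: a 4-byte little-endian length, then the magic.
fn footer_range<R: Read + Seek>(reader: &mut R) -> io::Result<(u64, u32)> {
    let file_len = reader.seek(SeekFrom::End(0))?;
    // Smallest possible file: leading magic + length + trailing magic.
    if file_len < 12 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "too short"));
    }
    reader.seek(SeekFrom::End(-8))?;
    let mut trailer = [0u8; 8];
    reader.read_exact(&mut trailer)?;
    if trailer[4..] != MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    let metadata_len = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    let start = file_len
        .checked_sub(8 + u64::from(metadata_len))
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "length exceeds file size"))?;
    Ok((start, metadata_len))
}

/// Hypothetical helper: builds a buffer with the Parquet framing only.
/// The data and metadata bytes are placeholders, not valid Parquet content.
fn fake_parquet(data_len: usize, metadata_len: u32) -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(MAGIC);
    bytes.extend(std::iter::repeat(0u8).take(data_len));
    bytes.extend(std::iter::repeat(0xAAu8).take(metadata_len as usize));
    bytes.extend_from_slice(&metadata_len.to_le_bytes());
    bytes.extend_from_slice(MAGIC);
    bytes
}

fn main() -> io::Result<()> {
    let mut file = Cursor::new(fake_parquet(100, 20));
    let (start, len) = footer_range(&mut file)?;
    println!("footer metadata: {} bytes starting at offset {}", len, start);
    Ok(())
}
```

Against object storage, each of these seeks becomes a separate ranged request (trailer, then metadata, then column chunks), and the fetched chunks are typically buffered before decoding, which is consistent with the expectation of higher memory usage stated above.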





[GitHub] [arrow-datafusion] tustvold commented on pull request #1617: Async ParquetExec

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617#issuecomment-1066672253


   Running the benchmarks against a backported https://github.com/apache/arrow-rs/pull/1418 shows the performance of this to be significantly worse than the current sync reader (~2x). It is possible this is just a simple oversight with a simple fix, but I'm unlikely to have time to look into it this week. Async is far from free, so I'd expected the performance to be somewhat worse, but not that much worse...
   
   Hopefully I will have some time next week; otherwise I'll keep this pootling along on the backburner until it is ready for prime-time :smile:





[GitHub] [arrow-datafusion] tustvold commented on pull request #1617: Async ParquetExec

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617#issuecomment-1040101814


   I'm going to wait until after #1738 is merged and do some comparative investigation. Async is certainly not free, as the block structure of Parquet is not particularly amenable to streaming. Prior to merge I'd like a better handle on what the trade-offs are: I'd expect memory usage to be higher, but performance may be better, not sure :smile:





[GitHub] [arrow-datafusion] alamb commented on pull request #1617: Async ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617#issuecomment-1040633827


   https://github.com/apache/arrow-datafusion/pull/1738 is merged 🎉 





[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1617: Async ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1617:
URL: https://github.com/apache/arrow-datafusion/pull/1617#issuecomment-1040633827





