Posted to issues@arrow.apache.org by "YoungRX (via GitHub)" <gi...@apache.org> on 2023/04/10 13:33:57 UTC

[GitHub] [arrow] YoungRX opened a new issue, #35000: How to make Scanner read parquet files faster?

YoungRX opened a new issue, #35000:
URL: https://github.com/apache/arrow/issues/35000

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I previously used `ParquetFileReader` from `parquet/file_reader.h` to read Parquet files, and I implemented predicate push-down myself.
   
   I am now on Arrow 8.0.0 and have updated the code to read Parquet files with `AsyncScanner::ToRecordBatchReader()` and `ScannerRecordBatchReader::ReadNext()`, so that I can rely on the predicate push-down that Arrow implements internally.
   
   However, my environment does not support multithreading, so I set the following in `ScanOptions`:
   > use_threads = false;
   > batch_readahead = 0;
   > batch_size = 1000;
   > Other settings, such as `filter`, `projection`, and `dataset_schema`, are set as required.
   
   As a result, when scanning the same Parquet file with the same SQL statement, the new code takes 1.5 to 2.0 times as long to execute as the old code, which seems unreasonable to me.
   
   Is there an option that I have not set correctly?
   Or is it because multithreading and readahead are not enabled?
   Is there a way to make `Scanner` faster?
   
   
   
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] YoungRX commented on issue #35000: [C++] How to make Scanner read parquet files faster?

Posted by "YoungRX (via GitHub)" <gi...@apache.org>.
YoungRX commented on issue #35000:
URL: https://github.com/apache/arrow/issues/35000#issuecomment-1520284135

   Thanks. I see. 




[GitHub] [arrow] YoungRX closed issue #35000: [C++] How to make Scanner read parquet files faster?

Posted by "YoungRX (via GitHub)" <gi...@apache.org>.
YoungRX closed issue #35000: [C++] How to make Scanner read parquet files faster?
URL: https://github.com/apache/arrow/issues/35000




[GitHub] [arrow] westonpace commented on issue #35000: [C++] How to make Scanner read parquet files faster?

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35000:
URL: https://github.com/apache/arrow/issues/35000#issuecomment-1516358571

   That is a rather small batch size.  I don't know how much profiling or focus we've had on sizes that small.  You might try something larger, like 32k.  I would expect the scanner to be slightly slower than ParquetFileReader directly but not 2x.  If you don't support any multithreading, not even I/O readahead, then what advantages are you hoping to gain by using the scanner instead of ParquetFileReader?
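
To put the batch-size suggestion above in numbers, here is a minimal C++ sketch. It is not Arrow code: `BatchCount` and `FixedOverheadMicros` are hypothetical helper names, and the per-batch cost is an assumed input, not a measured value. It also assumes every batch except the last is filled completely, which a real scan may not do.

```cpp
#include <cstdint>

// Number of record batches needed to deliver `total_rows` rows when each
// batch holds at most `batch_size` rows (ceiling division).
// Assumption: every batch except possibly the last one is full.
constexpr std::int64_t BatchCount(std::int64_t total_rows,
                                  std::int64_t batch_size) {
  return (total_rows + batch_size - 1) / batch_size;
}

// Total fixed overhead in microseconds for a scan, given a hypothetical
// constant cost paid once per batch (`per_batch_cost_us`). The cost value
// must come from your own profiling; nothing here measures it.
constexpr double FixedOverheadMicros(std::int64_t total_rows,
                                     std::int64_t batch_size,
                                     double per_batch_cost_us) {
  return static_cast<double>(BatchCount(total_rows, batch_size)) *
         per_batch_cost_us;
}
```

For a file with 1,000,000 rows, `BatchCount` gives 1000 batches at `batch_size = 1000` and 31 batches at `batch_size = 32768`, so any cost that is paid once per batch is paid roughly 32 times more often with the smaller setting.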

