Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/24 08:12:11 UTC

[GitHub] [arrow-rs] jiacai2050 opened a new issue, #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

jiacai2050 opened a new issue, #2916:
URL: https://github.com/apache/arrow-rs/issues/2916

   **Which part is this question about**
   API Usage & Perf
   
   **Describe your question**
   
   I created two benchmarks based on the [example code](https://docs.rs/parquet/latest/parquet/arrow/async_reader/index.html), and in my environment this is what I got:
   - ParquetRecordBatchReader cost 4s
   - ParquetRecordBatchStream cost 5s
   
   The tested data is:
   - total rows: 40935755
   - row groups: 4998
   
   This is the schema of parquet file
   ```
   message arrow_schema {
     required int64 tsid (INTEGER(64,false));
     required int64 enddate (TIMESTAMP(MILLIS,false));
     optional int64 id;
     optional int64 code;
     optional binary source (STRING);
     optional int64 innercode;
     optional int64 del;
     optional int64 jsid;
     optional int64 updatetime (TIMESTAMP(MILLIS,false));
     optional double weight;
   }
   
   ```
   
   **Additional context**
    I dug into Parquet's source code and found that both call `build_array_reader` to read the parquet file, so the difference is likely above this layer.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2916:
URL: https://github.com/apache/arrow-rs/issues/2916#issuecomment-1289456560

   This is expected, see the investigation under https://github.com/apache/arrow-rs/issues/1473.
   
   The TLDR is that in the absence of resource contention, synchronous blocking code will often outperform the corresponding asynchronous code. This is especially true of file IO, where there aren't stable non-blocking operating system APIs, and so tokio implements this by offloading the task of reading from the files to a separate blocking thread pool. Eventually projects like [tokio-uring](https://github.com/tokio-rs/tokio-uring) may address this.
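   The offload pattern described above can be illustrated with a minimal std-only sketch (no tokio): the "async-style" path hands the file read to a separate thread and receives the result over a channel, which is the extra cross-thread hop that synchronous code avoids. The names `read_sync` and `read_offloaded` are illustrative; tokio's real implementation uses a managed blocking thread pool rather than a fresh thread per read.
   
   ```rust
   use std::fs::File;
   use std::io::Read;
   use std::path::{Path, PathBuf};
   use std::sync::mpsc;
   use std::thread;
   
   /// Synchronous path: the calling thread blocks on the read directly.
   fn read_sync(path: &Path) -> std::io::Result<Vec<u8>> {
       let mut buf = Vec::new();
       File::open(path)?.read_to_end(&mut buf)?;
       Ok(buf)
   }
   
   /// Offloaded path, roughly how tokio models file IO: the blocking read
   /// runs on a separate thread and the result comes back over a channel.
   /// The spawn and channel hand-off are pure overhead when nothing else
   /// needs the caller's thread; tokio amortizes the spawn with a pool,
   /// but the cross-thread hop remains.
   fn read_offloaded(path: &Path) -> std::io::Result<Vec<u8>> {
       let (tx, rx) = mpsc::channel();
       let path: PathBuf = path.to_path_buf();
       thread::spawn(move || {
           let _ = tx.send(read_sync(&path));
       });
       rx.recv().expect("worker thread panicked")
   }
   
   fn main() -> std::io::Result<()> {
       let path = std::env::temp_dir().join("offload_demo.bin");
       std::fs::write(&path, vec![42u8; 1024])?;
       // Both paths return the same bytes; only the threading differs.
       assert_eq!(read_sync(&path)?, read_offloaded(&path)?);
       std::fs::remove_file(&path)?;
       println!("both paths read the same 1024 bytes");
       Ok(())
   }
   ```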
   
   The advantage of async comes where either:
   
   * You are communicating over some network connection, e.g. to object storage
   * There is resource contention, where instead of blocking the thread on IO, you could be getting on with processing some other part of the query
   
   Async is about efficiently multiplexing work; if you don't have anything to multiplex, you aren't going to see a return from it.
   
   
   
   




[GitHub] [arrow-rs] tustvold commented on issue #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2916:
URL: https://github.com/apache/arrow-rs/issues/2916#issuecomment-1291092634

   > As for now, I think a practical solution is to create two parquet readers and choose one depending on whether the files are local or in remote object storage.
   
   We used to do something similar in DataFusion; however, on a contended system that is performing other query processing tasks it was not found to make an appreciable difference, so we moved away from this approach.




[GitHub] [arrow-rs] jiacai2050 closed issue #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

Posted by GitBox <gi...@apache.org>.
jiacai2050 closed issue #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader
URL: https://github.com/apache/arrow-rs/issues/2916




[GitHub] [arrow-rs] jiacai2050 commented on issue #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on issue #2916:
URL: https://github.com/apache/arrow-rs/issues/2916#issuecomment-1289932431

   Thanks for the quick reply; your point makes sense to me.
   
   I just didn't expect it to be 20% slower; maybe `io-uring` is one solution, and I will try it in future development.
   
   As for now, I think a practical solution is to create two parquet readers and choose one depending on whether the files are local or in remote object storage.
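   A minimal std-only sketch of this dispatch idea, assuming the location is given as a path or URL string: `ReaderKind` and `pick_reader` are hypothetical names, not arrow-rs APIs, and a real implementation would construct the sync or async parquet reader behind each variant.
   
   ```rust
   /// Which reader to build for a given data location (hypothetical type).
   #[derive(Debug, PartialEq)]
   enum ReaderKind {
       /// Synchronous reader for local files.
       SyncLocal,
       /// Async stream for remote object storage.
       AsyncRemote,
   }
   
   fn pick_reader(location: &str) -> ReaderKind {
       // Treat URL-like locations (s3://, http://, ...) as remote;
       // everything else, including file:// URLs, as local filesystem.
       if location.contains("://") && !location.starts_with("file://") {
           ReaderKind::AsyncRemote
       } else {
           ReaderKind::SyncLocal
       }
   }
   
   fn main() {
       assert_eq!(pick_reader("/data/part-0.parquet"), ReaderKind::SyncLocal);
       assert_eq!(pick_reader("file:///data/part-0.parquet"), ReaderKind::SyncLocal);
       assert_eq!(pick_reader("s3://bucket/part-0.parquet"), ReaderKind::AsyncRemote);
       println!("dispatch ok");
   }
   ```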




[GitHub] [arrow-rs] jiacai2050 commented on issue #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on issue #2916:
URL: https://github.com/apache/arrow-rs/issues/2916#issuecomment-1292160909

   Perf in practice is hard to measure; there are many factors to consider.
   
   In my case, I re-tested with a parquet file of 104022899 rows (4 GB); the cost was 10s vs 15s, a 50% slowdown.
   
   Hope this data can help others with a similar issue.

