Posted to dev@parquet.apache.org by "parthchandra (via GitHub)" <gi...@apache.org> on 2023/02/24 17:38:47 UTC

[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1444106912

   > Hi, I am very interested in this optimization and just have some questions after testing in a cluster with 4 nodes/96 cores using Spark 3.1. Unfortunately, I see little improvement.
   
   You're likely to see an improvement in cases where file I/O is the bottleneck. Most TPC-DS queries are join-heavy, so you will see little improvement there. You might do better with TPC-H.
   
   > I am confused then about whether it is necessary to keep spark.sql.parquet.enableVectorizedReader = false in Spark when testing with Spark 3.1, and how I can set the Parquet buffer size.
   
   It's probably best to keep the parquet (read) buffer size untouched.
   
   You should keep `spark.sql.parquet.enableVectorizedReader = true` irrespective of this feature. This feature improves the I/O speed of reading raw data from storage. The Spark vectorized reader kicks in after the data has been read; it converts the raw data into Spark's internal columnar representation and is faster than the row-based version.
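   
   To make that concrete, here is a minimal sketch (not from this PR) of a scan-dominated Spark job in Java with the vectorized reader left enabled, the kind of read-heavy workload where an I/O-bound bottleneck would show up. The dataset path is a placeholder; assumes a Spark 3.1 cluster with this build of parquet-mr on the classpath:
   
   ```java
   // Minimal sketch, assuming Spark 3.1 and a parquet-mr build from this PR.
   // The dataset path below is a hypothetical placeholder.
   import org.apache.spark.sql.SparkSession;
   
   public class AsyncIoReadTest {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("parquet-async-io-read-test")
           // Keep the vectorized reader enabled: it runs after the raw bytes
           // are read and converts them to Spark's columnar representation.
           .config("spark.sql.parquet.enableVectorizedReader", "true")
           .getOrCreate();
   
       // A full scan keeps file I/O on the critical path, which is where the
       // async reader is expected to help (unlike join-heavy TPC-DS queries).
       long rows = spark.read().parquet("/path/to/parquet/dataset").count();
       System.out.println("rows read: " + rows);
   
       spark.stop();
     }
   }
   ```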
   

