Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/07 08:50:28 UTC

[GitHub] [iceberg] smallx opened a new issue #2562: Spark on iceberg table is slower than spark on hive parquet table

smallx opened a new issue #2562:
URL: https://github.com/apache/iceberg/issues/2562

In our tests, Spark on an Iceberg table is slower than Spark on a Hive Parquet table: performance drops by about half. Tuning some table properties improved Iceberg's performance, but it is still slower than plain Parquet. The tuned properties are as follows.

```sql
ALTER TABLE iceberg_table SET TBLPROPERTIES ('write.parquet.compression-codec'='snappy');
ALTER TABLE iceberg_table SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='true');
ALTER TABLE iceberg_table SET TBLPROPERTIES ('read.parquet.vectorization.batch-size'='100000');
ALTER TABLE iceberg_table SET TBLPROPERTIES ('commit.manifest.min-count-to-merge'='2');
```
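
For reference, the same four properties can also be set in a single statement and then checked with `SHOW TBLPROPERTIES`. This is only a sketch: it assumes the same `iceberg_table` and a Spark session with the Iceberg catalog configured. Note that `write.parquet.compression-codec` applies only to data files written after the change, so existing files keep their original codec.

```sql
-- Set all four properties at once (same values as above).
ALTER TABLE iceberg_table SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'snappy',
  'read.parquet.vectorization.enabled' = 'true',
  'read.parquet.vectorization.batch-size' = '100000',
  'commit.manifest.min-count-to-merge' = '2'
);

-- List the table's properties to confirm the values were applied.
SHOW TBLPROPERTIES iceberg_table;
```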

Iceberg's vectorized read path seems to perform worse than Spark's. The vectorized reading code contains many for loops and function calls; see the following code.

https://github.com/apache/iceberg/blob/a2103b7131bb039c531a2c9f70c7c8b9fe9715ac/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java#L42-L70

My guess may be wrong; please help analyze the cause. Thanks very much.

Version info:
- Spark 3.0.2
- Iceberg 0.11.1

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on issue #2562: Spark on iceberg table is slower than spark on hive parquet table

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2562:
URL: https://github.com/apache/iceberg/issues/2562#issuecomment-837394219


Do you have a reproducible benchmark? That would be helpful; I try not to guess at the root causes of performance differences without benchmarks and instrumented code.

