Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/24 10:04:28 UTC

[GitHub] [iceberg] andrey-mindrin opened a new issue #4217: Slow performance on TPC-DS tests

andrey-mindrin opened a new issue #4217:
URL: https://github.com/apache/iceberg/issues/4217


   We tested Iceberg performance against the Hive table format using the Spark TPC-DS performance tests from Databricks (scale factor 1000) and found that performance on the Iceberg tables was 50% lower.
   
   Environment:
   1. On-premises cluster running Spark 3.1.2 with Iceberg 0.13.0; both runs used the same number of executors, cores, memory, etc.
   2. Parquet codec: snappy
   3. Tables were partitioned the same way as the original Hive tables.
   4. Tables were copy-on-write (COW), created in Spark from the Hive tables with CTAS.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1050935112


   Could you paste what q7-v2.4 is? My guess is that if the join types are
   swapped, then either the size estimation is happening differently in the Hive
   and Iceberg code *or* the pushdown is not working correctly. Spark should be
   planning the same logical join in both cases; the only difference should be
   the transition to a physical plan, which depends on the source reporting
   how much data is on each side of the join.
   
   That would happen if predicate pushdown isn't working, or it could be
   that our size estimate code has a bug with certain predicates. A full Spark
   EXPLAIN of the query should show at least the predicate portion, and the Spark
   UI should report the input sizes.
   
   One more note: Hive has a much more general notion of filter compatibility, and it could be that
   Iceberg is failing to push down predicates because it is stricter. For example, if the
   predicate uses an Int but the column is a String, Hive may implicitly convert and apply
   the predicate when Iceberg does not. We would need to see the plans for the Hive and Iceberg tables to know for sure.
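   
   One way to collect that information is sketched below. It assumes a spark-shell session on Spark 3.x, and the helper name `showPlanAndSizes` is made up for illustration. It prints the formatted plan (whose scan nodes typically list the filters that were pushed down) and the size estimate Spark holds for each leaf relation.
   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical helper: print the plan and per-relation size estimates for a query.
   def showPlanAndSizes(spark: SparkSession, sql: String): Unit = {
     val df = spark.sql(sql)

     // Formatted EXPLAIN output: join strategy plus details of each scan node.
     df.explain("formatted")

     // Size estimates of the leaf relations in the optimized logical plan.
     df.queryExecution.optimizedPlan.collectLeaves().foreach { leaf =>
       println(s"${leaf.nodeName}: estimated sizeInBytes = ${leaf.stats.sizeInBytes}")
     }

     // The threshold those estimates are compared against when choosing a broadcast join.
     println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
   }
   ```
   Running it once against the Hive tables and once against the Iceberg tables, with the text of q7-v2.4 as `sql`, should make any difference in pushed filters or estimated sizes visible side by side.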
   
   On Fri, Feb 25, 2022 at 2:31 AM andrey-mindrin ***@***.***>
   wrote:
   
   > Additional information. Total size of data is about 300 GB in Parquet,
   > tried different options for Iceberg tables including vectorization,
   > distributed by hash and range. Iceberg tables created with a query like
   >
   > create table ice_table using iceberg partitioned by (column) location
   > 'hdfs://location'
   > tblproperties('vectorization-enabled'=true,'write.parquet.compression-codec'='snappy',
   > 'write.distribution-mode'='hash') as select * from hive_table
   >
   > Analyzing query plans in Spark for query q7-v2.4 shows different join
   > types: SortMergeJoin in Iceberg and BroadcastHashJoin in Hive tables.


[GitHub] [iceberg] andrey-mindrin commented on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
andrey-mindrin commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1062800028


   I can confirm that Iceberg performance on Spark 3.2 is similar to that of Hive tables. It seems the problem was only in Spark 3.1.




[GitHub] [iceberg] RussellSpitzer commented on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1049995454


   Could you add some more details, such as which particular tests were slower? How did you configure the Iceberg table? What kind of storage did you use?
   
   It may be helpful to see the exact commands run to replicate the test.




[GitHub] [iceberg] ConeyLiu edited a comment on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
ConeyLiu edited a comment on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1061735573


   We have also benchmarked Iceberg with TPC-DS and have the following findings:
   1. As @wypoon said, Spark estimates relation size differently when reading Parquet and when reading Iceberg, which leads to different query plans, for example a BroadcastJoin becoming a SortMergeJoin.
   2. Spark reads Parquet with vectorized reading enabled by default, whereas we only recently enabled it by default in Iceberg. You can enable it yourself, which can improve read performance a lot. I also have a PR (https://github.com/apache/iceberg/pull/3249) that optimizes Iceberg's vectorized reading of Parquet decimals.
   3. Spark supports DPP (dynamic partition pruning) for built-in data sources, but it has supported DPP for Iceberg-like data sources only since Spark 3.2. This can affect TPC-DS performance a lot.
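   
   As a hedged illustration of item 2: in Iceberg, vectorized Parquet reads are controlled by the table property `read.parquet.vectorization.enabled`, and the Spark read option `vectorization-enabled` overrides it for a single read. The sketch below assumes a spark-shell session (where `spark` is predefined) with an Iceberg catalog configured; the table name `tpcds_iceberg.store_sales` is only an example.
   ```scala
   // Example table name; substitute one of your own Iceberg tables.
   val table = "tpcds_iceberg.store_sales"

   // Enable vectorized Parquet reads for every reader of this table.
   spark.sql(s"ALTER TABLE $table SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='true')")

   // Or override the table property for a single read.
   val df = spark.read.format("iceberg").option("vectorization-enabled", "true").load(table)
   ```
   Note that `vectorization-enabled` is a read option rather than a table property, so it may be worth checking which of the two a given table actually has set.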




[GitHub] [iceberg] andrey-mindrin commented on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
andrey-mindrin commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1050642335


   Additional information: the total size of the data is about 300 GB in Parquet. We tried different options for the Iceberg tables, including vectorization and distribution by hash and range. The Iceberg tables were created with a query like
   
   `create table ice_table using iceberg partitioned by (column) location 'hdfs://location' tblproperties('vectorization-enabled'=true,'write.parquet.compression-codec'='snappy','write.distribution-mode'='hash') as select * from hive_table`
   
   Analyzing the query plans in Spark for query q7-v2.4 shows different join types: SortMergeJoin for the Iceberg tables and BroadcastHashJoin for the Hive tables.
   




[GitHub] [iceberg] andrey-mindrin commented on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
andrey-mindrin commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1050040773


   We used https://github.com/databricks/spark-sql-perf and tested on HDFS. Try q7-v2.4, for example.




[GitHub] [iceberg] wypoon commented on issue #4217: Slow performance on TPC-DS tests

Posted by GitBox <gi...@apache.org>.
wypoon commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1058730372


   I have done some experiments along these lines too. I created the Iceberg tables from the TPC-DS Hive tables by using the snapshot procedure to create them in a different database:
   ```
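   // Target database for the Iceberg snapshot tables.
   spark.sql("create database tpcds_iceberg")
   spark.sql("use tpcds")
   val tables = spark.sql("show tables")
   // Column 1 of the SHOW TABLES output is the table name.
   tables.collect().map(r => r(1).toString).foreach(t =>
     // Snapshot each Hive table as an Iceberg table over the same data files.
     spark.sql(s"call spark_catalog.system.snapshot('tpcds.$t', 'tpcds_iceberg.$t')")
   )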
   ```
   This creates Iceberg tables backed by the same underlying data. As we're not writing to the tables, this does not create any problems.
   For the same value of `spark.sql.autoBroadcastJoinThreshold`, Spark will use a broadcast join in some of the TPC-DS queries on Hive tables but a SortMergeJoin on the Iceberg tables. This is because Spark estimates the size of a relation differently for native tables than Iceberg does. For native tables, Spark bases its estimate on file size. File size can be a significant underestimate, as in your case with Snappy-compressed Parquet files, since most folks don't set `spark.sql.sources.fileCompressionFactor` (which defaults to 1.0). Iceberg estimates the size of the relation by multiplying the estimated width of the requested columns by the number of rows. In my original commit for https://github.com/apache/iceberg/pull/3038, I used the same approach to estimating the size of the relation that Spark uses for `FileScan`s, but @rdblue suggested the approach that was actually adopted.
   To get Spark to use broadcast joins for the Iceberg tables, you can set a higher value for `spark.sql.autoBroadcastJoinThreshold`. The problem is that it is hard to make an apples-to-apples comparison in this case.
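   
   To make the two estimates concrete, here is a small worked sketch in plain Scala (no Spark needed). The numbers (a 4 MB Snappy-compressed table with 2,000,000 rows and a 40-byte estimated row width) are invented for illustration, and the helper names are hypothetical; the formulas are simplifications of what is described above, not the actual Spark or Iceberg code.
   ```scala
   object SizeEstimateSketch {
     // Spark's default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
     val defaultBroadcastThreshold: Long = 10L * 1024 * 1024

     // File-size-based estimate (native tables): bytes on disk times the
     // compression factor, which defaults to 1.0.
     def fileBasedEstimate(totalFileBytes: Long, compressionFactor: Double = 1.0): Long =
       (totalFileBytes * compressionFactor).toLong

     // Row-based estimate (Iceberg): row count times estimated row width.
     def rowBasedEstimate(rowCount: Long, estimatedRowWidthBytes: Long): Long =
       rowCount * estimatedRowWidthBytes

     // A relation is a broadcast candidate when its estimate fits under the threshold.
     def wouldBroadcast(estimate: Long, threshold: Long = defaultBroadcastThreshold): Boolean =
       estimate <= threshold

     def main(args: Array[String]): Unit = {
       val byFileSize = fileBasedEstimate(4L * 1024 * 1024) // made-up 4 MB of Parquet files
       val byRowCount = rowBasedEstimate(2000000L, 40L)     // made-up row count and width
       println(s"file-based: $byFileSize bytes, broadcast = ${wouldBroadcast(byFileSize)}")
       println(s"row-based:  $byRowCount bytes, broadcast = ${wouldBroadcast(byRowCount)}")
     }
   }
   ```
   With these made-up numbers, the file-based estimate falls under the default 10 MB threshold while the row-based estimate does not, which is the kind of gap that can turn a BroadcastHashJoin into a SortMergeJoin. In a Spark session, the threshold can be raised with, for example, `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600")`.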



