Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/05 10:35:46 UTC

[GitHub] [iceberg] shardulm94 opened a new pull request #2218: ORC: Grow list and map child vectors with a growth factor of 3

shardulm94 opened a new pull request #2218:
URL: https://github.com/apache/iceberg/pull/2218


   A performance issue was reported in [Iceberg Slack](https://the-asf.slack.com/archives/CF01LKV9S/p1612381780127800) where the ORC writer performed about 400x worse than the Parquet writer for a schema containing a list with a large number of elements.
   
   Profiling showed that the majority of the time was spent growing the child element vectors for list and map types. Currently we grow these vectors only by the size required to fit the current row. As a result, growth is triggered for every non-empty row, and each growth results in an array copy underneath.
   
   We now grow the vectors by a growth factor of 3 to avoid frequent array allocations and copies. The growth factor of 3 was taken from [ORC's MapReduce writer](https://github.com/apache/orc/blob/2981629a13b9d262570c73974bc09dabcad0184d/java/mapreduce/src/java/org/apache/orc/mapred/OrcMapredRecordWriter.java#L59).
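   
   The effect of the growth factor can be illustrated with a minimal, self-contained sketch. It does not use the ORC or Iceberg classes: `ChildVector`, `ensureSize`, and `countReallocations` are hypothetical stand-ins, and the initial capacity of 1024 is an assumption. It only counts how often the backing array has to be copied under each policy.
   
   ```java
   import java.util.Arrays;
   
   final class GrowthFactorSketch {
       // Hypothetical stand-in for a list/map child vector: a backing array
       // plus a counter of how many times it had to be reallocated.
       static final class ChildVector {
           long[] values = new long[1024]; // assumed initial capacity
           int reallocations = 0;
   
           // Grow the backing array to at least `size`, preserving existing data.
           void ensureSize(int size) {
               if (size > values.length) {
                   values = Arrays.copyOf(values, size); // the costly array copy
                   reallocations++;
               }
           }
       }
   
       // Appends `rows` rows of `elementsPerRow` child elements each and returns
       // the number of reallocations. growthFactor = 1 models growing only to the
       // size the current row needs; growthFactor = 3 models the patched behavior.
       static int countReallocations(int rows, int elementsPerRow, int growthFactor) {
           ChildVector child = new ChildVector();
           int childCount = 0;
           for (int r = 0; r < rows; r++) {
               int required = childCount + elementsPerRow;
               if (required > child.values.length) {
                   child.ensureSize(required * growthFactor);
               }
               childCount = required;
           }
           return child.reallocations;
       }
   
       public static void main(String[] args) {
           System.out.println("exact fit:   " + countReallocations(100, 2000, 1));
           System.out.println("factor of 3: " + countReallocations(100, 2000, 3));
       }
   }
   ```
   
   With 100 rows of 2,000 elements each, the exact-fit policy reallocates on every row, while the factor-of-3 policy reallocates only a handful of times, at the cost of some over-allocated memory.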
   
   I have added the table schema and data used for profiling as JMH tests below. After the patch, the performance of the ORC writer improved substantially (for example, `writeIceberg2000` dropped from 136.903 s/op to 8.696 s/op). However, it is still well behind Parquet. The remaining difference is attributable to the use of a red-black tree in ORC's implementation of dictionary encoding. Turning off dictionary encoding makes ORC faster than Parquet (15.348 s/op versus 24.711 s/op for `writeIceberg20000`). Recently, there has been activity in [ORC-50](https://issues.apache.org/jira/browse/ORC-50) to use hash tables instead of red-black trees. It would be interesting to see the impact of that change on these JMH tests.
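   
   To make the dictionary point concrete, here is a rough sketch of dictionary encoding backed by either kind of map. It is not ORC's implementation: `DictionarySketch` and `encode` are hypothetical names, and `java.util.TreeMap` (a red-black tree) and `java.util.HashMap` are used only as stand-ins for the two data structures. Both produce the same ids; the tree pays O(log n) key comparisons per lookup, while the hash table is expected O(1).
   
   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;
   
   final class DictionarySketch {
       // Replaces each value with an integer id, assigning ids in order of
       // first appearance. The map passed in decides the lookup cost.
       static int[] encode(List<String> values, Map<String, Integer> dictionary) {
           int[] ids = new int[values.size()];
           for (int i = 0; i < values.size(); i++) {
               String value = values.get(i);
               Integer id = dictionary.get(value);
               if (id == null) {
                   id = dictionary.size(); // next unused id
                   dictionary.put(value, id);
               }
               ids[i] = id;
           }
           return ids;
       }
   
       public static void main(String[] args) {
           List<String> values = Arrays.asList("b", "a", "b", "c", "a");
           // Red-black tree backed dictionary.
           System.out.println(Arrays.toString(encode(values, new TreeMap<>())));
           // Hash table backed dictionary.
           System.out.println(Arrays.toString(encode(values, new HashMap<>())));
       }
   }
   ```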
   
   
   ### ORC Before:
   ```
   Benchmark                                                                    Mode  Cnt    Score   Error  Units
   IcebergSourceNestedListORCDataWriteBenchmark.writeIceberg2000                  ss    5  136.903 ± 2.239   s/op
   IcebergSourceNestedListORCDataWriteBenchmark.writeIceberg20000                 ss    5  245.252 ± 9.533   s/op
   IcebergSourceNestedListORCDataWriteBenchmark.writeIceberg20000DictionaryOff    ss    5  147.087 ± 7.717   s/op
   ```
   
   ### ORC After:
   ```
   Benchmark                                                                   Mode  Cnt    Score   Error  Units
   IcebergSourceNestedListORCDataWriteBenchmark.writeIceberg2000                 ss    5    8.696 ± 0.429   s/op
   IcebergSourceNestedListORCDataWriteBenchmark.writeIceberg20000                ss    5  116.346 ± 5.737   s/op
   IcebergSourceNestedListORCDataWriteBenchmark.writeIceberg20000DictionaryOff   ss    5   15.348 ± 0.888   s/op
   ```
   
   ### Parquet:
   ```
   Benchmark                                                                   Mode  Cnt    Score   Error  Units
   IcebergSourceNestedListParquetDataWriteBenchmark.writeIceberg2000             ss    5    2.692 ± 0.208   s/op
   IcebergSourceNestedListParquetDataWriteBenchmark.writeIceberg20000            ss    5   24.711 ± 0.335   s/op
   ```
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2218: ORC: Grow list and map child vectors with a growth factor of 3

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2218:
URL: https://github.com/apache/iceberg/pull/2218


   




[GitHub] [iceberg] rdblue commented on pull request #2218: ORC: Grow list and map child vectors with a growth factor of 3

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2218:
URL: https://github.com/apache/iceberg/pull/2218#issuecomment-774379247


   Thanks, @shardulm94! Great to have this fixed!

