
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/04/08 14:30:53 UTC

[GitHub] [incubator-iceberg] aokolnychyi commented on issue #105: Basic Benchmarks for Iceberg Spark Data Source

URL: https://github.com/apache/incubator-iceberg/pull/105#issuecomment-480856258

Thanks for your feedback, @danielcweeks!

I think one of the most important questions is how we want to use such benchmarks.

There might be a few options.
1. Contributors verify the performance impact of their fixes/improvements by running a particular benchmark with and without their change.
2. Users run benchmarks to try out Iceberg and see how it compares to the built-in Spark file source.
3. Committers run these benchmarks before every release to see the performance difference and prevent any degradation.

It would be great to support option 3, but I believe we are far from that. Instead, we can focus on a generic framework for options 1 and 2. Ideally, we should be able to run it locally as well as on existing datasets (e.g., by pointing it to a dataset in HDFS).

I agree the file skipping benchmarks are a bit controversial. The main idea was to show that Iceberg doesn't touch irrelevant files and therefore speeds up highly selective queries. However, these benchmarks would make more sense on real data. So, we can either remove the file skipping benchmarks completely or try to make them generic enough that users can also run them on real datasets.
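
To illustrate what "not touching irrelevant files" means, here is a minimal Python sketch. It is not Iceberg's implementation; it only assumes that each data file records a min/max value for a column, and `DataFile`, `min_id`, `max_id`, and `files_to_scan` are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical per-file metadata: path plus min/max of an "id" column.
    path: str
    min_id: int
    max_id: int


def files_to_scan(files, lower, upper):
    """Keep only files whose [min_id, max_id] range overlaps [lower, upper]."""
    return [f for f in files if f.max_id >= lower and f.min_id <= upper]


files = [
    DataFile("part-0.parquet", 0, 999),
    DataFile("part-1.parquet", 1000, 1999),
    DataFile("part-2.parquet", 2000, 2999),
]

# A highly selective predicate (id = 1500) overlaps a single file,
# so the other two files never need to be opened.
selected = files_to_scan(files, 1500, 1500)
print([f.path for f in selected])  # ['part-1.parquet']
```

With synthetic data like this, the ranges are perfectly disjoint, which is part of why such a benchmark looks better than it might on real datasets where value ranges overlap across files.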

It definitely makes sense to have a benchmark for Parquet readers alone. That would be a fair comparison. However, I think it is still useful to see the end-to-end performance, which covers a lot of aspects. For example, the read/write path for Spark Data Source V2 can be a bit slower. We need to catch such things and fix them.

Excluding the results makes sense to me.

@danielcweeks @prodeezy @rdblue, it would be really great to hear your opinions on the future of such benchmarks. If we decide to keep them, I will update the PR.
