Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/27 08:55:36 UTC

[GitHub] aokolnychyi commented on issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

URL: https://github.com/apache/spark/pull/23171#issuecomment-467778314
 
 
   I thought a bit more about where this logic should live and would like to discuss it with the community once again. For now, I am keeping some of the review comments open.
   
   There was a consensus on having `OptimizedIn` and converting `In` to `OptimizedIn` whenever all values are literals. In my view, the main idea of `InSet` was to optimize cases where the list of values is long, not simply cases where all values are literals. If we ignore the size of the list while converting `In` to `InSet` (or `OptimizedIn`), we will worsen the performance for small lists of structs, arrays, timestamps, small decimals, floats, doubles, and longs.
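   To make the boxing concern concrete, here is a minimal Java sketch (illustrative only, not Spark codegen) of the two evaluation shapes. An `In`-style check compares the probe value against unboxed literals directly, while a `java.util.HashSet`-backed check (analogous to Scala's `HashSet` in `InSet`) autoboxes the primitive on every `contains` call:

```java
import java.util.HashSet;
import java.util.Set;

public class BoxingSketch {
    // In-style evaluation: sequential comparisons against unboxed int literals.
    public static boolean inListCheck(int value) {
        return value == 1 || value == 5 || value == 9;
    }

    // InSet-style evaluation over a boxed set: every probe boxes the
    // primitive (value -> Integer.valueOf(value)) before the lookup.
    static final Set<Integer> SET = new HashSet<>();
    static {
        SET.add(1); SET.add(5); SET.add(9);
    }
    public static boolean inSetCheck(int value) {
        return SET.contains(value); // autoboxing happens here on each call
    }
}
```

   For a handful of values the boxed lookup adds allocation and hashing overhead that the plain comparisons avoid, which is why an unconditional `In` → `InSet` conversion can regress small lists.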
   
   I believe having a threshold to convert `In` to `InSet` still makes sense. If we convert `In` to `OptimizedIn` whenever all values are literals, `OptimizedIn` will still need to generate if-else code in some cases to match the performance of `In`. I don’t think we want to do that.
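   For illustration, here is roughly what the two generated code shapes look like in plain Java (a sketch, not actual Spark codegen output). An if-else chain costs O(n) comparisons, while a switch over integer literals lets the compiler emit a `tableswitch`/`lookupswitch` bytecode instruction with near-constant dispatch:

```java
public class DispatchSketch {
    // If-else chain: each element adds another comparison on the hot path.
    public static boolean ifElseContains(int v) {
        if (v == 10) return true;
        else if (v == 20) return true;
        else if (v == 30) return true;
        else return false;
    }

    // Switch over int literals: javac compiles this to a single
    // tableswitch/lookupswitch, so cost barely grows with list length.
    public static boolean switchContains(int v) {
        switch (v) {
            case 10:
            case 20:
            case 30:
                return true;
            default:
                return false;
        }
    }
}
```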
   
   Locally, I tried Spark's `OpenHashSet` and primitive collections from `fastutil` to solve the boxing issue in `InSet`. Both options significantly reduced memory consumption, and `fastutil` also improved the running time compared to Scala's `HashSet`. However, the switch-based approach was still more than two times faster, even on 500+ non-compact elements. I tested this threshold for every data type, but we can still lower it a bit to be absolutely sure.
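   For readers unfamiliar with the primitive-collection approach, a minimal open-addressing set over unboxed ints might look like the sketch below (illustrative only; Spark's `OpenHashSet` and `fastutil`'s `IntOpenHashSet` are far more complete). The point is that both storage and probing stay on primitives, avoiding the per-lookup boxing of a `HashSet`:

```java
// Minimal open-addressing hash set over primitive ints (linear probing,
// no resizing or removal; capacity must exceed the number of elements).
public class IntSetSketch {
    private final int[] slots;
    private final boolean[] used;

    public IntSetSketch(int expectedElements) {
        // Power-of-two capacity (at least 2x elements) so we can mask
        // instead of taking a modulo; keeps load factor <= 0.5.
        int cap = Integer.highestOneBit(Math.max(4, expectedElements * 2)) * 2;
        slots = new int[cap];
        used = new boolean[cap];
    }

    private int firstSlot(int v) {
        // Cheap multiplicative hash scramble, then mask into the table.
        return ((v * 0x9E3779B9) >>> 1) & (slots.length - 1);
    }

    public void add(int v) {
        int i = firstSlot(v);
        while (used[i] && slots[i] != v) i = (i + 1) & (slots.length - 1);
        used[i] = true;
        slots[i] = v;
    }

    public boolean contains(int v) {
        int i = firstSlot(v);
        while (used[i]) {
            if (slots[i] == v) return true;
            i = (i + 1) & (slots.length - 1);
        }
        return false;
    }
}
```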
   
   I also noticed that applying the switch-based approach to fewer than 10 elements gives only a relatively minor improvement over the if-else approach. Therefore, I placed the switch-based logic into `InSet` and added a new config to track when it is applied. Even if we migrate to primitive collections at some point, the switch logic will still be faster unless the number of elements is really big. Another option is to have a separate `InSwitch` expression; however, that would mean adapting other places (e.g., `DataSourceStrategy`).
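   The resulting decision logic could be sketched as follows (the class name, method, and threshold values here are hypothetical illustrations based on the benchmarks above, not Spark's actual code or configuration):

```java
public class PlanChoiceSketch {
    // Hypothetical thresholds: below SWITCH_MIN an if-else chain is
    // competitive; above SWITCH_MAX a (primitive) hash set may win again.
    static final int SWITCH_MIN = 10;
    static final int SWITCH_MAX = 500;

    public enum Strategy { IF_ELSE_CHAIN, SWITCH, HASH_SET }

    // Pick an evaluation strategy from the number of literal values.
    public static Strategy choose(int numLiterals) {
        if (numLiterals < SWITCH_MIN) return Strategy.IF_ELSE_CHAIN;
        if (numLiterals <= SWITCH_MAX) return Strategy.SWITCH;
        return Strategy.HASH_SET;
    }
}
```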
   
   As I said, let’s agree on where this logic should be and then I’ll address other comments.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org