You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hemant Bhanawat (JIRA)" <ji...@apache.org> on 2016/04/01 06:44:25 UTC

[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

    [ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221145#comment-15221145 ] 

Hemant Bhanawat commented on SPARK-13900:
-----------------------------------------

As I understand, on table A and B, a nested loop join (that will produce m X n rows) is performed and than each row is evaluated to see if any of the condition is met. You are asking that Spark should instead do a BroadcastHashJoin on the equality conditions in parallel and then union the results like you are doing in a different query.

If we leave aside parallelism for a moment, theoretically, time taken for nested loop join would vary little when the number of conditions are increased while the time taken for the solution that you are suggesting would increase linearly with number of conditions. So, when number of conditions are too many, nested loop join would be faster than the solution that you suggest. Now the question is, how should Spark decide when to do what? 

I think this JIRA can be closed. 

> Spark SQL queries with OR condition is not optimized properly
> -------------------------------------------------------------
>
>                 Key: SPARK-13900
>                 URL: https://issues.apache.org/jira/browse/SPARK-13900
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Ashok kumar Rajendran
>
> I have a large table with few billions of rows and have a very small table with 4 dimensional values. All the data is stored in parquet format. I would like to get rows that match any of these dimensions. For example,
> Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = B.dimension4.
> The query plan takes this as BroadcastNestedLoopJoin and executes for very long time.
> If I execute this as Union queries, it takes around 1.5mins for each dimension. Each query internally does BroadcastHashJoin.
> Select field1, field2 from A, B where A.dimension1 = B.dimension1
> UNION ALL
> Select field1, field2 from A, B where A.dimension2 = B.dimension2
> UNION ALL
> Select field1, field2 from A, B where  A.dimension3 = B.dimension3
> UNION ALL
> Select field1, field2 from A, B where  A.dimension4 = B.dimension4.
> This is obviously not an optimal solution as it makes multiple scanning at same table but it gives result much better than OR condition. 
> Seems the SQL optimizer is not working properly which causes huge performance impact on this type of OR query.
> Please correct me if I miss anything here. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org