Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/10/05 05:27:00 UTC

[jira] [Assigned] (SPARK-36926) Discrepancy in Q22 of TPCH for Spark 3.2

     [ https://issues.apache.org/jira/browse/SPARK-36926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36926:
------------------------------------

    Assignee:     (was: Apache Spark)

> Discrepancy in Q22 of TPCH for Spark 3.2
> ----------------------------------------
>
>                 Key: SPARK-36926
>                 URL: https://issues.apache.org/jira/browse/SPARK-36926
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Aravind Patnam
>            Priority: Critical
>
> When running TPC-H at scale factor 100 against Spark 3.2, Query 22 shows a discrepancy in the number of rows returned by the query. This was tested with AQE both on and off; all the other queries returned matching results. Below are the results we got when testing Q22 on 3.2:
>  
> {code:java}
>   "results": [
>     {
>       "name": "Q22",
>       "mode": "collect",
>       "parameters": {},
>       "joinTypes": [
>         "SortMergeJoin"
>       ],
>       "tables": [
>         "customer"
>       ],
>       "parsingTime": 0.016522,
>       "analysisTime": 0.004132,
>       "optimizationTime": 39.173868,
>       "planningTime": 23.10939,
>       "executionTime": 13762.183844,
>       "result": 0,
>       "breakDown": [],
>       "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC NULLS FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n   +- 'SubqueryAlias custsale\n      +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, 'c_acctbal]\n         +- 'Filter (('substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT exists#149 [])\n            :  :- 'Project [unresolvedalias('avg('c_acctbal), None)]\n            :  :  +- 'Filter (('c_acctbal > 0.00) AND 'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     +- 'UnresolvedRelation [customer], [], false\n            :  +- 'Project [*]\n            :     +- 'Filter ('o_custkey = 'c_custkey)\n            :        +- 'UnresolvedRelation [orders], [], false\n            +- 'UnresolvedRelation [customer], [], false\n\n== Analyzed Logical Plan ==\ncntrycode: string, numcust: bigint, totacctbal: decimal(22,2)\nSort [cntrycode#147 ASC NULLS FIRST], true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, sum(c_acctbal#11) AS totacctbal#151]\n   +- SubqueryAlias custsale\n      +- Project [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n         +- Filter ((substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17) AND (cast(c_acctbal#11 as decimal(16,6)) > cast(scalar-subquery#148 [] as decimal(16,6)))) AND NOT exists#149 [c_custkey#6L])\n            :  :- Aggregate [avg(c_acctbal#160) AS avg(c_acctbal)#154]\n            :  :  +- Filter ((cast(c_acctbal#160 as decimal(12,2)) > cast(0.00 as decimal(12,2))) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n            :  :        +- Relation tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162] orc\n            :  +- Project [o_orderkey#16L, o_custkey#17L, o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, o_shippriority#22, o_comment#23, o_orderdate#24]\n            :     +- Filter (o_custkey#17L = outer(c_custkey#6L))\n            :        +- SubqueryAlias spark_catalog.tpch_data_orc_100.orders\n            :           +- Relation tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24] orc\n            +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n               +- Relation tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13] orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, sum(c_acctbal#11) AS totacctbal#151]\n   +- Project [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n      +- Join LeftAnti, (o_custkey#17L = c_custkey#6L)\n         :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n         :  +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > scalar-subquery#148 []))\n         :     :  +- Aggregate [avg(c_acctbal#160) AS avg(c_acctbal)#154]\n         :     :     +- Project [c_acctbal#160]\n         :     :        +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > 0.00) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n       
  :     :           +- Relation tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162] orc\n         :     +- Relation tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13] orc\n         +- Project [o_custkey#17L]\n            +- Relation tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24] orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, 0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#314]\n   +- *(6) HashAggregate(keys=[cntrycode#147], functions=[count(1), sum(c_acctbal#11)], output=[cntrycode#147, numcust#150L, totacctbal#151])\n      +- Exchange hashpartitioning(cntrycode#147, 200), ENSURE_REQUIREMENTS, [id=#310]\n         +- *(5) HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, isEmpty#172])\n            +- *(5) Project [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n               +- *(5) SortMergeJoin [c_custkey#6L], [o_custkey#17L], LeftAnti\n                  :- *(2) Sort [c_custkey#6L ASC NULLS FIRST], false, 0\n                  :  +- Exchange hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n                  :     +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n                  :        +- *(1) Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n                  :           :  +- Subquery scalar-subquery#148, [id=#249]\n                  :           :     +- *(2) HashAggregate(keys=[], functions=[avg(c_acctbal#160)], output=[avg(c_acctbal)#154])\n                  :           :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#245]\n                  :           :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(c_acctbal#160)], output=[sum#175, count#176L])\n                  :           :              +- *(1) Project [c_acctbal#160]\n                  :           :                 +- *(1) Filter ((isnotnull(c_acctbal#160) AND (c_acctbal#160 > 0.00)) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n                  :           :                    +- *(1) ColumnarToRow\n                  :           :                       +- FileScan orc tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] Batched: true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > 0.00), substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: struct<c_phone:string,c_acctbal:decimal(12,2)>\n                  :           +- *(1) ColumnarToRow\n                  :              +- FileScan orc tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13] Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n                  +- *(4) Sort 
[o_custkey#17L ASC NULLS FIRST], false, 0\n                     +- Exchange hashpartitioning(o_custkey#17L, 200), ENSURE_REQUIREMENTS, [id=#301]\n                        +- *(3) Project [o_custkey#17L]\n                           +- *(3) ColumnarToRow\n                              +- FileScan orc tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<o_custkey:bigint>\n"
>     }
>   ]
> }
> {code}
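>
> For reference, here is a reconstruction of the Q22 text under test, consistent with the parsed logical plan above (this is the standard TPC-H Q22 shape; the country codes 13, 31, 23, 29, 30, 18, 17 are the substitution values visible in the plan):
> {code:sql}
> select
>   cntrycode,
>   count(*) as numcust,
>   sum(c_acctbal) as totacctbal
> from (
>   select
>     substring(c_phone, 1, 2) as cntrycode,
>     c_acctbal
>   from customer
>   where substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17')
>     -- balance above the average positive balance in the same country-code set
>     and c_acctbal > (
>       select avg(c_acctbal)
>       from customer
>       where c_acctbal > 0.00
>         and substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17')
>     )
>     -- restricted to customers with no orders at all
>     and not exists (
>       select * from orders where o_custkey = c_custkey
>     )
> ) as custsale
> group by cntrycode
> order by cntrycode;
> {code}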
> The correct result should be 7 rows, not the 0 returned above. This can be confirmed by running the exact same query against the Hive table using Presto or Hive, both of which return 7 rows. The Impala test workload also shows an expected result of 7: [https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test]
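>
> A minimal sketch for re-checking both AQE settings from a spark-sql session (assuming the tpch_data_orc_100 database shown in the plans above is available; spark.sql.adaptive.enabled is Spark's standard AQE switch):
> {code:sql}
> USE tpch_data_orc_100;
>
> -- AQE off
> SET spark.sql.adaptive.enabled=false;
> -- (run the Q22 text above: this run returned 0 rows instead of the expected 7)
>
> -- AQE on
> SET spark.sql.adaptive.enabled=true;
> -- (run the Q22 text above: the same 0-row result was observed)
> {code}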



