Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 12:55:00 UTC

[jira] [Resolved] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

     [ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-29335.
----------------------------------
    Resolution: Invalid

Questions should go to the mailing list or Stack Overflow; you are likely to get a better answer there than here.

> Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-29335
>                 URL: https://issues.apache.org/jira/browse/SPARK-29335
>             Project: Spark
>          Issue Type: Question
>          Components: Optimizer
>    Affects Versions: 2.3.0
>         Environment: We tried to execute the same query using spark-sql and through the Thrift server with SQLWorkbench, but in neither case are the stats used.
>            Reporter: Srini E
>            Priority: Major
>         Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO to get better query plans for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver.
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinReorder.enabled true
> {code}
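>  
> As a sanity check, the effective values can be confirmed from the same spark-sql session before running anything else. This is a minimal check; note that conf keys are case-sensitive, so a key set as spark.sql.cbo.joinreorder.enabled may be silently ignored:
> {code}
> -- SET with a key and no value prints the current value
> spark-sql> SET spark.sql.cbo.enabled;
> spark-sql> SET spark.sql.cbo.joinReorder.enabled;
> {code}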
>  
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, ptype, absref, fpid, pid;
> analyze table arrow.t_fdata_sundar compute statistics;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, absref;
> {code}
> Analyze completed successfully.
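>  
> The per-column statistics can be checked directly from spark-sql (assuming Spark 2.3 behavior, where DESCRIBE EXTENDED with a column name prints min, max, num_nulls and distinct_count once column stats exist):
> {code}
> -- column-level stats collected by ANALYZE ... FOR COLUMNS
> spark-sql> describe extended arrow.t_fperiods_sundar eid;
> spark-sql> describe extended arrow.t_fdata_sundar absref;
> {code}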
> However, DESCRIBE EXTENDED does not show column-level stats data, and queries are not leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
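>  
> Table-level statistics are persisted as table properties in the metastore, so they should also be visible there (a hedged cross-check; when ANALYZE succeeds, Spark writes properties such as spark.sql.statistics.numRows and spark.sql.statistics.totalSize):
> {code}
> -- stats stored by Spark in the Hive metastore as table properties
> spark-sql> show tblproperties arrow.t_fperiods_sundar;
> {code}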
> *When we run queries through spark-sql, we do not see the stats being used in the explain plan, and we are not sure whether CBO is actually in effect.*
> *A quick response would be helpful.*
> *Explain Plan:*
> The following EXPLAIN output does not reference any statistics usage.
>  
> {code}
> spark-sql> explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 2017),(ptype#4546 = A),(eid#4542 = 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<FPID: decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(absref#4569),(absref#4569 = Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<IMNEM: string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 = 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546] parquet
>  +- SubqueryAlias a13
>  +- SubqueryAlias t_fdata_sundar
>  +- Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569] parquet
>  
> == Optimized Logical Plan ==
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as decimal(38,0))))
>  :- Project [FPID#4523, EID#4542, PTYPE#4546]
>  : +- Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) && isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542 = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
>  : +- Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546] parquet
>  +- Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
>  +- Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) && isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
>  +- Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569] parquet
>  
> == Physical Plan ==
> *(2) Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- *(2) BroadcastHashJoin [eid#4542, fpid#4523], [eid#4566, cast(fpid#4567 as decimal(38,0))], Inner, BuildLeft
>  :- BroadcastExchange HashedRelationBroadcastMode(List(input[1, decimal(10,0), true], input[0, decimal(38,0), true]))
>  : +- *(1) Project [FPID#4523, EID#4542, PTYPE#4546]
>  : +- *(1) Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) && isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542 = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
>  : +- *(1) FileScan parquet arrow.t_fperiods_sundar[FPID#4523,PID#4527,EID#4542,YEAR#4545,PTYPE#4546] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FPERIODS_S..., PartitionFilters: [], PushedFilters: [IsNotNull(PTYPE), IsNotNull(YEAR), IsNotNull(EID), EqualTo(YEAR,2017), EqualTo(PTYPE,A), EqualTo..., ReadSchema: struct<FPID:decimal(38,0),PID:string,EID:decimal(10,0),YEAR:int,PTYPE:string>
>  +- *(2) Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
>  +- *(2) Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) && isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
>  +- *(2) FileScan parquet arrow.t_fdata_sundar[IMNEM#4548,FVALUE#4552,EID#4566,FPID#4567,ABSREF#4569] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FDATA_SUNDAR], PartitionFilters: [], PushedFilters: [IsNotNull(ABSREF), EqualTo(ABSREF,Y2017), IsNotNull(FPID), IsNotNull(EID), EqualTo(EID,29940)], ReadSchema: struct<IMNEM:string,FVALUE:string,EID:decimal(10,0),FPID:decimal(10,0),ABSREF:string>
> Time taken: 0.35 seconds, Fetched 1 row(s)
> {code}
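>  
> Note that plain EXPLAIN EXTENDED does not print statistics at all. One way to see the sizes and row counts the optimizer is actually working with (and hence whether CBO stats are picked up) is EXPLAIN COST, which to my knowledge is available since Spark 2.3 and annotates the optimized logical plan with sizeInBytes and, when stats exist, rowCount:
> {code}
> -- same query as above, but asking the optimizer to print its cost statistics
> explain cost
> select a13.imnem, a13.fvalue, a12.ptype, a13.absref
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13
> where a12.eid = a13.eid and a12.fpid = a13.fpid
>   and a13.absref = 'Y2017' and a12.year = 2017
>   and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL;
> {code}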



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
