You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Rajesh Balamohan <rb...@apache.org> on 2022/10/03 00:04:39 UTC

Re: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Based on the plan, filtered output in map-1 had mis-estimates and also
groupby operators have large misestimates.

This is causing the number of reducers to be estimated as "4" which is less
for this query. Due to the partition factor of tez, it ends up with 8
reducer slots at runtime for hive 3.x.

Here are a few tickets which can help, but note that it is *very risky* to
backport pointed patches for stats and CBO without complete context. This
may have an adverse impact on other queries.

https://issues.apache.org/jira/browse/HIVE-23684
https://issues.apache.org/jira/browse/HIVE-20432
https://issues.apache.org/jira/browse/HIVE-23485

On Sun, Oct 2, 2022 at 1:56 PM Battula, Brahma Reddy <bb...@visa.com>
wrote:

> + Attaching the hs2 logs also.
>
>
>
> *From: *"Battula, Brahma Reddy" <bb...@visa.com>
> *Date: *Sunday, 2 October 2022 at 2:16 AM
> *To: *"user@hive.apache.org" <us...@hive.apache.org>
> *Subject: *TPCDS query degrade with hive-3.1.2 because of wrong
> estimation for reducers
>
>
>
> Hi All,
>
>
>
> We’ve ran TPCDS queries against hive-3.1.2 and trunk(little older
> version). (Attached files suffix “a” is trunk and “v” is 3.1.2)
>
>
>
> The query execution time is higher in hive-3.1.2 as number of the reducers
> estimated is less (8) as compared to trunk version where it’s 46.
>
>
>
> All the hive/tez/Yarn configs are same in both clusters. Even h/w
> resources are same. And query planner is also same.
>
>
>
> *The stats in reduce sink phase are not look same.*
>
>
>
> *HIVE_TRUNK_CODE* - 2022-09-26T05:58:23,786 INFO
> [07243354-f941-419d-8908-45009762e67d HiveServer2-Handler-Pool:
> Thread-168]: optimizer.ConvertJoinMapJoin (:()) - Join input#1;
> onlineDataSize:   9628; Statistics: Num rows:  359 Data size: 4308  Basic
> stats: COMPLETE Column stats: COMPLETE
>
> *HIVE_3.1.2_CODE* - 2022-09-27T03:39:45,116 INFO
> [2fd1493c-f1a0-4874-acac-58f28e9c21ea HiveServer2-Handler-Pool:
> Thread-134]: optimizer.ConvertJoinMapJoin (:()) - Join input#1;
> onlineDataSize: 325856; Statistics: Num rows: 8116 Data size: 97392 Basic
> stats: COMPLETE Column stats: COMPLETE
>
>
>
> Any idea how the reducers getting underestimated.?
>
>
>
>
>
>
>
>
>

Re: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Posted by "Battula, Brahma Reddy" <bb...@visa.com.INVALID>.

Hi Rajesh,

Thanks for spending time on this.
We tried applying these patches but unfortunately it didn't help



HIVE-23485: Bound GroupByOperator stats using largest NDV among columns

HIVE-23684: Large underestimation in NDV stats when input and join cardinality ratio is big

HIVE-20432: Rewrite BETWEEN to IN for integer types for stats estimation

HIVE-20537 : Multi-column joins estimates with uncorrelated columns different in CBO and Hive

HIVE-20636: Improve number of null values estimation after outer join

HIVE-20366: TPC-DS query78 stats estimates are off for is null filter

HIVE-18140: Partitioned tables statistics can go wrong in basic stats mixed case

HIVE-20660: added patch part 2

HIVE-20660: added patch part 1

There was one observation from our side which we wanted to share and get your opinion on that. One of the reduce sink operator RS[11] which got less reducer assigned actually got the data size from it parents. The chain of edges looks something like this as per log:

TS[0]-FIL[73]-SEL[2]-RS[6]-JOIN[8]-SEL[9]-GBY[10]-RS[11]-GBY[12]-SEL[13]-RS[39]-JOIN[41]-FIL[42]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-RS[50]-SEL[51]-LIM[52]-FS[53]

TS[3]-FIL[74]-SEL[5]-RS[7]-JOIN[8]

Here two stream of data from TS[0] and TS[3] get passed all the way till RS[11] and beyond. What we have observed that the CountDistinct starts getting mismatches starting from filter operators FIL[73], FIL[74], however row count mismatch(in stats) starts mismatching from join operator JOIN[8] itself and the discrepancies gets carried forward all the way till RS[11]

From 3.1.2 log

2022-10-02T07:41:52,421 DEBUG [c7021e6c-6ebc-4b44-9349-79e834cf0ca3 HiveServer2-Handler-Pool: Thread-170]: annotation.StatsRulesProcFactory (:()) - [0] STATS-JOIN[8]:  numRows: 149188770 dataSize: 15299485696 basicStatsState: COMPLETE colStatsState: COMPLETE colStats: {_col0= colName: _col0 colType: bigint countDistincts: 1610 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 28000 ] isPrimaryKey: false isEstimated: false, _col2= colName: _col2 colType: decimal(7,2) countDistincts: 2068 numNulls: 31639899 avgColLen: 112.0 numTrues: 0 numFalses: 0 Range: [ min: 0 max: 200 ] isPrimaryKey: false isEstimated: false, _col1= colName: _col1 colType: bigint countDistincts: 4 numNulls: 31626037 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 68 ] isPrimaryKey: false isEstimated: false}

From trunk

2022-09-28T19:59:45,402 DEBUG [7c0dcbda-a89e-45cb-94f9-a85ccda010e5 HiveServer2-Handler-Pool: Thread-120]: annotation.StatsRulesProcFactory (:()) - [0] STATS-JOIN[8]:  numRows: 263995240 dataSize: 29994713856 basicStatsState: COMPLETE colStatsState: COMPLETE colStats: {_col0= colName: _col0 colType: bigint countDistincts: 2848 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 28000 ] isPrimaryKey: false isEstimated: false, _col2= colName: _col2 colType: decimal(7,2) countDistincts: 3748 numNulls: 31639899 avgColLen: 112.0 numTrues: 0 numFalses: 0 Range: [ min: 0 max: 200 ] isPrimaryKey: false isEstimated: false, _col1= colName: _col1 colType: bigint countDistincts: 8 numNulls: 31626037 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 68 ] isPrimaryKey: false isEstimated: false}

Does countDistincts play a role during join operation? I am also attaching the HS2 log from trunk code for better comparison.

Regards,
Brahma

From: Rajesh Balamohan <rb...@apache.org>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Monday, 3 October 2022 at 5:35 AM
To: "user@hive.apache.org" <us...@hive.apache.org>
Cc: "dev@hive.apache.org" <de...@hive.apache.org>
Subject: Re: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Based on the plan, filtered output in map-1 had mis-estimates and also groupby operators have large misestimates.

This is causing the number of reducers to be estimated as "4" which is less for this query. Due to the partition factor of tez, it ends up with 8 reducer slots at runtime for hive 3.x.

Here are a few tickets which can help, but note that it is very risky to backport pointed patches for stats and CBO without complete context. This may have an adverse impact on other queries.

https://issues.apache.org/jira/browse/HIVE-23684<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-23684&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457516385%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Wid5oP8esFNYApmTYQ0AtVVsXZ1vyaLRMKjEZk8VLGo%3D&reserved=0>
https://issues.apache.org/jira/browse/HIVE-20432<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-20432&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457672652%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=AZcLqHcduwJCNoSrPq3XULhuKyX1EueVrso7zNFFuWY%3D&reserved=0>
https://issues.apache.org/jira/browse/HIVE-23485<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-23485&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457672652%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=fEw7VqFhGZM52jdUN%2B4ZZf2bGdloWro2XLdPxwLr2i0%3D&reserved=0>

On Sun, Oct 2, 2022 at 1:56 PM Battula, Brahma Reddy <bb...@visa.com>> wrote:
+ Attaching the hs2 logs also.

From: "Battula, Brahma Reddy" <bb...@visa.com>>
Date: Sunday, 2 October 2022 at 2:16 AM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Subject: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Hi All,

We’ve ran TPCDS queries against hive-3.1.2 and trunk(little older version). (Attached files suffix “a” is trunk and “v” is 3.1.2)

The query execution time is higher in hive-3.1.2 as number of the reducers estimated is less (8) as compared to trunk version where it’s 46.

All the hive/tez/Yarn configs are same in both clusters. Even h/w resources are same. And query planner is also same.

The stats in reduce sink phase are not look same.

HIVE_TRUNK_CODE - 2022-09-26T05:58:23,786 INFO  [07243354-f941-419d-8908-45009762e67d HiveServer2-Handler-Pool: Thread-168]: optimizer.ConvertJoinMapJoin (:()) - Join input#1; onlineDataSize:   9628; Statistics: Num rows:  359 Data size: 4308  Basic stats: COMPLETE Column stats: COMPLETE
HIVE_3.1.2_CODE - 2022-09-27T03:39:45,116 INFO  [2fd1493c-f1a0-4874-acac-58f28e9c21ea HiveServer2-Handler-Pool: Thread-134]: optimizer.ConvertJoinMapJoin (:()) - Join input#1; onlineDataSize: 325856; Statistics: Num rows: 8116 Data size: 97392 Basic stats: COMPLETE Column stats: COMPLETE

Any idea how the reducers getting underestimated.?

Re: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Posted by "Battula, Brahma Reddy" <bb...@visa.com>.

Hi Rajesh,

Thanks for spending time on this.
We tried applying these patches but unfortunately it didn't help



HIVE-23485: Bound GroupByOperator stats using largest NDV among columns

HIVE-23684: Large underestimation in NDV stats when input and join cardinality ratio is big

HIVE-20432: Rewrite BETWEEN to IN for integer types for stats estimation

HIVE-20537 : Multi-column joins estimates with uncorrelated columns different in CBO and Hive

HIVE-20636: Improve number of null values estimation after outer join

HIVE-20366: TPC-DS query78 stats estimates are off for is null filter

HIVE-18140: Partitioned tables statistics can go wrong in basic stats mixed case

HIVE-20660: added patch part 2

HIVE-20660: added patch part 1

There was one observation from our side which we wanted to share and get your opinion on that. One of the reduce sink operator RS[11] which got less reducer assigned actually got the data size from it parents. The chain of edges looks something like this as per log:

TS[0]-FIL[73]-SEL[2]-RS[6]-JOIN[8]-SEL[9]-GBY[10]-RS[11]-GBY[12]-SEL[13]-RS[39]-JOIN[41]-FIL[42]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-RS[50]-SEL[51]-LIM[52]-FS[53]

TS[3]-FIL[74]-SEL[5]-RS[7]-JOIN[8]

Here two stream of data from TS[0] and TS[3] get passed all the way till RS[11] and beyond. What we have observed that the CountDistinct starts getting mismatches starting from filter operators FIL[73], FIL[74], however row count mismatch(in stats) starts mismatching from join operator JOIN[8] itself and the discrepancies gets carried forward all the way till RS[11]

From 3.1.2 log

2022-10-02T07:41:52,421 DEBUG [c7021e6c-6ebc-4b44-9349-79e834cf0ca3 HiveServer2-Handler-Pool: Thread-170]: annotation.StatsRulesProcFactory (:()) - [0] STATS-JOIN[8]:  numRows: 149188770 dataSize: 15299485696 basicStatsState: COMPLETE colStatsState: COMPLETE colStats: {_col0= colName: _col0 colType: bigint countDistincts: 1610 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 28000 ] isPrimaryKey: false isEstimated: false, _col2= colName: _col2 colType: decimal(7,2) countDistincts: 2068 numNulls: 31639899 avgColLen: 112.0 numTrues: 0 numFalses: 0 Range: [ min: 0 max: 200 ] isPrimaryKey: false isEstimated: false, _col1= colName: _col1 colType: bigint countDistincts: 4 numNulls: 31626037 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 68 ] isPrimaryKey: false isEstimated: false}

From trunk

2022-09-28T19:59:45,402 DEBUG [7c0dcbda-a89e-45cb-94f9-a85ccda010e5 HiveServer2-Handler-Pool: Thread-120]: annotation.StatsRulesProcFactory (:()) - [0] STATS-JOIN[8]:  numRows: 263995240 dataSize: 29994713856 basicStatsState: COMPLETE colStatsState: COMPLETE colStats: {_col0= colName: _col0 colType: bigint countDistincts: 2848 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 28000 ] isPrimaryKey: false isEstimated: false, _col2= colName: _col2 colType: decimal(7,2) countDistincts: 3748 numNulls: 31639899 avgColLen: 112.0 numTrues: 0 numFalses: 0 Range: [ min: 0 max: 200 ] isPrimaryKey: false isEstimated: false, _col1= colName: _col1 colType: bigint countDistincts: 8 numNulls: 31626037 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 68 ] isPrimaryKey: false isEstimated: false}

Does countDistincts play a role during join operation? I am also attaching the HS2 log from trunk code for better comparison.

Regards,
Brahma

From: Rajesh Balamohan <rb...@apache.org>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Monday, 3 October 2022 at 5:35 AM
To: "user@hive.apache.org" <us...@hive.apache.org>
Cc: "dev@hive.apache.org" <de...@hive.apache.org>
Subject: Re: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Based on the plan, filtered output in map-1 had mis-estimates and also groupby operators have large misestimates.

This is causing the number of reducers to be estimated as "4" which is less for this query. Due to the partition factor of tez, it ends up with 8 reducer slots at runtime for hive 3.x.

Here are a few tickets which can help, but note that it is very risky to backport pointed patches for stats and CBO without complete context. This may have an adverse impact on other queries.

https://issues.apache.org/jira/browse/HIVE-23684<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-23684&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457516385%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Wid5oP8esFNYApmTYQ0AtVVsXZ1vyaLRMKjEZk8VLGo%3D&reserved=0>
https://issues.apache.org/jira/browse/HIVE-20432<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-20432&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457672652%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=AZcLqHcduwJCNoSrPq3XULhuKyX1EueVrso7zNFFuWY%3D&reserved=0>
https://issues.apache.org/jira/browse/HIVE-23485<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-23485&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457672652%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=fEw7VqFhGZM52jdUN%2B4ZZf2bGdloWro2XLdPxwLr2i0%3D&reserved=0>

On Sun, Oct 2, 2022 at 1:56 PM Battula, Brahma Reddy <bb...@visa.com>> wrote:
+ Attaching the hs2 logs also.

From: "Battula, Brahma Reddy" <bb...@visa.com>>
Date: Sunday, 2 October 2022 at 2:16 AM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Subject: TPCDS query degrade with hive-3.1.2 because of wrong estimation for reducers

Hi All,

We’ve ran TPCDS queries against hive-3.1.2 and trunk(little older version). (Attached files suffix “a” is trunk and “v” is 3.1.2)

The query execution time is higher in hive-3.1.2 as number of the reducers estimated is less (8) as compared to trunk version where it’s 46.

All the hive/tez/Yarn configs are same in both clusters. Even h/w resources are same. And query planner is also same.

The stats in reduce sink phase are not look same.

HIVE_TRUNK_CODE - 2022-09-26T05:58:23,786 INFO  [07243354-f941-419d-8908-45009762e67d HiveServer2-Handler-Pool: Thread-168]: optimizer.ConvertJoinMapJoin (:()) - Join input#1; onlineDataSize:   9628; Statistics: Num rows:  359 Data size: 4308  Basic stats: COMPLETE Column stats: COMPLETE
HIVE_3.1.2_CODE - 2022-09-27T03:39:45,116 INFO  [2fd1493c-f1a0-4874-acac-58f28e9c21ea HiveServer2-Handler-Pool: Thread-134]: optimizer.ConvertJoinMapJoin (:()) - Join input#1; onlineDataSize: 325856; Statistics: Num rows: 8116 Data size: 97392 Basic stats: COMPLETE Column stats: COMPLETE

Any idea how the reducers getting underestimated.?

Unsubscribe

Posted by Jagjeet Singh via user <us...@hive.apache.org>.