Posted to issues@hive.apache.org by "Youjun Yuan (Jira)" <ji...@apache.org> on 2022/04/04 11:21:00 UTC

[jira] [Commented] (HIVE-25863) join result is null

    [ https://issues.apache.org/jira/browse/HIVE-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516752#comment-17516752 ] 

Youjun Yuan commented on HIVE-25863:
------------------------------------

I hit a similar issue here: https://issues.apache.org/jira/browse/HIVE-26111. A full join returns wrong values, while if I set the number of reducers to 1, the result is correct.
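For anyone hitting this in the meantime, a minimal workaround sketch based on the observation above (it trades speed for correctness by keeping the whole join in one shuffle partition):

{code:java}
-- Workaround sketch: pin the job to a single reducer so the join
-- output cannot be split incorrectly across shuffle partitions.
set mapreduce.job.reduces=1;
-- then re-run the affected join query
{code}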

> join result is null
> -------------------
>
>                 Key: HIVE-25863
>                 URL: https://issues.apache.org/jira/browse/HIVE-25863
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.1.2
>         Environment: hadoop 3.2.1
> hive 3.1.2
>            Reporter: zengxl
>            Priority: Blocker
>
> When I change the number of reducers, the query result changes. This happens with both inner join and left join: some of the join results from the {color:#de350b}third table{color} (pdwd.test_sds_2021_12_21_shuffle_1_new) come back as {color:#de350b}null{color}. When there is only one reducer, all results are correct.
> With set hive.exec.reducers.bytes.per.reducer=256000 there is only one reducer;
> with set hive.exec.reducers.bytes.per.reducer=2560 there are four reducers.
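> For context, Hive estimates the reducer count roughly as ceil(total input bytes / hive.exec.reducers.bytes.per.reducer), capped by hive.exec.reducers.max, which is why shrinking the per-reducer byte budget multiplies the reducers for the same input:
> {code:java}
> -- Reducer-count estimate (approximate; the actual logic also applies
> -- hive.exec.reducers.max as an upper bound):
> --   reducers = ceil(total_input_bytes / bytes_per_reducer)
> set hive.exec.reducers.bytes.per.reducer=256000;  -- large budget -> 1 reducer (correct results)
> set hive.exec.reducers.bytes.per.reducer=2560;    -- tiny budget  -> 4 reducers (wrong results)
> {code}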
> Here is my SQL and data:
> {code:java}
> CREATE TABLE pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112(
>   tbl_id bigint COMMENT 'TBL_ID', 
>   tbl_create_time bigint COMMENT 'TBL_CREATE_TIME', 
>   db_id bigint COMMENT 'DB_ID', 
>   tbl_last_access_time bigint COMMENT 'TBL_LAST_ACCESS_TIME', 
>   owner string COMMENT 'OWNER', 
>   retention bigint COMMENT 'RETENTION', 
>   sd_id bigint COMMENT 'SD_ID', 
>   tbl_name string COMMENT 'TBL_NAME', 
>   tbl_type string COMMENT 'TBL_TYPE', 
>   view_expanded_text string COMMENT 'VIEW_EXPANDED_TEXT', 
>   view_original_text string COMMENT 'VIEW_ORIGINAL_TEXT', 
>   is_rewrite_enabled bigint COMMENT 'IS_REWRITE_ENABLED', 
>   tbl_owner_type string COMMENT 'TBL_OWNER_TYPE', 
>   cd_id bigint COMMENT 'CD_ID', 
>   input_format string COMMENT 'INPUT_FORMAT', 
>   is_compressed bigint COMMENT 'IS_COMPRESSED', 
>   is_storedassubdirectories bigint COMMENT 'IS_STOREDASSUBDIRECTORIES', 
>   tbl_or_part_location string COMMENT 'tbl_or_part_location', 
>   num_buckets bigint COMMENT 'NUM_BUCKETS', 
>   output_format string COMMENT 'OUTPUT_FORMAT', 
>   serde_id bigint COMMENT 'SERDE_ID', 
>   part_id bigint COMMENT 'PART_ID', 
>   part_create_time bigint COMMENT 'PART_CREATE_TIME', 
>   part_last_access_time bigint COMMENT 'PART_LAST_ACCESS_TIME', 
>   part_name string COMMENT 'PART_NAME')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
>   
> CREATE TABLE pdwd.test_partitions_2021_12_21_shuffle_1(
>   part_id bigint, 
>   create_time bigint, 
>   last_access_time bigint, 
>   part_name string, 
>   sd_id bigint, 
>   tbl_id bigint)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
>   
> CREATE TABLE pdwd.test_tbls_2021_12_21_shuffle(
>   tbl_id bigint, 
>   create_time bigint, 
>   db_id bigint, 
>   last_access_time bigint, 
>   owner string, 
>   retention bigint, 
>   sd_id bigint, 
>   tbl_name string, 
>   tbl_type string, 
>   view_expanded_text string, 
>   view_original_text string, 
>   is_rewrite_enabled bigint, 
>   owner_type string)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
>   
> CREATE TABLE pdwd.test_sds_2021_12_21_shuffle_1_new(
>   sd_id bigint, 
>   cd_id bigint, 
>   input_format string, 
>   is_compressed bigint, 
>   is_storedassubdirectories bigint, 
>   _c5 string, 
>   num_buckets bigint, 
>   output_format string, 
>   serde_id bigint)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
> set hive.stats.column.autogather=false;
> set hive.exec.reducers.bytes.per.reducer=2560;
> set hive.auto.convert.join=false;  
> insert overwrite table pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112
> select
> a.tbl_id,
> b.create_time as tbl_create_time,
> b.db_id,
> b.last_access_time as tbl_last_access_time,
> b.owner,
> b.retention,
> a.sd_id,
> b.tbl_name,
> b.tbl_type,
> b.view_expanded_text,
> b.view_original_text,
> b.is_rewrite_enabled,
> b.owner_type as tbl_owner_type,
> d.cd_id,
> d.input_format,
> d.is_compressed,
> d.is_storedassubdirectories,
> d.tbl_location,
> d.num_buckets,
> d.output_format,
> d.serde_id,
> a.part_id,
> a.create_time as part_create_time,
> a.last_access_time as part_last_access_time,
> a.part_name
> from pdwd.test_partitions_2021_12_21_shuffle_1 a
> left join pdwd.test_tbls_2021_12_21_shuffle b on a.tbl_id=b.tbl_id
> left join pdwd.test_sds_2021_12_21_shuffle_1_new d on a.sd_id=d.sd_id; {code}
> Incorrect execution result:
> {code:java}
> select count(*) from pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112 where output_format is  null;
> Query ID = op_20220112153733_5c3793c8-c0e0-4dda-9212-239b5bd66f19
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> 2022-01-12 15:37:35,019 | INFO | org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider | Failing over to rm2
> Starting Job = job_1609738754049_4759167, Tracking URL = http://nm-bigdata-030049043.ctc.local:8088/proxy/application_1609738754049_4759167/
> Kill Command = /usr/local/hadoop3/bin/mapred job  -kill job_1609738754049_4759167
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2022-01-12 15:37:46,636 Stage-1 map = 0%,  reduce = 0%
> 2022-01-12 15:37:56,921 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.29 sec
> 2022-01-12 15:38:06,183 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.74 sec
> MapReduce Total cumulative CPU time: 8 seconds 740 msec
> Ended Job = job_1609738754049_4759167
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.74 sec   HDFS Read: 35145 HDFS Write: 196 SUCCESS
> Total MapReduce CPU Time Spent: 8 seconds 740 msec
> OK
> 70 {code}
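> A diagnostic sketch to isolate the bad rows (a query I suggest, not part of the original run): every sd_id returned below has a match in the third table, so its output_format should never be null after the left join.
> {code:java}
> -- Rows whose sd_id matches the sds table but still came out null:
> select t.part_id, t.sd_id
> from pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112 t
> join pdwd.test_sds_2021_12_21_shuffle_1_new d on t.sd_id = d.sd_id
> where t.output_format is null;
> {code}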
> Changing the number of reduce tasks with set hive.exec.reducers.bytes.per.reducer=256000 (one reducer):
> {code:java}
> set hive.stats.column.autogather=false;
> set hive.exec.reducers.bytes.per.reducer=256000;
> set hive.auto.convert.join=false;  
> insert overwrite table pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112
> select
> a.tbl_id,
> b.create_time as tbl_create_time,
> b.db_id,
> b.last_access_time as tbl_last_access_time,
> b.owner,
> b.retention,
> a.sd_id,
> b.tbl_name,
> b.tbl_type,
> b.view_expanded_text,
> b.view_original_text,
> b.is_rewrite_enabled,
> b.owner_type as tbl_owner_type,
> d.cd_id,
> d.input_format,
> d.is_compressed,
> d.is_storedassubdirectories,
> d.tbl_location,
> d.num_buckets,
> d.output_format,
> d.serde_id,
> a.part_id,
> a.create_time as part_create_time,
> a.last_access_time as part_last_access_time,
> a.part_name
> from pdwd.test_partitions_2021_12_21_shuffle_1 a
> left join pdwd.test_tbls_2021_12_21_shuffle b on a.tbl_id=b.tbl_id
> left join pdwd.test_sds_2021_12_21_shuffle_1_new d on a.sd_id=d.sd_id; {code}
> The execution result is correct:
> {code:java}
> select count(*) from pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112 where output_format is  null;
> Query ID = op_20220112154536_a607937d-3457-44a2-9b27-a955d67dfec7
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> 2022-01-12 15:45:40,045 | INFO | org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider | Failing over to rm2
> Starting Job = job_1609738754049_4759214, Tracking URL = http://nm-bigdata-030049043.ctc.local:8088/proxy/application_1609738754049_4759214/
> Kill Command = /usr/local/hadoop3/bin/mapred job  -kill job_1609738754049_4759214
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2022-01-12 15:46:02,957 Stage-1 map = 0%,  reduce = 0%
> 2022-01-12 15:46:14,343 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.3 sec
> 2022-01-12 15:46:23,618 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.55 sec
> MapReduce Total cumulative CPU time: 7 seconds 550 msec
> Ended Job = job_1609738754049_4759214
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.55 sec   HDFS Read: 27631 HDFS Write: 195 SUCCESS
> Total MapReduce CPU Time Spent: 7 seconds 550 msec
> OK
> 0 {code}
> When the join with pdwd.test_sds_2021_12_21_shuffle_1_new is changed from a {color:#de350b}left{color} join to an {color:#de350b}inner{color} join, the execution result is wrong again:
> {code:java}
> set hive.stats.column.autogather=false;
> set hive.exec.reducers.bytes.per.reducer=2560;
> set hive.auto.convert.join=false;  
> insert overwrite table pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112
> select
> a.tbl_id,
> b.create_time as tbl_create_time,
> b.db_id,
> b.last_access_time as tbl_last_access_time,
> b.owner,
> b.retention,
> a.sd_id,
> b.tbl_name,
> b.tbl_type,
> b.view_expanded_text,
> b.view_original_text,
> b.is_rewrite_enabled,
> b.owner_type as tbl_owner_type,
> d.cd_id,
> d.input_format,
> d.is_compressed,
> d.is_storedassubdirectories,
> d.tbl_location,
> d.num_buckets,
> d.output_format,
> d.serde_id,
> a.part_id,
> a.create_time as part_create_time,
> a.last_access_time as part_last_access_time,
> a.part_name
> from pdwd.test_partitions_2021_12_21_shuffle_1 a
> left join pdwd.test_tbls_2021_12_21_shuffle b on a.tbl_id=b.tbl_id
> inner join pdwd.test_sds_2021_12_21_shuffle_1_new d on a.sd_id=d.sd_id; {code}
> Incorrect {color:#de350b}inner{color} join result (hive.compute.query.using.stats is disabled below so that count(*) actually scans the table instead of being answered from statistics):
> {code:java}
> set hive.compute.query.using.stats=false; 
> select count(*) from pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112;
> Query ID = op_20220112155120_f65d0552-df18-4221-9c31-e1ac21e551f0
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> 2022-01-12 15:51:21,254 | INFO | org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider | Failing over to rm2
> Starting Job = job_1609738754049_4759248, Tracking URL = http://nm-bigdata-030049043.ctc.local:8088/proxy/application_1609738754049_4759248/
> Kill Command = /usr/local/hadoop3/bin/mapred job  -kill job_1609738754049_4759248
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2022-01-12 15:51:29,530 Stage-1 map = 0%,  reduce = 0%
> 2022-01-12 15:51:38,743 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.69 sec
> 2022-01-12 15:51:46,950 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.51 sec
> MapReduce Total cumulative CPU time: 6 seconds 510 msec
> Ended Job = job_1609738754049_4759248
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.51 sec   HDFS Read: 30277 HDFS Write: 196 SUCCESS
> Total MapReduce CPU Time Spent: 6 seconds 510 msec
> OK
> 23{code}
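> A sanity-check sketch (my addition, under the same session settings): the same three-way join as a plain count(*), which can be compared across reducer settings without rewriting the target table.
> {code:java}
> -- Inner-join row count computed directly from the source tables:
> select count(*)
> from pdwd.test_partitions_2021_12_21_shuffle_1 a
> left join pdwd.test_tbls_2021_12_21_shuffle b on a.tbl_id=b.tbl_id
> inner join pdwd.test_sds_2021_12_21_shuffle_1_new d on a.sd_id=d.sd_id;
> {code}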
> {color:#de350b}Inner join{color}: changing the number of reduce tasks with set hive.exec.reducers.bytes.per.reducer=256000 (one reducer):
> {code:java}
> set hive.stats.column.autogather=false;
> set hive.exec.reducers.bytes.per.reducer=256000;
> set hive.auto.convert.join=false;  
> insert overwrite table pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112
> select
> a.tbl_id,
> b.create_time as tbl_create_time,
> b.db_id,
> b.last_access_time as tbl_last_access_time,
> b.owner,
> b.retention,
> a.sd_id,
> b.tbl_name,
> b.tbl_type,
> b.view_expanded_text,
> b.view_original_text,
> b.is_rewrite_enabled,
> b.owner_type as tbl_owner_type,
> d.cd_id,
> d.input_format,
> d.is_compressed,
> d.is_storedassubdirectories,
> d.tbl_location,
> d.num_buckets,
> d.output_format,
> d.serde_id,
> a.part_id,
> a.create_time as part_create_time,
> a.last_access_time as part_last_access_time,
> a.part_name
> from pdwd.test_partitions_2021_12_21_shuffle_1 a
> left join pdwd.test_tbls_2021_12_21_shuffle b on a.tbl_id=b.tbl_id
> inner join pdwd.test_sds_2021_12_21_shuffle_1_new d on a.sd_id=d.sd_id; {code}
> The result is correct:
> {code:java}
> set hive.compute.query.using.stats=false;  select count(*) from pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112;
> Query ID = op_20220112155608_d146f24e-5515-42c0-a4b7-3217e5de9f47
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> 2022-01-12 15:56:09,652 | INFO | org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider | Failing over to rm2
> Starting Job = job_1609738754049_4759266, Tracking URL = http://nm-bigdata-030049043.ctc.local:8088/proxy/application_1609738754049_4759266/
> Kill Command = /usr/local/hadoop3/bin/mapred job  -kill job_1609738754049_4759266
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2022-01-12 15:56:17,838 Stage-1 map = 0%,  reduce = 0%
> 2022-01-12 15:56:30,150 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.08 sec
> 2022-01-12 15:56:39,391 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.0 sec
> MapReduce Total cumulative CPU time: 9 seconds 0 msec
> Ended Job = job_1609738754049_4759266
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.0 sec   HDFS Read: 26377 HDFS Write: 196 SUCCESS
> Total MapReduce CPU Time Spent: 9 seconds 0 msec
> OK
> 93{code}
> This happens with both inner join and left join. By varying the number of reducers, I found that only a fraction of the results were correct.
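> A repeatable comparison sketch (pdwd.join_result_ref is a hypothetical helper table, not part of the report): materialize a trusted copy with one reducer, then diff the multi-reducer output against it; any rows returned are shuffle-dependent.
> {code:java}
> -- 1) Trusted reference, written with a single reducer:
> set mapreduce.job.reduces=1;
> create table pdwd.join_result_ref as
> select a.part_id, a.sd_id, d.output_format
> from pdwd.test_partitions_2021_12_21_shuffle_1 a
> left join pdwd.test_tbls_2021_12_21_shuffle b on a.tbl_id=b.tbl_id
> left join pdwd.test_sds_2021_12_21_shuffle_1_new d on a.sd_id=d.sd_id;
>
> -- 2) Rows in the multi-reducer output that differ from the reference:
> select part_id, sd_id, output_format
> from pdwd.hive_ah3_metastore_tbl_partitions_sds_test_20220112
> except
> select part_id, sd_id, output_format from pdwd.join_result_ref;
> {code}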


