Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2022/10/21 07:32:01 UTC

[jira] [Updated] (HIVE-16498) [Tez] ReduceRecordProcessor has no check to see if all the operators are done or not and is reading complete data

     [ https://issues.apache.org/jira/browse/HIVE-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stamatis Zampetakis updated HIVE-16498:
---------------------------------------
    Fix Version/s:     (was: 1.2.0)

I cleared the fixVersion field since this ticket is not resolved. Please review this ticket and, if the fix is already committed to a specific version, set the version accordingly and mark the ticket as RESOLVED.

According to the JIRA guidelines (https://cwiki.apache.org/confluence/display/Hive/HowToContribute), the fixVersion should be set only when the issue is resolved/closed.

> [Tez] ReduceRecordProcessor has no check to see if all the operators are done or not and is reading complete data
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16498
>                 URL: https://issues.apache.org/jira/browse/HIVE-16498
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Adesh Kumar Rao
>            Priority: Major
>         Attachments: HIVE-16498.1.patch
>
>
> ReduceRecordProcessor does not check whether the reducer (Operator) is already done, so it keeps reading data that is no longer needed (a stand-alone sketch of the missing check is included at the end of this description).
> The issue can be reproduced with a reduce-side join.
> The data for large_table is generated by the following shell script; a table can then be created from the resulting file `large.txt`:
> {code:java}
> # Writes 20 million rows of the form "<i>,<j>" to large.txt.
> for (( j=1 ; j <=20; j++))
> do
>   for (( i=1; i <= 1000000; i++ ))
>   do
>     echo "$i,$j" >> large.txt
>   done
> done
> {code}
> {code:java}
> create external table large_table ( i int, j int) row format delimited fields terminated by ',' location "hdfs://<some-hdfs-location>";
> set hive.auto.convert.join=false; -- So that reduce side join is used instead of MapJoin
> select * from large_table a join large_table b on a.j = b.j limit 100;
> {code}
> Because of the missing check, the above join query keeps reading all the data from the table and does not finish in a reasonable time, in contrast to MR or even Tez with MapJoin enabled.
> For reference, the same query takes around 5-6 minutes on MR and 2-3 minutes with MapJoin on Tez.
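>
> For illustration, here is a minimal stand-alone Java sketch of the kind of done-check that is missing. It does not use Hive's actual classes: the LimitOperator and the record-pushing loop below are hypothetical stand-ins, meant only to show that without consulting the operator's done flag the loop drains the whole input even though a LIMIT 100 stopped needing rows long ago.
> {code:java}
> import java.util.Iterator;
> import java.util.stream.IntStream;
>
> public class DoneCheckSketch {
>
>     // Hypothetical LIMIT-style operator: flags itself as done after `limit` rows.
>     static class LimitOperator {
>         private final int limit;
>         private int seen = 0;
>         private boolean done = false;
>
>         LimitOperator(int limit) { this.limit = limit; }
>
>         void process(int row) {
>             if (!done && ++seen >= limit) {
>                 done = true;
>             }
>         }
>
>         boolean getDone() { return done; }
>     }
>
>     public static void main(String[] args) {
>         // Stand-in for the shuffled input of the reduce-side join (20M rows).
>         Iterator<Integer> records = IntStream.range(0, 20_000_000).iterator();
>         LimitOperator reducer = new LimitOperator(100);
>
>         long pushed = 0;
>         while (records.hasNext()) {
>             reducer.process(records.next());
>             pushed++;
>             // The missing check: stop pushing records once the operator
>             // reports that it is done. Without this break, the loop reads
>             // all 20M rows just to satisfy a LIMIT 100.
>             if (reducer.getDone()) {
>                 break;
>             }
>         }
>         System.out.println("Rows pushed before stopping: " + pushed); // prints 100
>     }
> }
> {code}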



--
This message was sent by Atlassian Jira
(v8.20.10#820010)