Posted to issues@hive.apache.org by "Wei Zhang (JIRA)" <ji...@apache.org> on 2019/06/24 12:25:00 UTC

[jira] [Commented] (HIVE-21915) Hive with TEZ UNION ALL and UDTF results in data loss

    [ https://issues.apache.org/jira/browse/HIVE-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871119#comment-16871119 ] 

Wei Zhang commented on HIVE-21915:
----------------------------------

I just added a patch for this issue. Could anybody help review the code?

> Hive with TEZ UNION ALL and UDTF results in data loss
> -----------------------------------------------------
>
>                 Key: HIVE-21915
>                 URL: https://issues.apache.org/jira/browse/HIVE-21915
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1
>            Reporter: Wei Zhang
>            Assignee: Wei Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: hive-21915-2019-06-24.patch
>
>
> The HQL looks like this:
> CREATE TEMPORARY TABLE tez_union_all_loss_data AS
> SELECT xxx, yyy, zzz,1 as tag
> FROM ods_1
> UNION ALL
> SELECT xxx, yyy, zzz, tag
> FROM
> (
> SELECT xxx
> ,get_json_object(get_json_object(tb,'$.a'),'$.b') AS yyy
> ,zzz
> ,2 as tag
> FROM ods_2
> LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
> ) tbl 
> ;
>  
> With the above HQL, we expect rows with both tag = 1 and tag = 2 to appear. In our case, however, all the rows with tag = 1 are lost.
> Digging deeper, we found that the two generated maps have identical task tmp paths. This happens because, when a UDTF is present, the FileSinkOperator is processed twice while generating the tmp path in GenTezUtils.removeUnionOperators().
>  
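To illustrate how one operator can end up being processed twice, here is a minimal toy sketch. It is not the code in GenTezUtils and not the attached patch. It assumes that the plan for the LATERAL VIEW branch forks into two sub-branches that rejoin before the file sink, so a walk without a visited set reaches the sink along both paths. The names Op, visits, and sinkVisits are hypothetical and exist only for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OperatorWalkSketch {

    /** Toy operator node (hypothetical; not a Hive class). */
    static final class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) { this.name = name; }

        /** Adds child as a downstream operator and returns it for chaining. */
        Op to(Op child) {
            children.add(child);
            return child;
        }
    }

    /** Walks the graph from root and counts how often target is handled. */
    static int visits(Op root, Op target, boolean dedupe) {
        Deque<Op> pending = new ArrayDeque<>();
        Set<Op> seen = new HashSet<>();
        pending.push(root);
        int count = 0;
        while (!pending.isEmpty()) {
            Op current = pending.pop();
            if (dedupe && !seen.add(current)) {
                continue; // already handled via the other branch
            }
            if (current == target) {
                count++; // stand-in for "generate the tmp path for this sink"
            }
            for (Op child : current.children) {
                pending.push(child);
            }
        }
        return count;
    }

    /** Builds scan -> fork -> {select, udtf} -> join -> sink and counts sink visits. */
    static int sinkVisits(boolean dedupe) {
        Op scan = new Op("scan");
        Op fork = scan.to(new Op("fork"));
        Op join = new Op("join");
        fork.to(new Op("select")).to(join);
        fork.to(new Op("udtf")).to(join);
        Op sink = join.to(new Op("sink"));
        return visits(scan, sink, dedupe);
    }

    public static void main(String[] args) {
        System.out.println("naive walk:   " + sinkVisits(false));
        System.out.println("deduped walk: " + sinkVisits(true));
    }
}
```

In this toy graph the naive walk handles the sink twice, while the walk that tracks visited operators handles it once. Any path-generation step that is not idempotent would therefore run twice for the branch containing the UDTF in the naive case.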



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)