Posted to issues@hive.apache.org by "Adam Szita (JIRA)" <ji...@apache.org> on 2018/11/20 12:51:00 UTC

[jira] [Commented] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

    [ https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693177#comment-16693177 ] 

Adam Szita commented on HIVE-20330:
-----------------------------------

Attached [^HIVE-20330.0.patch]

The change in this patch is that instead of serializing and putting a single InputJobInfo into the JobConf, we now always append to a list of InputJobInfo instances (creating the list on the first occurrence).
This ensures that when multiple tables serve as inputs to a job, Pig can retrieve information for each of them, not just the last one added.
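For illustration, the new behaviour amounts to something like the following (a minimal sketch, not the patch verbatim; {{appendInputJobInfo}} is a made-up helper name here, while {{HCatUtil.serialize}}/{{deserialize}} and {{HCatConstants.HCAT_KEY_JOB_INFO}} are the existing HCatalog names):

{code:java}
import java.io.IOException;
import java.util.LinkedList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.common.HCatConstants;
import org.apache.hive.hcatalog.common.HCatUtil;
import org.apache.hive.hcatalog.mapreduce.InputJobInfo;

// Sketch only: appendInputJobInfo is a hypothetical helper; the real change
// lives in HCatalog's input-setup path.
private static void appendInputJobInfo(Configuration conf, InputJobInfo inputJobInfo)
    throws IOException {
  String serialized = conf.get(HCatConstants.HCAT_KEY_JOB_INFO);
  @SuppressWarnings("unchecked")
  LinkedList<InputJobInfo> inputJobInfos = (serialized == null)
      ? new LinkedList<InputJobInfo>()    // first occurrence: create the list
      : (LinkedList<InputJobInfo>) HCatUtil.deserialize(serialized);
  inputJobInfos.add(inputJobInfo);        // every further input table: append
  conf.set(HCatConstants.HCAT_KEY_JOB_INFO, HCatUtil.serialize(inputJobInfos));
}
{code}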

I've also discovered a bug in InputJobInfo.writeObject(), where the ObjectOutputStream was closed by mistake after the partition information was written in compressed form. Closing the compression stream inevitably closed the underlying OOS as well and prevented any further objects from being written to it; this had to be fixed because it broke serializing InputJobInfo instances inside a list.
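The fix amounts to finishing the compression stream instead of closing it. A minimal sketch of the idea (variable names are illustrative; {{partitions}} stands for the partition list field of InputJobInfo):

{code:java}
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Custom serialization hook: partition info can be large, so it is written
// through a Deflater. DeflaterOutputStream.finish() flushes the compressed
// data WITHOUT closing the underlying ObjectOutputStream, so the caller can
// keep writing further objects (e.g. the remaining list elements).
private void writeObject(ObjectOutputStream oos) throws IOException {
  oos.defaultWriteObject();
  Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
  DeflaterOutputStream compressed = new DeflaterOutputStream(oos, deflater);
  ObjectOutputStream partitionWriter = new ObjectOutputStream(compressed);
  partitionWriter.writeObject(partitions);
  partitionWriter.flush();
  compressed.finish(); // previously partitionWriter.close(), which also closed oos
}
{code}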

> HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-20330
>                 URL: https://issues.apache.org/jira/browse/HIVE-20330
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: HIVE-20330.0.patch
>
>
> While running performance tests on Pig (0.12 and 0.17), we've observed a huge performance drop in workloads that have multiple inputs from HCatLoader.
> The reason is that for an MR job with multiple Hive tables as input, Pig calls {{setLocation}} on each {{LoaderFunc (HCatLoader)}} instance, but only one table's information (InputJobInfo instance) gets tracked in the JobConf (under the config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Each such call overwrites the preexisting value, so only the last table's information is considered when Pig calls {{getStatistics}} to estimate the required reducer count; see the sketch at the end of this description.
> In a case with two input tables, 256GB and 1MB in size respectively, Pig will query the size information from HCat for both of them, but it will see either 1MB+1MB=2MB or 256GB+256GB=0.5TB, depending on the input order in the execution plan's DAG.
> It should of course see 256.00097GB in total and accordingly use 257 reducers by default (at Pig's default of 1GB per reducer).
> In unlucky cases this will be seen as 2MB, and 1 reducer will have to struggle with the actual 256.00097GB...
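> A minimal sketch of where the value gets clobbered (the exact call site is in HCatalog's input-setup code; the names match the constants mentioned above):
> {code:java}
> // Each setLocation() ultimately does essentially this; a plain set()
> // replaces whatever a previous input table stored under the same key:
> jobConf.set(HCatConstants.HCAT_KEY_JOB_INFO, HCatUtil.serialize(inputJobInfo));
> {code}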



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)