Posted to dev@hive.apache.org by "Dave Lerman (JIRA)" <ji...@apache.org> on 2010/01/01 21:03:54 UTC

[jira] Commented: (HIVE-1001) CombinedHiveInputFormat should parse the inputpath correctly

    [ https://issues.apache.org/jira/browse/HIVE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795788#action_12795788 ] 

Dave Lerman commented on HIVE-1001:
-----------------------------------

Okay, that makes sense then.  Without the patch, the two input pools have corrupt paths so the input files don't match either pool and get processed together in one pool of non-matching paths.  This yields one split and one mapper, so the merge step doesn't run (since there's only one output file).

With the patch, the pools get created correctly, so the two files are processed in separate pools, which yields two splits and two mappers, so the merge step runs.
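
The effect described above can be modelled with a small sketch. This is a simplified illustration, not the actual CombineFileInputFormat logic: the class name, the "<unmatched>" bucket, the example paths, and the assumption of one split per non-empty group are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of pool matching; not Hive or Hadoop code.
final class PoolGroupingSketch {
    static final String UNMATCHED = "<unmatched>";

    // Assigns each file to the first pool whose directory contains it.
    // Files that match no pool all share the single UNMATCHED bucket.
    static Map<String, List<String>> group(List<String> pools, List<String> files) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String file : files) {
            String key = UNMATCHED;
            for (String pool : pools) {
                if (file.startsWith(pool + "/")) {
                    key = pool;
                    break;
                }
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(file);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("/warehouse/t1/part-0", "/warehouse/t2/part-0");
        // Pool paths as they might look after a bad chop of "hdfs://domain/...".
        List<String> corruptPools = Arrays.asList("//domain/warehouse/t1", "//domain/warehouse/t2");
        List<String> correctPools = Arrays.asList("/warehouse/t1", "/warehouse/t2");

        // Assumption: each non-empty group becomes one split (and one mapper).
        System.out.println("corrupt pools -> groups: " + group(corruptPools, files).size());
        System.out.println("correct pools -> groups: " + group(correctPools, files).size());
    }
}
```

With the corrupt pool paths, both files fall into the shared unmatched bucket, giving a single group. With the correct paths, each file lands in its own pool.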

Thanks for the help.

> CombinedHiveInputFormat should parse the inputpath correctly
> ------------------------------------------------------------
>
>                 Key: HIVE-1001
>                 URL: https://issues.apache.org/jira/browse/HIVE-1001
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>             Fix For: 0.5.0
>
>         Attachments: hive.1001.1.patch
>
>
> From David Lerman:
> "
> I'm running into errors where CombinedHiveInputFormat is combining data from
> two different tables which is causing problems because the tables have
> different input formats.
> It looks like the problem is in
> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim.  It calls
> CombineFileInputFormat.getInputPaths which returns the list of input paths
> and then chops off the first 5 characters to remove file: from the
> beginning, but the return value I'm getting from getInputPaths is actually
> hdfs://domain/path.  So then when it creates the pools using these paths,
> none of the input paths match the pools (since they're just the file path
> without protocol or domain).
> "
> We should use Path.toUri().getPath() to get the path part of a URI instead of just chopping off the first 5 chars.
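
As a rough illustration of the difference between the two approaches in the description, here is a minimal sketch that uses only java.net.URI from the JDK rather than the Hadoop Path class. The class and method names are hypothetical, and the "file:" example path is made up; the "hdfs://domain/path" value is the one quoted in the report.

```java
import java.net.URI;

// Hypothetical demo class; this is not the Hive shim code itself.
final class InputPathDemo {
    // Mimics the behaviour described in the report: drop the first 5
    // characters, which is only correct when the string starts with "file:".
    static String chopFiveChars(String inputPath) {
        return inputPath.substring(5);
    }

    // Parses the string as a URI and returns only its path component,
    // whatever the scheme or authority happens to be.
    static String pathComponent(String inputPath) {
        return URI.create(inputPath).getPath();
    }

    public static void main(String[] args) {
        String local = "file:/tmp/warehouse/t1";
        String remote = "hdfs://domain/path";

        System.out.println(chopFiveChars(local));   // path survives intact
        System.out.println(chopFiveChars(remote));  // leftover "//domain" prefix
        System.out.println(pathComponent(local));
        System.out.println(pathComponent(remote));
    }
}
```

Chopping a fixed number of characters leaves the authority attached for the hdfs URI, while parsing the URI yields the bare path in both cases.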

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.