Posted to user@hive.apache.org by David Lerman <dl...@videoegg.com> on 2009/12/15 01:43:49 UTC

CombinedHiveInputFormat combining across tables

I'm running into errors where CombineHiveInputFormat is combining data from
two different tables, which causes problems because the tables have
different input formats.

It looks like the problem is in
org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim.  It calls
CombineFileInputFormat.getInputPaths, which returns the list of input paths,
and then chops off the first 5 characters to remove "file:" from the
beginning, but the return value I'm getting from getInputPaths is actually
hdfs://domain/path.  So when it creates the pools using these paths,
none of the input paths match the pools (since they're just the file path
without the protocol or domain).
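
To make the mismatch concrete, here is a minimal standalone sketch (plain
Java with made-up paths, not the actual shim code) comparing what dropping
the first 5 characters does against taking the URI's path component:

import java.net.URI;

public class PrefixChopSketch {
  public static void main(String[] args) {
    String[] inputs = {
        "file:/user/hive/warehouse/table_a",
        "hdfs://domain/user/hive/warehouse/table_b"
    };
    for (String s : inputs) {
      // What the shim does today, per the description above: drop 5 chars.
      String chopped = s.substring(5);
      // What actually identifies the directory: the URI's path component.
      String pathOnly = URI.create(s).getPath();
      System.out.println(chopped + "  vs  " + pathOnly);
    }
  }
}

For the file: path both forms agree, but for the hdfs:// path the chopped
string still carries //domain and never lines up with the path-only form.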

Any suggestions?

Thanks!


RE: CombinedHiveInputFormat combining across tables

Posted by xiaohexiaohe <wx...@hotmail.com>.
Sorry! We are not using Hive for now.
 
> Date: Sun, 20 Dec 2009 23:44:34 -0800
> Subject: Re: CombinedHiveInputFormat combining across tables
> From: zshao9@gmail.com
> To: hive-user@hadoop.apache.org
> 
> Sorry about the delay.
> 
> Are you using Hive trunk?
> 
> Filed https://issues.apache.org/jira/browse/HIVE-1001
> We should use (new Path(str)).getPath() instead of chopping off the
> first 5 chars.
> 
> Zheng
> 

Re: CombinedHiveInputFormat combining across tables

Posted by David Lerman <dl...@videoegg.com>.
Thanks Namit.  Filed as HIVE-1006 and HIVE-1007.


On 12/22/09 12:36 AM, "Namit Jain" <nj...@facebook.com> wrote:

> Thanks David,
> It would be very useful if you can file jiras and patches for the same.
> 
> 
> Thanks,
> -namit
> 
> 


Re: CombinedHiveInputFormat combining across tables

Posted by Namit Jain <nj...@facebook.com>.
Thanks David,
It would be very useful if you can file jiras and patches for the same.


Thanks,
-namit




Re: CombinedHiveInputFormat combining across tables

Posted by David Lerman <dl...@videoegg.com>.
Thanks Zheng.  We're using trunk, r888452.

We actually ended up making three changes to CombineHiveInputFormat.java to
get it working in our environment.  If these aren't known issues, let me
know and I can file bugs and patches in Jira.

1.  The issue mentioned below.  Along the lines you mentioned, we fixed it
by changing:

combine.createPool(job, new CombineFilter(paths[i]));

to:

combine.createPool(job,
    new CombineFilter(new Path(paths[i].toUri().getPath())));

and then getting rid of the code that strips the "file:" in
Hadoop20Shims.getInputPathsShim and having it just call
CombineFileInputFormat.getInputPaths(job);

2.  When HiveInputFormat.getPartitionDescFromPath was called from
CombineHiveInputFormat, it was sometimes failing to return a matching
partitionDesc which then caused an Exception down the line since the split
didn't have an inputFormatClassName.  The issue was that the path format
used as the key in pathToPartitionInfo varies between stages: in the first
stage it was the complete path as returned from the table definitions (e.g.
hdfs://server/path), and then in subsequent stages it was the complete path
with port (e.g. hdfs://server:8020/path) of the result of the previous stage.
This isn't a problem in HiveInputFormat since the directory you're looking
up always uses the same format as the keys, but in CombineHiveInputFormat,
you take that path and look up its children in the file system to get all
the block information, and then use one of the returned paths to get the
partition info -- and that returned path does not include the port.  So, in
any stage after the first, we were looking for a path without the port, but
all the keys in the map contained a port, so we didn't find anything.
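
To illustrate the mismatch with a standalone sketch (made-up paths and plain
java.net.URI, not Hive's classes): an exact string lookup misses because of
the port, while the path components still agree:

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PortMismatchSketch {
  public static void main(String[] args) throws Exception {
    // Keys as seen in stages after the first: they carry the port.
    Map<String, String> pathToPartitionInfo = new HashMap<String, String>();
    pathToPartitionInfo.put("hdfs://server:8020/tmp/hive/stage1_out", "partition info");

    // The directory looked up in a later stage has no port.
    String dir = "hdfs://server/tmp/hive/stage1_out";

    System.out.println(pathToPartitionInfo.get(dir));  // null: the lookup misses
    System.out.println(new URI("hdfs://server:8020/tmp/hive/stage1_out").getPath()
        .equals(new URI(dir).getPath()));              // true: the paths agree
  }
}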

Since I didn't fully understand the logic for when the port was included in
the path and when it wasn't, my hack fix was just to give
CombineHiveInputFormat its own implementation of getPartitionDescFromPath
which just walks through pathToPartitionInfo and compares using just the path:

protected static partitionDesc getPartitionDescFromPath(
    Map<String, partitionDesc> pathToPartitionInfo, Path dir)
    throws IOException {
  for (Map.Entry<String, partitionDesc> entry : pathToPartitionInfo.entrySet()) {
    try {
      // Compare only the path components, ignoring scheme, host and port.
      if (new URI(entry.getKey()).getPath().equals(dir.toUri().getPath())) {
        return entry.getValue();
      }
    } catch (URISyntaxException e2) {
      // Skip keys that are not valid URIs and keep looking.
    }
  }
  throw new IOException("cannot find dir = " + dir.toString()
      + " in partToPartitionInfo!");
}

3. In a multi-stage query, when one stage returned no data (resulting in a
bunch of output files with size 0), the next stage would hang in Hadoop
because it would have 0 mappers in the job definition.  The issue was that
CombineHiveInputFormat would look for blocks, find none, and return 0 splits
which would hang Hadoop.  There may be a good way to just skip that job
altogether, but as a quick hack to get it working, when there were no
splits, I'd just create a single empty one so that the job wouldn't hang: at
the end of getSplits, I just added:

if (result.size() == 0) {
  // No real splits were produced; add a single empty split so the job
  // still launches instead of hanging with zero mappers.
  Path firstChild =
      paths[0].getFileSystem(job).listStatus(paths[0])[0].getPath();

  CombineFileSplit emptySplit = new CombineFileSplit(
      job, new Path[] { firstChild }, new long[] { 0L }, new long[] { 0L },
      new String[0]);
  FixedCombineHiveInputSplit emptySplitWrapper =
      new FixedCombineHiveInputSplit(job,
          new Hadoop20Shims.InputSplitShim(emptySplit));

  result.add(emptySplitWrapper);
}

With those three changes, it's working beautifully -- some of our queries
which previously had thousands of mappers loading very small data files now
have a hundred or so and are running about 10x faster.  Many thanks for the
new functionality!

On 12/21/09 2:44 AM, "Zheng Shao" <zs...@gmail.com> wrote:

> Sorry about the delay.
> 
> Are you using Hive trunk?
> 
> Filed https://issues.apache.org/jira/browse/HIVE-1001
> We should use (new Path(str)).getPath() instead of chopping off the
> first 5 chars.
> 
> Zheng


Re: CombinedHiveInputFormat combining across tables

Posted by Zheng Shao <zs...@gmail.com>.
Sorry about the delay.

Are you using Hive trunk?

Filed https://issues.apache.org/jira/browse/HIVE-1001
We should use (new Path(str)).getPath() instead of chopping off the
first 5 chars.
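
Presumably that means taking the path component off the Path's URI (Path
itself does not expose a getPath() method), i.e. something along these lines
(standalone sketch with a hypothetical path; needs the Hadoop core jar on
the classpath):

import org.apache.hadoop.fs.Path;

public class StripSchemeSketch {
  public static void main(String[] args) {
    String str = "hdfs://domain/user/hive/warehouse/some_table";
    // The path component lives on the Path's URI, so go through toUri().
    String bare = new Path(str).toUri().getPath();
    System.out.println(bare);  // prints /user/hive/warehouse/some_table
  }
}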

Zheng




-- 
Yours,
Zheng