Posted to user@hive.apache.org by Namit Jain <nj...@facebook.com> on 2010/02/01 23:31:14 UTC

RE: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

I will take a look.
It would be great if you could file a JIRA and attach the patch to it.

From: Roberto Congiu [mailto:roberto.congiu@openx.org]
Sent: Monday, February 01, 2010 11:02 AM
To: Namit Jain
Cc: hive-user@hadoop.apache.org
Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

Reviving this old thread... I just found the time to work on this.
I have a patch that uses MultiFileInputFormat on hadoop 0.19 as the backing for CombineHiveInputFormat. Setting
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
(or the equivalent setting in hive-site.xml) makes hive use MultiFileInputFormat, packing many small files into
mapred.multifileinputformat.splits splits (if set), or otherwise guessing the number of splits by dividing the total input size by the DFS block size.
Patch attached. I checked that it passes all unit tests, following http://wiki.apache.org/hadoop/Hive/HowToContribute#Setting_up_Eclipse_Development_Environment_.28Optional.29
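
For illustration only, and not taken from the attached patch: a minimal Java sketch of the split-count fallback described above. The class name SplitCountSketch and the method chooseNumSplits are hypothetical, and rounding up and enforcing a minimum of one split are assumptions; the real shim may behave differently.

```java
// Hypothetical sketch; NOT the code from the patch.
class SplitCountSketch {

    /**
     * Picks how many splits to pack the input files into.
     *
     * @param configuredSplits value of mapred.multifileinputformat.splits,
     *                         or 0 (or negative) when it is not set
     * @param totalInputBytes  combined size of all input files
     * @param blockSizeBytes   DFS block size
     */
    static int chooseNumSplits(int configuredSplits, long totalInputBytes, long blockSizeBytes) {
        if (configuredSplits > 0) {
            // An explicit setting wins over the guess.
            return configuredSplits;
        }
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        // Assumption: round up so a partial block still gets a split.
        long guess = (totalInputBytes + blockSizeBytes - 1) / blockSizeBytes;
        // Assumption: never return fewer than one split.
        return (int) Math.max(1L, Math.min(guess, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;        // 64 MB block
        long totalInput = 10000L * 1024 * 1024;    // e.g. 10,000 files of 1 MB each
        System.out.println(chooseNumSplits(0, totalInput, blockSize));   // prints 157
        System.out.println(chooseNumSplits(40, totalInput, blockSize));  // prints 40
    }
}
```

With 10,000 one-megabyte files and a 64 MB block size, this guess gives 157 splits instead of 10,000 mappers, which is the kind of reduction this thread is after.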



On Wed, Sep 30, 2009 at 4:34 AM, Namit Jain <nj...@facebook.com> wrote:
That's right




On 9/30/09 12:07 AM, "Roberto Congiu" <ro...@openx.org> wrote:
Hi Namit,
that's what I thought. Right now unfortunately we can't migrate to 0.20.
I realize we lose data locality but as you said, it would still be
considerably better than now.

I had a look at the shim code, shouldn't be difficult since it would
be basically mimicking CombineFileInputFormat.

Once I add the appropriate logic to the shim, I have to set
hive.input.format to
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive
actually use it, right ?

Roberto

2009/9/29 Namit Jain <nj...@facebook.com>:
> Hi Roberto,
>
> Talked with Raghu and Dhruba - it is possible to do so using
> MultiFileInputFormat, but the performance will not be very good since
> MultiFileInputFormat does not provide any locality. However, it will still
> be much better than the problem you are running into right now.
>
> Can you move to hadoop-0.20? That might be simpler.
>
> If not, you can definitely implement the shim using MultiFileInputFormat for
> 0.19 (which should work even with 0.17). Do you need some help in
> understanding the current shim code?
>
> Thanks,
> -namit
>
>
>
>
>
> On 9/29/09 10:53 AM, "Namit Jain" <nj...@facebook.com> wrote:
>
> Just checked - CombineFileInputFormat and a lot of other related code were
> added in hadoop 0.20, so it would be very difficult to add this for 0.19.
>
>
>
> From: Namit Jain [mailto:njain@facebook.com]
> Sent: Monday, September 28, 2009 10:30 PM
> To: hive-user@hadoop.apache.org; roberto.congiu@openx.org
> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>
> I am not sure whether CombineFileInputFormat (in hadoop) is available in
> 0.19 -
> If it is, we can add it, otherwise it will be very difficult.
>
>
>
> On 9/28/09 7:06 PM, "Raghu Murthy" <rm...@facebook.com> wrote:
> Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
> hadoop-0.19?
>
> On 9/28/09 6:57 PM, "Roberto Congiu" <ro...@openx.org> wrote:
>
>> Hi guys,
>> I've been working on integrating hive with a legacy file format we use
>> here. I wrote the appropriate InputFormat and SerDe and everything
>> works, but it's painfully slow.
>> The reason is that the files I am reading are many and hive uses one
>> mapper for every file.
>> I saw the HIVE-74 patches but those use CombineFileInputFormat which
>> is available on hadoop 0.20...but we use 0.19. Is there any reason the
>> same goal could not be achieved using the deprecated (but present  <
>> 0.20) MultiFileInputFormat ?
>>
>> Thanks,
>> Roberto
>
>
>


RE: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

Posted by Namit Jain <nj...@facebook.com>.
I filed a JIRA and merged it in yesterday. Currently, it is only for hadoop 0.19.

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com] 
Sent: Tuesday, February 02, 2010 8:24 AM
To: hive-user@hadoop.apache.org
Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

On Mon, Feb 1, 2010 at 5:31 PM, Namit Jain <nj...@facebook.com> wrote:
> I will take a look -
>
> It will be great if you can file a jira and add a patch for that
>
>
>
> From: Roberto Congiu [mailto:roberto.congiu@openx.org]
> Sent: Monday, February 01, 2010 11:02 AM
> To: Namit Jain
> Cc: hive-user@hadoop.apache.org
> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>
>
>
> Reviving this old thread...just found the time to work on this...
>
> I have a patch for using MultiFileInputFormat in hadoop 0.19 as
> CombineHiveInputFormat - setting
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
> (or the equivalent setting on hive-site.xml) will have hive use
> MultiFileInputFormat, packing many small files in
>
> mapred.multifileinputformat.splits splits (if set), or guessing the size by
> dividing the total input size by the DFS block size.
>
> Patch attached...I checked that it passes all unit tests according
> to http://wiki.apache.org/hadoop/Hive/HowToContribute#Setting_up_Eclipse_Development_Environment_.28Optional.29
>
>
>
>
>
>
>
> On Wed, Sep 30, 2009 at 4:34 AM, Namit Jain <nj...@facebook.com> wrote:
>
> That's right
>
>
>
> On 9/30/09 12:07 AM, "Roberto Congiu" <ro...@openx.org> wrote:
>
> Hi Namit,
> that's what I thought. Right now unfortunately we can't migrate to 0.20.
> I realize we lose data locality but as you said, it would still be
> considerably better than now.
>
> I had a look at the shim code, shouldn't be difficult since it would
> be basically mimicking CombineFileInputFormat.
>
> Once I add the appropriate logic to the shim, I have to set
> hive.input.format to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive
> actually use it, right ?
>
> Roberto
>
> 2009/9/29 Namit Jain <nj...@facebook.com>:
>> Hi Roberto,
>>
>> Talked with Raghu and Dhruba - it is possible to do so using
>> MultiFileInputFormat, but the performance will not be very good since
>> MultiFileInputFormat does not provide any locality. However, it will still
>> be much better than the problem you are running into right now.
>>
>> Can you move to hadoop-0.20? That might be simpler.
>>
>> If not, you can definitely implement the shim using MultiFileInputFormat
>> for 0.19 (which should work even with 0.17). Do you need some help in
>> understanding the current shim code?
>>
>> Thanks,
>> -namit
>>
>>
>>
>>
>>
>> On 9/29/09 10:53 AM, "Namit Jain" <nj...@facebook.com> wrote:
>>
>> Just checked - CombineFileInputFormat and a lot of other related stuff
>> went
>> to hadoop 0.20
>> So, it would be very difficult to add this for 0.19
>>
>>
>>
>> From: Namit Jain [mailto:njain@facebook.com]
>> Sent: Monday, September 28, 2009 10:30 PM
>> To: hive-user@hadoop.apache.org; roberto.congiu@openx.org
>> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>>
>> I am not sure whether CombineFileInputFormat (in hadoop) is available in
>> 0.19 -
>> If it is, we can add it, otherwise it will be very difficult.
>>
>>
>>
>> On 9/28/09 7:06 PM, "Raghu Murthy" <rm...@facebook.com> wrote:
>> Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
>> hadoop-0.19?
>>
>> On 9/28/09 6:57 PM, "Roberto Congiu" <ro...@openx.org> wrote:
>>
>>> Hi guys,
>>> I've been working on integrating hive with a legacy file format we use
>>> here. I wrote the appropriate InputFormat and SerDe and everything
>>> works, but it's painfully slow.
>>> The reason is that the files I am reading are many and hive uses one
>>> mapper for every file.
>>> I saw the HIVE-74 patches but those use CombineFileInputFormat which
>>> is available on hadoop 0.20...but we use 0.19. Is there any reason the
>>> same goal could not be achieved using the deprecated (but present  <
>>> 0.20) MultiFileInputFormat ?
>>>
>>> Thanks,
>>> Roberto
>>
>>
>>
>
>

Has this been implemented for the 0.18 shims as well?

Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Feb 1, 2010 at 5:31 PM, Namit Jain <nj...@facebook.com> wrote:
> I will take a look –
>
> It will be great if you can file a jira and add a patch for that
>
>
>
> From: Roberto Congiu [mailto:roberto.congiu@openx.org]
> Sent: Monday, February 01, 2010 11:02 AM
> To: Namit Jain
> Cc: hive-user@hadoop.apache.org
> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>
>
>
> Reviving this old thread...just found the time to work on this...
>
> I have a patch for using MultiFileInputFormat in hadoop 0.19 as
> CombineHiveInputFormat - setting
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
> (or the equivalent setting on hive-site.xml) will have hive use
> MultiFileInputFormat, packing many small files in
>
> mapred.multifileinputformat.splits splits (if set), or guessing the size by
> dividing the total input size by the DFS block size.
>
> Patch attached...I checked that it passes all unit tests according
> to http://wiki.apache.org/hadoop/Hive/HowToContribute#Setting_up_Eclipse_Development_Environment_.28Optional.29
>
>
>
>
>
>
>
> On Wed, Sep 30, 2009 at 4:34 AM, Namit Jain <nj...@facebook.com> wrote:
>
> That’s right
>
>
>
> On 9/30/09 12:07 AM, "Roberto Congiu" <ro...@openx.org> wrote:
>
> Hi Namit,
> that's what I thought. Right now unfortunately we can't migrate to 0.20.
> I realize we lose data locality but as you said, it would still be
> considerably better than now.
>
> I had a look at the shim code, shouldn't be difficult since it would
> be basically mimicking CombineFileInputFormat.
>
> Once I add the appropriate logic to the shim, I have to set
> hive.input.format to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive
> actually use it, right ?
>
> Roberto
>
> 2009/9/29 Namit Jain <nj...@facebook.com>:
>> Hi Roberto,
>>
>> Talked with Raghu and Dhruba – it is possible to do so using
>> MultiFileInputFormat, but the performance will not be very good since
>> MultiFileInputFormat does not provide any locality. However, it will still
>> be much better than the problem you are running into right now.
>>
>> Can you move to hadoop-0.20? That might be simpler.
>>
>> If not, you can definitely implement the shim using MultiFileInputFormat
>> for 0.19 (which should work even with 0.17). Do you need some help in
>> understanding the current shim code?
>>
>> Thanks,
>> -namit
>>
>>
>>
>>
>>
>> On 9/29/09 10:53 AM, "Namit Jain" <nj...@facebook.com> wrote:
>>
>> Just checked – CombineFileInputFormat and a lot of other related stuff
>> went
>> to hadoop 0.20
>> So, it would be very difficult to add this for 0.19
>>
>>
>>
>> From: Namit Jain [mailto:njain@facebook.com]
>> Sent: Monday, September 28, 2009 10:30 PM
>> To: hive-user@hadoop.apache.org; roberto.congiu@openx.org
>> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>>
>> I am not sure whether CombineFileInputFormat (in hadoop) is available in
>> 0.19 -
>> If it is, we can add it, otherwise it will be very difficult.
>>
>>
>>
>> On 9/28/09 7:06 PM, "Raghu Murthy" <rm...@facebook.com> wrote:
>> Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
>> hadoop-0.19?
>>
>> On 9/28/09 6:57 PM, "Roberto Congiu" <ro...@openx.org> wrote:
>>
>>> Hi guys,
>>> I've been working on integrating hive with a legacy file format we use
>>> here. I wrote the appropriate InputFormat and SerDe and everything
>>> works, but it's painfully slow.
>>> The reason is that the files I am reading are many and hive uses one
>>> mapper for every file.
>>> I saw the HIVE-74 patches but those use CombineFileInputFormat which
>>> is available on hadoop 0.20...but we use 0.19. Is there any reason the
>>> same goal could not be achieved using the deprecated (but present  <
>>> 0.20) MultiFileInputFormat ?
>>>
>>> Thanks,
>>> Roberto
>>
>>
>>
>
>

Has this been implemented for the 0.18 shims as well?