Posted to dev@hive.apache.org by Roberto Congiu <ro...@openx.org> on 2010/02/01 20:13:47 UTC

HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

Hi guys,
I wrote a patch to use MultiFileInputFormat in hadoop 0.19, where
CombineFileInputFormat is not available.
We needed this for a system that had too many small files, which causes too
many mappers/reducers,
which in turn causes a lot of overhead...

---------- Forwarded message ----------
From: Roberto Congiu <ro...@openx.org>
Date: Mon, Feb 1, 2010 at 11:02 AM
Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
To: Namit Jain <nj...@facebook.com>
Cc: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>


Reviving this old thread...just found the time to work on this...
I have a patch for using MultiFileInputFormat in hadoop 0.19 as
CombineHiveInputFormat - setting
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
(or the equivalent setting in hive-site.xml) will have hive use
MultiFileInputFormat, packing many small files into
mapred.multifileinputformat.splits splits (if set), or otherwise estimating
the number of splits by dividing the total input size by the DFS block size.
Patch attached...I checked that it passes all unit tests according to
http://wiki.apache.org/hadoop/Hive/HowToContribute#Setting_up_Eclipse_Development_Environment_.28Optional.29
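
To make the split-count rule above concrete, here is a minimal sketch in plain Java. It is not the code from the patch and does not call any Hadoop API: the class SplitSketch and the methods estimateSplits and pack are hypothetical names made up for illustration, and the rounding-up and the "at least one split" floor are my assumptions. It only shows the idea of honouring a configured split count when one is set, falling back to total input size divided by block size, and then grouping small files in order into that many splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not taken from the patch
// and not part of Hadoop or Hive.
final class SplitSketch {

    // Returns configuredSplits when it is positive (i.e. the setting is present);
    // otherwise estimates ceil(totalBytes / blockSize), never less than 1.
    static int estimateSplits(long totalBytes, long blockSize, int configuredSplits) {
        if (configuredSplits > 0) {
            return configuredSplits;
        }
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        long estimate = (totalBytes + blockSize - 1) / blockSize;
        return (int) Math.max(1L, Math.min(estimate, (long) Integer.MAX_VALUE));
    }

    // Groups file indices, in order, into at most numSplits groups whose
    // byte totals are roughly total / numSplits. Data locality is ignored.
    static List<List<Integer>> pack(long[] fileSizes, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be at least 1");
        }
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        long target = Math.max(1L, (total + numSplits - 1) / numSplits);

        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < fileSizes.length; i++) {
            // Close the current group once it has reached the target size,
            // unless it is already the last group we are allowed to create.
            if (!current.isEmpty() && currentBytes >= target
                    && splits.size() < numSplits - 1) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += fileSizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With no configured count, 300 bytes of input and a 128-byte block size would give three splits under these assumptions; each split would then be handled by one mapper instead of one mapper per file.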



On Wed, Sep 30, 2009 at 4:34 AM, Namit Jain <nj...@facebook.com> wrote:

>  That's right
>
>
>
>
> On 9/30/09 12:07 AM, "Roberto Congiu" <ro...@openx.org> wrote:
>
> Hi Namit,
> that's what I thought. Right now unfortunately we can't migrate to 0.20.
> I realize we lose data locality but as you said, it would still be
> considerably better than now.
>
> I had a look at the shim code, shouldn't be difficult since it would
> be basically mimicking CombineFileInputFormat.
>
> Once I add the appropriate logic to the shim, I have to set
> hive.input.format to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive
> actually use it, right ?
>
> Roberto
>
> 2009/9/29 Namit Jain <nj...@facebook.com>:
> > Hi Roberto,
> >
> > Talked with Raghu and Dhruba - it is possible to do so using
> > MultiFileInputFormat,
> > but the performance will not be very good since MultiFileInputFormat does
> > not provide any locality. However, it will still be much better than the
> > problem you are running into right now.
> >
> > Can you move to hadoop-0.20 ? That might be simpler.
> >
> > If not, you can definitely implement the shim using MultiFileInputFormat
> > for 0.19 (which should work even with 0.17). Do you need some help in
> > understanding the current shim code ?
> >
> > Thanks,
> > -namit
> >
> >
> >
> >
> >
> > On 9/29/09 10:53 AM, "Namit Jain" <nj...@facebook.com> wrote:
> >
> > Just checked - CombineFileInputFormat and a lot of other related stuff
> > went to hadoop 0.20
> > So, it would be very difficult to add this for 0.19
> >
> >
> >
> > From: Namit Jain [mailto:njain@facebook.com]
> > Sent: Monday, September 28, 2009 10:30 PM
> > To: hive-user@hadoop.apache.org; roberto.congiu@openx.org
> > Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
> >
> > I am not sure whether CombineFileInputFormat (in hadoop) is available in
> > 0.19 - if it is, we can add it, otherwise it will be very difficult.
> >
> >
> >
> > On 9/28/09 7:06 PM, "Raghu Murthy" <rm...@facebook.com> wrote:
> > Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
> > hadoop-0.19?
> >
> > On 9/28/09 6:57 PM, "Roberto Congiu" <ro...@openx.org> wrote:
> >
> >> Hi guys,
> >> I've been working on integrating hive with a legacy file format we use
> >> here. I wrote the appropriate InputFormat and SerDe and everything
> >> works, but it's painfully slow.
> >> The reason is that the files I am reading are many and hive uses one
> >> mapper for every file.
> >> I saw the HIVE-74 patches but those use CombineFileInputFormat which
> >> is available on hadoop 0.20...but we use 0.19. Is there any reason the
> >> same goal could not be achieved using the deprecated (but present in
> >> versions < 0.20) MultiFileInputFormat ?
> >>
> >> Thanks,
> >> Roberto
> >
> >
> >
>
>