Posted to user@hive.apache.org by Dave Brondsema <db...@geek.net> on 2010/11/10 19:05:11 UTC

Re: Merging small files with dynamic partitions

Hi, has there been any resolution to this?  I'm having the same trouble.
 With Hive 0.6 and Hadoop 0.18 and a dynamic partition
insert, hive.merge.mapredfiles doesn't work.  It works fine for a static
partition insert.  What I'm seeing is that even when I
set hive.merge.mapredfiles=true, the jobconf has it as false for the dynamic
partition insert.

I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it looks
like maybe Hadoop 0.20 is required for this?
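
For reference, a minimal sketch of the merge settings in play (the values are
examples, not a recommendation; hive.mergejob.maponly is the piece that depends
on CombineHiveInputFormat and hence on Hadoop 0.20):

SET hive.merge.mapfiles=true;        -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles=true;     -- merge small files from map-reduce jobs (the flag that shows up as false here)
SET hive.mergejob.maponly=true;      -- run the merge as a map-only job; needs CombineHiveInputFormat (Hadoop 0.20+)
SET hive.merge.size.per.task=256000000;      -- target size, in bytes, of each merged file
SET hive.merge.smallfiles.avgsize=16000000;  -- merge only if the average output file is smaller than this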

Thanks,

On Sat, Oct 16, 2010 at 1:50 AM, Sammy Yu <sy...@brightedge.com> wrote:

> Hi guys,
>   Thanks for the response.   I tried running without
> hive.mergejob.maponly with the same result.  I've attached the explain
> extended output.  I am running this query on EC2 boxes; however, it's
> not running on EMR.  Hive is running on top of a Hadoop 0.20.2 setup.
>
> Thanks,
> Sammy
>
> On Fri, Oct 15, 2010 at 5:58 PM, Ning Zhang <nz...@facebook.com> wrote:
> > The output file shows it only has 2 jobs (the mapreduce job and the move
> > task). This indicates that the plan does not have merge enabled. Merge
> > should consist of a ConditionalTask and 2 sub-tasks (an MR task and a move
> > task). Can you send the plan of the query?
> >
> > One thing I noticed is that you are using Amazon EMR. I'm not sure if
> > this is enabled there, since SET hive.mergejob.maponly=true requires
> > CombineHiveInputFormat (only available in Hadoop 0.20, and someone reported
> > that some distributions of Hadoop don't support it). So an additional thing
> > you can try is to remove this setting.
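
A quick way to produce the plan Ning is asking for (a sketch, reusing the query
quoted further down in this thread): prefix the insert with EXPLAIN EXTENDED and
look at the stage list.

EXPLAIN EXTENDED
INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
PARTITION(org_id, day)
SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
       referral_type, search_engine, us_search_engine,
       keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
       pages_viewed, entry_page, page_types,
       org_id, day
FROM daily_conversions_without_rank_table;
-- With merging enabled, the stage list should include a conditional stage whose
-- two children are a merge job and a move task, in addition to the main
-- map-reduce stage; if those stages are missing, the merge was dropped at
-- compile time.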
> >
> > On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:
> >
> >> Hi,
> >>  I have a dynamic partition query which generates quite a few small
> >> files which I would like to merge:
> >>
> >> SET hive.exec.dynamic.partition.mode=nonstrict;
> >> SET hive.exec.dynamic.partition=true;
> >> SET hive.exec.compress.output=true;
> >> SET io.seqfile.compression.type=BLOCK;
> >> SET hive.merge.size.per.task=256000000;
> >> SET hive.merge.smallfiles.avgsize=16000000000;
> >> SET hive.merge.mapfiles=true;
> >> SET hive.merge.mapredfiles=true;
> >> SET hive.mergejob.maponly=true;
> >> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
> >> PARTITION(org_id, day)
> >> SELECT session_id, permanent_id, first_date, last_date, week, month,
> quarter,
> >> referral_type, search_engine, us_search_engine,
> >> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
> >> pages_viewed,
> >> entry_page, page_types,
> >> org_id, day
> >> FROM daily_conversions_without_rank_table;
> >>
> >> I am running the latest version from trunk with HIVE-1622, but it
> >> seems like I just can't get the post merge process to happen. I have
> >> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
> >> at runtime is causing the merge process to be skipped.  Attached are
> >> the hive output and log files.
> >>
> >>
> >> Thanks,
> >> Sammy
> >> <hive_output.txt><hive_job_log_root_201010151114_2037492391.txt>
> >
> >
>
>
>
> --
> Chief Architect, BrightEdge
> email: syu@brightedge.com   |   mobile: 650.539.4867  |   fax:
> 650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
> 94404
>



-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net

Re: Merging small files with dynamic partitions

Posted by Dave Brondsema <db...@geek.net>.
I copied Hadoop19Shims' implementation of getCombineFileInputFormat
(HIVE-1121) into Hadoop18Shims and it worked, if anyone is interested.

And hopefully we can upgrade our Hadoop version soon :)
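
One way to double-check that the merge actually ran (a sketch; the warehouse path
and partition values below are placeholders, not taken from this thread): list a
partition directory before and after the change and compare the file counts.

dfs -ls /user/hive/warehouse/daily_conversions_without_rank_all_table/org_id=123/day=2010-11-10;
-- Without the merge job, a partition can hold up to one file per reducer that
-- wrote rows into it; with hive.merge.mapredfiles taking effect, it should
-- collapse to a few files of roughly hive.merge.size.per.task bytes each.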

-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net

Re: Merging small files with dynamic partitions

Posted by Dave Brondsema <db...@geek.net>.
It seems that I can't use this with Hadoop 0.18 since the
Hadoop18Shims.getCombineFileInputFormat returns null, and
SemanticAnalyzer.java sets HIVEMERGEMAPREDFILES to false if
CombineFileInputFormat is not supported.  Is that right?  Maybe I can copy
the Hadoop19Shims implementation of getCombineFileInputFormat into
Hadoop18Shims?
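
If the merge path really can't be enabled on Hadoop 0.18, one query-level
workaround (a sketch; not from this thread, just a common way to cut the file
count) is to route all rows for a target partition to the same reducer with
DISTRIBUTE BY, so each partition is written by one reducer rather than by all
of them. Run it with the same SET hive.exec.dynamic.partition* settings as the
original query:

INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
PARTITION(org_id, day)
SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
       referral_type, search_engine, us_search_engine,
       keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
       pages_viewed, entry_page, page_types,
       org_id, day
FROM daily_conversions_without_rank_table
DISTRIBUTE BY org_id, day;
-- Each (org_id, day) partition is now written by a single reducer, so it ends
-- up with one output file rather than one per reducer; the trade-off is less
-- parallelism when a few partitions are much larger than the rest.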

-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net

Re: Merging small files with dynamic partitions

Posted by yongqiang he <he...@gmail.com>.
I think the problem was solved in hive trunk. You can just try hive trunk.
