Posted to user@hive.apache.org by Igor Tatarinov <ig...@decide.com> on 2013/08/20 23:29:32 UTC

single output file per partition?

What's the best way to enforce a single output file per partition?

INSERT OVERWRITE TABLE <table>
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
force a single reducer per partition, but that didn't work. I still got
multiple files per partition.
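
For readers following along, the attempt described above might look like the sketch below. The table and column names (dst, src, a) are hypothetical stand-ins for the elided parts of the query. In Hive, CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same columns, so rows with the same (x, y, z) values are routed to the same reducer. The sketch only restates the query; it does not explain why multiple files per partition still appeared.

```sql
-- Hypothetical tables: src(a, x, y, z) and dst(a) partitioned by (x, y, z).
-- Dynamic partitioning must be enabled to use PARTITION (x, y, z) without values.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE dst
PARTITION (x, y, z)
SELECT a, x, y, z          -- partition columns go last, in PARTITION order
FROM src
CLUSTER BY x, y, z;        -- same as DISTRIBUTE BY x, y, z SORT BY x, y, z
```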

Do I have to use a single reduce task? With a few TB of data that's
probably not a good idea.

My current idea is to create a temp table with the same partitioning
structure. Insert into that table first and then select * from that table
into the output table. With combineinputformat=true that should work right?

Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
Will that work with a partitioned table?

Thanks!
igor
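
The merge option raised in the question can be sketched as the session settings below. This is only an illustration of the knobs involved, not a confirmed fix: a later reply in this thread reports that enforcing the small-files merge still left multiple output files. The size values are illustrative assumptions, not tuned recommendations.

```sql
-- Sketch only: settings that control Hive's post-job file merge.
SET hive.merge.mapfiles=true;       -- merge small files at the end of a map-only job
SET hive.merge.mapredfiles=true;    -- merge small files at the end of a map-reduce job
SET hive.merge.size.per.task=256000000;      -- target size of merged files, in bytes
SET hive.merge.smallfiles.avgsize=16000000;  -- merge runs when the average output file size is below this
```

Note that the merge step targets a file size rather than a file count, so a partition larger than the target size may still end up with more than one file.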

Re: single output file per partition?

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

Hi

I tried file crusher with LZO but it does not work. I have LZO correctly configured in production, and my jobs run daily using LZO compression.

I like Crusher, so I will see why it's not working. Thanks to Edward, the code is there to tweak :-) and test locally.


sanjay


From: Stephen Sprague <sp...@gmail.com>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Wednesday, August 21, 2013 12:07 PM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: single output file per partition?

I see.  I'll have to punt then.  However, there is an after-the-fact file crusher that Ed Capriolo wrote a while back here:  https://github.com/edwardcapriolo/filecrush YMMV


On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov <ig...@decide.com> wrote:
Using a single bucket per partition seems to create a single reducer which is too slow.
I've tried enforcing small files merge but that didn't work. I still got multiple output files.

Creating a temp table and then "combining" the multiple files into one using a simple select * is the only option that seems to work. It's odd that I have to create the temp table but I don't see a workaround.


On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <sp...@gmail.com> wrote:
hi igor,
lots of ideas there!  I can't speak for them all but let me confirm first that "cluster by X into 1 bucket" didn't work?  I would have thought that would have done it.




On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <ig...@decide.com> wrote:
What's the best way to enforce a single output file per partition?

INSERT OVERWRITE TABLE <table>
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would force a single reducer per partition, but that didn't work. I still got multiple files per partition.

Do I have to use a single reduce task? With a few TB of data that's probably not a good idea.

My current idea is to create a temp table with the same partitioning structure. Insert into that table first and then select * from that table into the output table. With combineinputformat=true that should work right?

Or should I make Hive merge output files instead? (using hive.merge.mapfiles) Will that work with a partitioned table?

Thanks!
igor





Re: single output file per partition?

Posted by Stephen Sprague <sp...@gmail.com>.

I see.  I'll have to punt then.  However, there is an after-the-fact file
crusher that Ed Capriolo wrote a while back here:
https://github.com/edwardcapriolo/filecrush YMMV



Re: single output file per partition?

Posted by Igor Tatarinov <ig...@decide.com>.

Actually, using a temp table doesn't work either. Apparently, a single
mapper can read from multiple partitions (and output multiple files). There
is no way to force a single mapper per partition.



Re: single output file per partition?

Posted by Igor Tatarinov <ig...@decide.com>.

Using a single bucket per partition seems to create a single reducer which
is too slow.
I've tried enforcing small files merge but that didn't work. I still got
multiple output files.

Creating a temp table and then "combining" the multiple files into one
using a simple select * is the only option that seems to work. It's odd
that I have to create the temp table but I don't see a workaround.
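
The temp-table approach described above can be sketched roughly as follows. The table and column names (src, staging, dst, a) are hypothetical, and the input-format setting is an assumption about what "combineinputformat=true" refers to. A follow-up in this thread reports that this approach does not reliably produce one file per partition, because a single mapper can read from several partitions.

```sql
-- Hypothetical tables: staging and dst share the same columns and
-- the same partitioning (x, y, z); src holds the raw rows.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Assumption: let one mapper read several small input files.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Step 1: write the query result into the staging table (may produce many files).
INSERT OVERWRITE TABLE staging PARTITION (x, y, z)
SELECT a, x, y, z FROM src;

-- Step 2: copy the staging table into the output table with a plain select *.
INSERT OVERWRITE TABLE dst PARTITION (x, y, z)
SELECT * FROM staging;
```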



Re: single output file per partition?

Posted by Stephen Sprague <sp...@gmail.com>.

hi igor,
lots of ideas there!  I can't speak for them all but let me confirm first
that "cluster by X into 1 bucket" didn't work?  I would have thought that
would have done it.
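
A minimal sketch of what "cluster by X into 1 bucket" could mean in table-definition terms is shown below. The table name, the column a, and the STRING types are hypothetical placeholders. A later reply in this thread reports that a single bucket per partition seems to create a single reducer, which was too slow for this data volume.

```sql
-- Hypothetical table with one bucket per partition.
CREATE TABLE dst_bucketed (a STRING)
PARTITIONED BY (x STRING, y STRING, z STRING)
CLUSTERED BY (a) INTO 1 BUCKETS;

-- On Hive versions that have this setting, it makes inserts honor the bucket count.
SET hive.enforce.bucketing=true;
```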



