Posted to user@pig.apache.org by Stan Rosenberg <sr...@proclivitysystems.com> on 2011/10/04 05:09:34 UTC

output partitioning

Hi,

I'd like to store the output relation partitioned by

Re: output partitioning

Posted by Alan Gates <ga...@hortonworks.com>.
On Oct 5, 2011, at 5:21 AM, Alex Rovner wrote:

> Alan,
> 
> We are looking into integrating with the HCatalog and I have the following
> questions:
> 
> 1. In your opinion, how stable is the HCatalog?

We have a comprehensive test suite that we run on HCatalog regularly, as does Yahoo.  Also, Yahoo is running it on some of their clusters.  It is a fairly young project but at its core is Hive's metastore, which is mature and well tested.

> 2. On the install page it mentions the creation of the hive metastore db.
> What if we are already using Hive and have an existing metastore db in
> MySQL? What versions of Hive is the HCatalog compatible with?

HCatalog requires Hive metastore 0.7.1 plus a few patches, none of which change the database schema.  We have tested it with the 0.7.1 Hive client and Hive trunk.

Alan.


Re: output partitioning

Posted by Thejas Nair <th...@hortonworks.com>.
-thejas.
typed on a tiny virtual keyboard
On Oct 5, 2011 5:21 AM, "Alex Rovner" <al...@gmail.com> wrote:
> Alan,
>
> We are looking into integrating with the HCatalog and I have the following
> questions:
>
> 1. In your opinion, how stable is the HCatalog?
> 2. On the install page it mentions the creation of the hive metastore db.
> What if we are already using Hive and have an existing metastore db in
> MySQL? What versions of Hive is the HCatalog compatible with?
>
> Thanks in advance
>
> Alex R
>
> On Tue, Oct 4, 2011 at 2:14 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
>> That means one partition at a time, not the number of keys in the
>> partition. And in the 0.2 (just released), the one at a time restriction is
>> removed. So you can partition data by client id and date.
>>
>> Alan.
>>
>> On Oct 4, 2011, at 11:12 AM, Stan Rosenberg wrote:
>>
>> > On Tue, Oct 4, 2011 at 2:06 PM, Alan Gates <ga...@hortonworks.com> wrote:
>> >
>> >> Can you explain what you mean by secondary output partitioning? HCatalog
>> >> supports the same partitioning that Hive does.
>> >>
>> >
>> > "Currently HCatStorer only supports writing to one partition."
>> >
>> > We need to partition our data by client id, then by date, hence two-level
>> > partitioning.
>>
>>

Re: output partitioning

Posted by Alex Rovner <al...@gmail.com>.
Alan,

We are looking into integrating with the HCatalog and I have the following
questions:

1. In your opinion, how stable is the HCatalog?
2. On the install page it mentions the creation of the hive metastore db.
What if we are already using Hive and have an existing metastore db in
MySQL? What versions of Hive is the HCatalog compatible with?

Thanks in advance

Alex R

On Tue, Oct 4, 2011 at 2:14 PM, Alan Gates <ga...@hortonworks.com> wrote:

> That means one partition at a time, not the number of keys in the
> partition.  And in the 0.2 (just released), the one at a time restriction is
> removed.  So you can partition data by client id and date.
>
> Alan.
>
> On Oct 4, 2011, at 11:12 AM, Stan Rosenberg wrote:
>
> > On Tue, Oct 4, 2011 at 2:06 PM, Alan Gates <ga...@hortonworks.com> wrote:
> >
> >> Can you explain what you mean by secondary output partitioning?  HCatalog
> >> supports the same partitioning that Hive does.
> >>
> >
> > "Currently HCatStorer only supports writing to one partition."
> >
> > We need to partition our data by client id, then by date, hence two-level
> > partitioning.
>
>

Re: output partitioning

Posted by Alan Gates <ga...@hortonworks.com>.
That means one partition at a time, not the number of keys in the partition.  And in the 0.2 (just released), the one at a time restriction is removed.  So you can partition data by client id and date.

Alan.

On Oct 4, 2011, at 11:12 AM, Stan Rosenberg wrote:

> On Tue, Oct 4, 2011 at 2:06 PM, Alan Gates <ga...@hortonworks.com> wrote:
> 
>> Can you explain what you mean by secondary output partitioning?  HCatalog
>> supports the same partitioning that Hive does.
>> 
> 
> "Currently HCatStorer only supports writing to one partition."
> 
> We need to partition our data  by client id, then by date, hence two-level
> partitioning.
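
To make the distinction above concrete: a single partition is one combination of key values (for example client_id=7, date=2011-10-04), however many keys the table is partitioned by. The sketch below is a plain Python illustration, not HCatalog or Pig code. It routes each record to a Hive-style key=value directory based on the record's own column values, which is roughly what writing to more than one partition in a single store amounts to. The table root, the column names, and both helper functions are made up for the example.

```python
from collections import defaultdict


def partition_path(table_root, partition_keys, row):
    """Build a Hive-style path such as <root>/client_id=7/date=2011-10-04."""
    parts = [f"{key}={row[key]}" for key in partition_keys]
    return "/".join([table_root.rstrip("/")] + parts)


def route_rows(table_root, partition_keys, rows):
    """Group rows by the partition directory each one belongs to."""
    buckets = defaultdict(list)
    for row in rows:
        # The partition columns are encoded in the path, so this sketch
        # keeps only the remaining columns in the stored record.
        data = {k: v for k, v in row.items() if k not in partition_keys}
        buckets[partition_path(table_root, partition_keys, row)].append(data)
    return dict(buckets)


# Hypothetical sample data: two partition keys, one value column.
rows = [
    {"client_id": 7, "date": "2011-10-03", "clicks": 12},
    {"client_id": 7, "date": "2011-10-04", "clicks": 5},
    {"client_id": 9, "date": "2011-10-04", "clicks": 31},
]

layout = route_rows("/warehouse/events", ["client_id", "date"], rows)
for path in sorted(layout):
    print(path, layout[path])
```

Each distinct (client_id, date) pair ends up in its own directory, so the three sample rows land in three partitions even though there are only two partition keys.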


Re: output partitioning

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.
On Tue, Oct 4, 2011 at 2:06 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Can you explain what you mean by secondary output partitioning?  HCatalog
> supports the same partitioning that Hive does.
>

"Currently HCatStorer only supports writing to one partition."

We need to partition our data  by client id, then by date, hence two-level
partitioning.

Re: output partitioning

Posted by Alan Gates <ga...@hortonworks.com>.
Can you explain what you mean by secondary output partitioning?  HCatalog supports the same partitioning that Hive does.

Alan.

On Oct 4, 2011, at 11:01 AM, Stan Rosenberg wrote:

> On Tue, Oct 4, 2011 at 1:27 PM, Alan Gates <ga...@hortonworks.com> wrote:
> 
>> If you want to use Pig and Hive together, you should also consider
>> HCatalog, which was built exactly to address that use case.
>> http://incubator.apache.org/hcatalog
> 
> 
> We'll definitely consider HCatalog but unfortunately it does not seem to be
> ready for prime time.  Due to our data volume we need to have a secondary
> output partitioning; HCatalog does not yet support it.


Re: output partitioning

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.
On Tue, Oct 4, 2011 at 1:27 PM, Alan Gates <ga...@hortonworks.com> wrote:

> If you want to use Pig and Hive together, you should also consider
> HCatalog, which was built exactly to address that use case.
> http://incubator.apache.org/hcatalog


We'll definitely consider HCatalog but unfortunately it does not seem to be
ready for prime time.  Due to our data volume we need to have a secondary
output partitioning; HCatalog does not yet support it.

Re: output partitioning

Posted by Alan Gates <ga...@hortonworks.com>.
If you want to use Pig and Hive together, you should also consider HCatalog, which was built exactly to address that use case.  http://incubator.apache.org/hcatalog/

Alan.

On Oct 4, 2011, at 10:24 AM, Thejas Nair wrote:

> See the piggybank store func -
> http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
> 
> Also, see piggybank load func - http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/piggybank/storage/AllLoader.html
> 
> -Thejas
> 
> 
> On 10/3/11 8:14 PM, Stan Rosenberg wrote:
>> Sorry folks, I've got to disable keyboard shortcuts in gmail.
>> 
>> I'd like to store the output relation partitioned by certain columns akin to
>> what hive does.  In fact, the ultimate goal is to leverage
>> hive's dynamic partitions to store the output from pig.  Any pointers are
>> greatly appreciated.
>> 
>> Thanks,
>> 
>> stan
>> 
>> On Mon, Oct 3, 2011 at 11:09 PM, Stan Rosenberg <srosenberg@proclivitysystems.com> wrote:
>> 
>>> Hi,
>>> 
>>> I'd like to store the output relation partitioned by
>>> 
>> 
> 


Re: output partitioning

Posted by Thejas Nair <th...@hortonworks.com>.
See the piggybank store func -
http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/piggybank/storage/MultiStorage.html

Also, see piggybank load func - 
http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/piggybank/storage/AllLoader.html

-Thejas


On 10/3/11 8:14 PM, Stan Rosenberg wrote:
> Sorry folks, I've got to disable keyboard shortcuts in gmail.
>
> I'd like to store the output relation partitioned by certain columns akin to
> what hive does.  In fact, the ultimate goal is to leverage
> hive's dynamic partitions to store the output from pig.  Any pointers are
> greatly appreciated.
>
> Thanks,
>
> stan
>
> On Mon, Oct 3, 2011 at 11:09 PM, Stan Rosenberg <srosenberg@proclivitysystems.com> wrote:
>
>> Hi,
>>
>> I'd like to store the output relation partitioned by
>>
>


Re: output partitioning

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.
Sorry folks, I've got to disable keyboard shortcuts in gmail.

I'd like to store the output relation partitioned by certain columns akin to
what hive does.  In fact, the ultimate goal is to leverage
hive's dynamic partitions to store the output from pig.  Any pointers are
greatly appreciated.

Thanks,

stan

On Mon, Oct 3, 2011 at 11:09 PM, Stan Rosenberg <srosenberg@proclivitysystems.com> wrote:

> Hi,
>
> I'd like to store the output relation partitioned by
>