Posted to user@pig.apache.org by Alex Rovner <al...@gmail.com> on 2012/06/01 17:49:14 UTC

number of reducers

Hello,

We have written a HiveLoader that loads data from a Hive warehouse
(HCatalog had roadblocks at the time, and we decided against using it).

We have one minor issue that would be great to solve: currently Pig cannot
correctly estimate how many reducers to use when loading data from a Hive
warehouse.

We have looked through the code and traced the problem to the following:

Pig uses the location returned from "relativeToAbsolutePath" to figure
out how many reducers it needs. When loading from Hive, we do not
know which paths we need to load until the setPartition() call is
made. We could of course return the root of the table as the path from the
"relativeToAbsolutePath" call, but that would make Pig overestimate the
number of reducers needed, since it would not take into account the partition
filtering that is taking place.
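
To make the overestimate concrete, here is a minimal sketch of a size-based reducer estimate. It is an illustration, not Pig's code: the function name and the byte counts are hypothetical, and the defaults (about 1 GB of input per reducer, capped at 999) are assumed to mirror Pig's pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max settings.

```python
def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """Size-based estimate: one reducer per `bytes_per_reducer` of input,
    clamped to the range [1, max_reducers]."""
    # Ceiling division without floating point.
    wanted = -(-total_input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


GB = 1_000_000_000

# Hypothetical sizes: the whole table versus the partitions left after filtering.
table_root_bytes = 500 * GB
filtered_partitions_bytes = 12 * GB

print(estimate_reducers(table_root_bytes))           # 500
print(estimate_reducers(filtered_partitions_bytes))  # 12
```

With these made-up sizes, pointing the estimate at the table root asks for roughly forty times more reducers than the filtered partitions would need.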

Are there any workarounds for this issue?
From my understanding, it would be sufficient if relativeToAbsolutePath
were called after the setLocation and setPartition calls.

Any input would be appreciated.

Thanks
Alex

Re: number of reducers

Posted by Alan Gates <ga...@hortonworks.com>.

HCatalog does not require its own metadata store. It shares the MySQL database with Hive. All of this is much cleaner and clearer now in the 0.4 release of HCatalog, since it is now a set of jars for use on the client side. It depends on you setting up a Hive metastore first.

There is a Hive load function in Piggybank that you could look at, but hopefully with the new 0.4 release of HCatalog it will be light enough to accomplish what you need.

Alan.

On Jun 1, 2012, at 12:33 PM, Alex Rovner wrote:

> Ashutosh,
> 
> The issues we had are probably not relevant anymore, as we were using the 2.0
> release with Pig version 8 or 9.
> 
> We can certainly attempt to use HCatalog again in the near future.
> 
> Two general questions I had about the design:
> 
> Why does HCatalog need its own MySQL database? Why can't it just leverage
> Hive's metadata?
> Is it possible to create a loader / storage func that works off existing
> Hive metadata without the need to install HCatalog? This could be useful to
> folks like us who don't need the full feature set of HCatalog but would
> rather just read / write data to Hive with Pig.
> 
> Thanks
> Alex

Re: number of reducers

Posted by Alex Rovner <al...@gmail.com>.

Ashutosh,

The issues we had are probably not relevant anymore, as we were using the 2.0
release with Pig version 8 or 9.

We can certainly attempt to use HCatalog again in the near future.

Two general questions I had about the design:

Why does HCatalog need its own MySQL database? Why can't it just leverage
Hive's metadata?
Is it possible to create a loader / storage func that works off existing
Hive metadata without the need to install HCatalog? This could be useful to
folks like us who don't need the full feature set of HCatalog but would
rather just read / write data to Hive with Pig.

Thanks
Alex

On Fri, Jun 1, 2012 at 2:02 PM, Ashutosh Chauhan <ha...@apache.org> wrote:

> >> (HCatalog had roadblocks at the time and we decided against using it)
> Off-topic, but it would be great if you could let us know what those
> roadblocks were. We will try to address them if they are still there.
>
> Thanks,
> Ashutosh

Re: number of reducers

Posted by Ashutosh Chauhan <ha...@apache.org>.

>> (HCatalog had roadblocks at the time and we decided against using it)
Off-topic, but it would be great if you could let us know what those
roadblocks were. We will try to address them if they are still there.

Thanks,
Ashutosh

Re: number of reducers

Posted by Alex Rovner <al...@gmail.com>.

Thanks Bill. This is exactly what I was looking for.

We are using version 11, but not the latest from trunk.

I will have to rebuild using the latest trunk.

Alex

On Fri, Jun 1, 2012 at 12:55 PM, Bill Graham <bi...@gmail.com> wrote:

> What version of Pig are you running, and if it's not the trunk can you try
> with the trunk?
>
> There have been a number of improvements to how we get total input size
> when estimating reducers. Basically, the input size is now requested from
> the LoadFunc, which has more info about statistics.
>
> See
> https://issues.apache.org/jira/browse/PIG-2573
> https://issues.apache.org/jira/browse/PIG-2693

Re: number of reducers

Posted by Bill Graham <bi...@gmail.com>.

What version of Pig are you running, and if it's not the trunk can you try
with the trunk?

There have been a number of improvements to how we get total input size
when estimating reducers. Basically, the input size is now requested from
the LoadFunc, which has more info about statistics.

See
https://issues.apache.org/jira/browse/PIG-2573
https://issues.apache.org/jira/browse/PIG-2693
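
As a rough picture of what "the input size is requested from the LoadFunc" buys a partitioned loader, here is a toy model. It is not Pig's Java API: the class, its method names, the partition names, and the sizes are all hypothetical. The point is only that a loader which knows its partition filter can report the size of the surviving partitions rather than the size of the table root.

```python
class PartitionedLoader:
    """Toy stand-in for a loader over a partitioned table (hypothetical API)."""

    def __init__(self, partition_bytes):
        # Maps partition name -> size in bytes.
        self.partition_bytes = dict(partition_bytes)
        # No filter yet: every partition is selected.
        self.selected = set(self.partition_bytes)

    def set_partition_filter(self, keep):
        # `keep` is a predicate on the partition name.
        self.selected = {name for name in self.partition_bytes if keep(name)}

    def input_size_bytes(self):
        # Only the partitions that survive the filter count toward the size.
        return sum(self.partition_bytes[name] for name in self.selected)


GB = 1_000_000_000
loader = PartitionedLoader({
    "dt=2012-05-30": 100 * GB,
    "dt=2012-05-31": 100 * GB,
    "dt=2012-06-01": 20 * GB,
})

size_before = loader.input_size_bytes()  # whole table
loader.set_partition_filter(lambda name: name == "dt=2012-06-01")
size_after = loader.input_size_bytes()   # filtered partitions only

print(size_before // GB, size_after // GB)
```

In this model the size is read after the filter has been set, which is the ordering the original question asks for.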

-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*