Posted to dev@pig.apache.org by Prasanth J <bu...@gmail.com> on 2012/08/05 10:12:43 UTC

Storing statistics of input dataset

Hello everyone

I came across this excellent post about storing column statistics in Hive: http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/

Does Pig gather statistics the way Hive does? I think gathering such statistics would be very helpful, not only for a cost-based optimizer but also in other cases, such as knowing the row count or the histogram of the underlying data. In my case, I am working on cube computation for holistic measures, where I need to know the row count; based on it, I can load a sample of the data set to determine the partition factor for large groups. I am sure gathering and persisting statistics will help in other cases and optimizations as well.
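
To make the cube use case above concrete, here is a minimal sketch of how a known row count could drive sampling and a per-group partition factor. It is only an illustration under stated assumptions: the function names, the target sample size, and the per-reducer row limit are all hypothetical and do not come from Pig or Hive.

```python
import math
from collections import Counter


def sample_rate(total_rows, target_sample_rows=100_000):
    """Fraction of rows to sample so the sample holds roughly target_sample_rows rows.

    Both parameters are hypothetical; total_rows is the statistic we would
    like the system to provide.
    """
    if total_rows <= 0:
        return 0.0
    return min(1.0, target_sample_rows / total_rows)


def partition_factors(sample_keys, rate, max_rows_per_reducer):
    """Scale sampled group sizes up to the full data set and return, for each
    group estimated to exceed max_rows_per_reducer, how many partitions it
    should be split into."""
    factors = {}
    for key, seen in Counter(sample_keys).items():
        estimated_rows = seen / rate  # scale the sample count back up
        if estimated_rows > max_rows_per_reducer:
            factors[key] = math.ceil(estimated_rows / max_rows_per_reducer)
    return factors


# Example: 1,000,000 input rows, a sample of about 1,000 rows.
rate = sample_rate(1_000_000, target_sample_rows=1_000)
sampled_group_keys = ["us"] * 600 + ["uk"] * 300 + ["fr"] * 100
large_groups = partition_factors(sampled_group_keys, rate, max_rows_per_reducer=250_000)
```

Without a stored row count, the sampling rate would have to be guessed or computed with an extra pass over the data, which is the cost persisted statistics would avoid.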

If I am right, Pig doesn't use cost-based estimation while optimizing the logical plan; instead, I believe it uses rules of thumb (please correct me if I am wrong). Having statistics about the datasets would help provide better optimization (similar to the join optimization in the blog post). Any thoughts about having such statistics in Pig and implementing an ANALYZE command for gathering them?
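
As a rough illustration of the difference between a rule of thumb and a statistics-driven choice, the sketch below picks a join strategy from table sizes. It is not Pig's or Hive's optimizer: `TableStats`, `choose_join`, and the memory budget are hypothetical names and values. The only Pig-specific fact assumed is that Pig offers a fragment-replicate join (`USING 'replicated'`) that needs the smaller input to fit in memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableStats:
    """Hypothetical per-table statistics, e.g. as gathered by an ANALYZE-style command."""
    row_count: int
    avg_row_bytes: int

    @property
    def size_bytes(self) -> int:
        return self.row_count * self.avg_row_bytes


def choose_join(left: Optional[TableStats],
                right: Optional[TableStats],
                memory_budget_bytes: int) -> str:
    """Return 'replicated' when statistics show the smaller input fits in
    memory; otherwise fall back to the default join."""
    if left is None or right is None:
        # No statistics available: behave like a rule-based planner.
        return "default"
    smaller = min(left.size_bytes, right.size_bytes)
    return "replicated" if smaller <= memory_budget_bytes else "default"


# Example: a large fact table joined with a small lookup table.
fact_table = TableStats(row_count=1_000_000_000, avg_row_bytes=100)
lookup_table = TableStats(row_count=10_000, avg_row_bytes=50)
strategy = choose_join(fact_table, lookup_table, memory_budget_bytes=64 * 1024 * 1024)
```

The fallback branch shows why missing statistics force a conservative plan: without sizes, the planner cannot know whether the cheaper strategy is safe.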

Thanks
-- Prasanth Jayachandran

Re: Storing statistics of input dataset

Posted by Bill Graham <bi...@gmail.com>.

There are a few open JIRAs that are related to refactoring the query plan
code to allow for stats-based runtime optimizations:

https://issues.apache.org/jira/browse/PIG-483
https://issues.apache.org/jira/browse/PIG-2784

If anyone has thoughts/opinions around suggested design changes, those
JIRAs could be a good place to chime in.


On Mon, Aug 6, 2012 at 5:18 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> +1 to that.
>
> We can get stats from the Hive metadata catalog via HCat. Loaders can
> already implement the LoadStatistics interface -- and if HCatLoader
> does this, we can create the stats via Hive and use that team's great work.
> We should also allow stats to be passed (and modified appropriately)
> through the DAG, and instrument intermediate data writers to collect
> stats and send telemetry back for improved flow planning, but that's a
> separate conversation.
>
> D


Re: Storing statistics of input dataset

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

+1 to that.

We can get stats from the Hive metadata catalog via HCat. Loaders can
already implement the LoadStatistics interface -- and if HCatLoader
does this, we can create the stats via Hive and use that team's great work.
We should also allow stats to be passed (and modified appropriately)
through the DAG, and instrument intermediate data writers to collect
stats and send telemetry back for improved flow planning, but that's a
separate conversation.
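
One way to picture passing stats through the DAG and modifying them along the way is the sketch below, which pushes a row-count estimate through a linear chain of operators. It is only a toy model: the operator names and selectivity values are made up, and nothing here reflects an existing Pig API.

```python
def propagate_row_estimates(input_rows, operators):
    """Carry a row-count estimate through a chain of operators.

    operators is a list of (name, selectivity) pairs, where selectivity is the
    assumed fraction of rows each operator lets through. Returns the estimated
    row count after each operator.
    """
    estimates = []
    rows = float(input_rows)
    for name, selectivity in operators:
        rows *= selectivity  # adjust the statistic as it passes through
        estimates.append((name, round(rows)))
    return estimates


# Example plan: LOAD (1,000,000 rows) -> FILTER -> FOREACH -> DISTINCT
plan = [("filter", 0.1), ("foreach", 1.0), ("distinct", 0.5)]
estimates = propagate_row_estimates(1_000_000, plan)
```

Counts collected by instrumented intermediate writers could then replace the assumed selectivities with observed ones for later runs.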

D

On Mon, Aug 6, 2012 at 10:35 AM, Alan Gates <ga...@hortonworks.com> wrote:
> Pig does not have a metadata store, so it doesn't store statistics on data.  However, through HCatalog it will have access to the same statistics that Hive stores.
>
> As far as using this data to optimize Pig operations, I'd like to rework the backend to start taking advantage of such statistics when available (either from metadata like this or statistics that are generated on the fly as scripts are executed).  I also hope to share as much of this work as possible with Hive so that both can benefit.
>
> Alan.

Re: Storing statistics of input dataset

Posted by Alan Gates <ga...@hortonworks.com>.

Pig does not have a metadata store, so it doesn't store statistics on data.  However, through HCatalog it will have access to the same statistics that Hive stores.  

As far as using this data to optimize Pig operations, I'd like to rework the backend to start taking advantage of such statistics when available (either from metadata like this or statistics that are generated on the fly as scripts are executed).  I also hope to share as much of this work as possible with Hive so that both can benefit.

Alan.
