
Posted to user@hive.apache.org by Guy Bayes <fa...@gmail.com> on 2011/05/26 22:25:43 UTC

questions about statistics in 0.7

Hello all, I'm new to this list,

I was wondering if anyone could answer a couple questions about the
implementation of statistics in 0.7?

I've reviewed
http://wiki.apache.org/hadoop/Hive/StatsDev

and have the following q

Re: questions about statistics in 0.7

Posted by Ning Zhang <nz...@fb.com>.

On May 26, 2011, at 1:28 PM, Guy Bayes wrote:

Crap sorry hit send too early

questions
1: Job overhead of generating statistics on the fly with

set hive.stats.autogather=true;?

The overhead is minimal. The only measurable cost is inserting a row into an RDBMS or HBase at the end of each task, plus one aggregation query against the RDBMS or HBase at the end of the query. Trunk (0.8-snapshot) has some further optimizations that reduce the overhead. Note that every RDBMS/HBase operation is subject to a timeout, and you can configure the timeout value as appropriate.
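
For reference, a minimal sketch of the session settings involved. The first line is the setting from the question. The second assumes that hive.stats.jdbc.timeout (in seconds) is the property controlling the timeout when stats are stored through JDBC; the value 60 is only illustrative, so check the hive-default.xml shipped with your version before relying on it.

```sql
-- Gather statistics automatically when a table or partition is populated
-- (the setting from the question).
set hive.stats.autogather=true;

-- Assumption: timeout in seconds for the JDBC operations used to publish and
-- aggregate stats. The value 60 is illustrative, not a recommendation.
set hive.stats.jdbc.timeout=60;
```

Both settings are per session here; they could also go in hive-site.xml if they should apply to every query.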

2: Are stat descriptions in describe table extended implemented? I've gathered stats on a table but do not see the expected entries (rowNum = , etc) in the describe output.

It is working. If rowNum is not there, an error must have occurred during stats publishing or aggregation. That step is designed to tolerate any exception so that it won't affect the main query. You can look at the Hive log at /tmp/<username>/hive.log, or at the task log, and search for Stats warning messages.
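
One way to re-check is to gather the stats explicitly and then look at the extended description again. This is only a sketch: page_views is a hypothetical table name, and the exact parameter names shown in the output may differ between Hive versions.

```sql
-- page_views is a hypothetical table name, used only for illustration.
-- Explicitly gather statistics for an existing table.
ANALYZE TABLE page_views COMPUTE STATISTICS;

-- When publishing and aggregation succeed, the stats show up among the
-- table parameters in the extended description.
DESCRIBE EXTENDED page_views;
```

If the parameters are still missing after this, the warnings in hive.log or the task log from this ANALYZE run are the place to look.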

3: How does hive actually use stats to influence query plans? Any documentation?

Currently no optimizations are done based on these stats, although that's one of our intentions.

we are on CDH3 GA by the way

thanks
Guy

Re: questions about statistics in 0.7

Posted by Guy Bayes <fa...@gmail.com>.

Crap sorry hit send too early

questions
1: Job overhead of generating statistics on the fly with

set hive.stats.autogather=true;?

2: Are stat descriptions in describe table extended implemented? I've
gathered stats on a table but do not see the expected entries (rowNum = ,
etc) in the describe output.

3: How does hive actually use stats to influence query plans? Any
documentation?

we are on CDH3 GA by the way

thanks
Guy
