You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Navdeep Agrawal <Na...@symantec.com> on 2014/07/22 19:20:03 UTC
Hive Statistics
Hi ,
i am trying to compute statistics on ORC File but i am unable see any changes in PART_COL_STATS as well on using
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
to get max value of a column it is running full Map reduce on column ..
what i want to use is max value stored in meta store ,but i am unable to catch these statistics .
my table desc is
load_inst_id int
src_filename string
server_date date
my analyze query is
analyze table mytable partition(server_date=’2013-11-30′) compute statistics for columns load_inst_id;
i am always getting 0 as loadinstant id ,i have to turn off my hive.compute.query.using.stats to get correct result(through map reduce max(load_inst_id))
RE: Hive Statistics
Posted by Navdeep Agrawal <Na...@symantec.com>.
Stuck .need help
I created a small table with multiple partition desc (id int ,term int) partitioned by id ,whenever I run analyze on any id I am getting perfectly good answers . I am unable to figure out the difference each file is making .
New table
Table Parameters:
transient_lastDdlTime 1406016417
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
inserting
insert into table statdevp partition(id) select id,term from statdev where id is not null and term is not null
analyze
analyze table statdevp partition(id=11) compute statistics for columns id;
I am able to see all values in part_col_stat for the partitions I am running analyze
and the orginal table :
desc
Table Parameters:
last_modified_by XXXXXXXXX
last_modified_time 1406047797
transient_lastDdlTime 1406047797
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
load_inst_id int src_filename string server_date date # Partition Information # col_name data_type comment server_date date
insertion
insert into devStat partition(server_date) select load_inst_id,src_filename,server_ts,server_date from rrslog_ext where server_date='2013-11-01' and load_inst_id is not null limit 100
analyze
analyze table devstat partition(server_date='2013-11-30') compute statistics for columns load_inst_id;
any help will be highly appreciated .stuck here long time …
thanks in advance .
From: Navdeep Agrawal [mailto:Navdeep_Agrawal@symantec.com]
Sent: Wednesday, July 23, 2014 3:17 PM
To: user@hive.apache.org
Subject: RE: Hive Statistics
No I have not set these to mysql db . when I set them to the one I am using for hive I am getting stat publisher not getting initialized .but if I have not set these parameters why every time a new row is getting created in mysql db in part_col_stats table .
From: Andre Araujo [mailto:araujo@pythian.com]
Sent: Wednesday, July 23, 2014 1:22 PM
To: user
Subject: Re: Hive Statistics
Hi, Navdeep,
Please note that the configuration for the stats database is separate from the configuration for the metastore db.
Can you confirm you have both to use a mysql db?
The properties for the stats db are:
hive.stats.dbclass=
hive.stats.dbconnectionstring=
On 23 July 2014 16:07, Navdeep Agrawal <Na...@symantec.com>> wrote:
Thank you Nitin for reply. I am using mysql database ,and also I can see new row created for the partition ,but all values are zero . I think explicitly giving mysql data base wont make a difference .
From: Nitin Pawar [mailto:nitinpawar432@gmail.com<ma...@gmail.com>]
Sent: Tuesday, July 22, 2014 11:05 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Hive Statistics
by default hive stores the statistics in derby database.
If you want a persistent look at column statistics, you may want to create mysql based database for column statistics.
Your queries look fine
On Tue, Jul 22, 2014 at 10:50 PM, Navdeep Agrawal <Na...@symantec.com>> wrote:
Hi ,
i am trying to compute statistics on ORC File but i am unable see any changes in PART_COL_STATS as well on using
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
to get max value of a column it is running full Map reduce on column ..
what i want to use is max value stored in meta store ,but i am unable to catch these statistics .
my table desc is
load_inst_id int
src_filename string
server_date date
my analyze query is
analyze table mytable partition(server_date=’2013-11-30′) compute statistics for columns load_inst_id;
i am always getting 0 as loadinstant id ,i have to turn off my hive.compute.query.using.stats to get correct result(through map reduce max(load_inst_id))
--
Nitin Pawar
--
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com<http://www.pythian.com>
Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com<ma...@pythian.com> @ GTalk
“Success is not about standing at the top, it's the steps you leave behind.” — Iker Pou (rock climber)
--
RE: Hive Statistics
Posted by Navdeep Agrawal <Na...@symantec.com>.
No I have not set these to mysql db . when I set them to the one I am using for hive I am getting stat publisher not getting initialized .but if I have not set these parameters why every time a new row is getting created in mysql db in part_col_stats table .
From: Andre Araujo [mailto:araujo@pythian.com]
Sent: Wednesday, July 23, 2014 1:22 PM
To: user
Subject: Re: Hive Statistics
Hi, Navdeep,
Please note that the configuration for the stats database is separate from the configuration for the metastore db.
Can you confirm you have both to use a mysql db?
The properties for the stats db are:
hive.stats.dbclass=
hive.stats.dbconnectionstring=
On 23 July 2014 16:07, Navdeep Agrawal <Na...@symantec.com>> wrote:
Thank you Nitin for reply. I am using mysql database ,and also I can see new row created for the partition ,but all values are zero . I think explicitly giving mysql data base wont make a difference .
From: Nitin Pawar [mailto:nitinpawar432@gmail.com<ma...@gmail.com>]
Sent: Tuesday, July 22, 2014 11:05 PM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Hive Statistics
by default hive stores the statistics in derby database.
If you want a persistent look at column statistics, you may want to create mysql based database for column statistics.
Your queries look fine
On Tue, Jul 22, 2014 at 10:50 PM, Navdeep Agrawal <Na...@symantec.com>> wrote:
Hi ,
i am trying to compute statistics on ORC File but i am unable see any changes in PART_COL_STATS as well on using
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
to get max value of a column it is running full Map reduce on column ..
what i want to use is max value stored in meta store ,but i am unable to catch these statistics .
my table desc is
load_inst_id int
src_filename string
server_date date
my analyze query is
analyze table mytable partition(server_date=’2013-11-30′) compute statistics for columns load_inst_id;
i am always getting 0 as loadinstant id ,i have to turn off my hive.compute.query.using.stats to get correct result(through map reduce max(load_inst_id))
--
Nitin Pawar
--
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com<http://www.pythian.com>
Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com<ma...@pythian.com> @ GTalk
“Success is not about standing at the top, it's the steps you leave behind.” — Iker Pou (rock climber)
--
Re: Hive Statistics
Posted by Andre Araujo <ar...@pythian.com>.
Hi, Navdeep,
Please note that the configuration for the stats database is separate from
the configuration for the metastore db.
Can you confirm you have both to use a mysql db?
The properties for the stats db are:
hive.stats.dbclass=
hive.stats.dbconnectionstring=
On 23 July 2014 16:07, Navdeep Agrawal <Na...@symantec.com> wrote:
> Thank you Nitin for reply. I am using mysql database ,and also I can see
> new row created for the partition ,but all values are zero . I think
> explicitly giving mysql data base wont make a difference .
>
>
>
> *From:* Nitin Pawar [mailto:nitinpawar432@gmail.com]
> *Sent:* Tuesday, July 22, 2014 11:05 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive Statistics
>
>
>
> by default hive stores the statistics in derby database.
>
>
>
> If you want a persistent look at column statistics, you may want to create
> mysql based database for column statistics.
>
>
>
> Your queries look fine
>
>
>
> On Tue, Jul 22, 2014 at 10:50 PM, Navdeep Agrawal <
> Navdeep_Agrawal@symantec.com> wrote:
>
> Hi ,
>
>
>
> i am trying to compute statistics on ORC File but i am unable see any
> changes in PART_COL_STATS as well on using
>
> set hive.compute.query.using.stats=true;
> set hive.stats.reliable=true;
> set hive.stats.fetch.column.stats=true;
> set hive.stats.fetch.partition.stats=true;
> set hive.cbo.enable=true;
>
> to get max value of a column it is running full Map reduce on column ..
> what i want to use is max value stored in meta store ,but i am unable to
> catch these statistics .
>
> my table desc is
> load_inst_id int
> src_filename string
> server_date date
>
> my analyze query is
> analyze table mytable partition(server_date=’2013-11-30′) compute
> statistics for columns load_inst_id;
>
> i am always getting 0 as loadinstant id ,i have to turn off my
> hive.compute.query.using.stats to get correct result(through map reduce
> max(load_inst_id))
>
>
>
>
>
>
>
> --
> Nitin Pawar
>
--
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com
Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000 x270 *OR* +1 613 565 8696 x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or araujo@pythian.com @ GTalk
“Success is not about standing at the top, it's the steps you leave behind.”
— Iker Pou (rock climber)
--
--
RE: Hive Statistics
Posted by Navdeep Agrawal <Na...@symantec.com>.
Thank you Nitin for reply. I am using mysql database ,and also I can see new row created for the partition ,but all values are zero . I think explicitly giving mysql data base wont make a difference .
From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Tuesday, July 22, 2014 11:05 PM
To: user@hive.apache.org
Subject: Re: Hive Statistics
by default hive stores the statistics in derby database.
If you want a persistent look at column statistics, you may want to create mysql based database for column statistics.
Your queries look fine
On Tue, Jul 22, 2014 at 10:50 PM, Navdeep Agrawal <Na...@symantec.com>> wrote:
Hi ,
i am trying to compute statistics on ORC File but i am unable see any changes in PART_COL_STATS as well on using
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;
to get max value of a column it is running full Map reduce on column ..
what i want to use is max value stored in meta store ,but i am unable to catch these statistics .
my table desc is
load_inst_id int
src_filename string
server_date date
my analyze query is
analyze table mytable partition(server_date=’2013-11-30′) compute statistics for columns load_inst_id;
i am always getting 0 as loadinstant id ,i have to turn off my hive.compute.query.using.stats to get correct result(through map reduce max(load_inst_id))
--
Nitin Pawar
Re: Hive Statistics
Posted by Nitin Pawar <ni...@gmail.com>.
by default hive stores the statistics in derby database.
If you want a persistent look at column statistics, you may want to create
mysql based database for column statistics.
Your queries look fine
On Tue, Jul 22, 2014 at 10:50 PM, Navdeep Agrawal <
Navdeep_Agrawal@symantec.com> wrote:
> Hi ,
>
>
>
> i am trying to compute statistics on ORC File but i am unable see any
> changes in PART_COL_STATS as well on using
>
> set hive.compute.query.using.stats=true;
> set hive.stats.reliable=true;
> set hive.stats.fetch.column.stats=true;
> set hive.stats.fetch.partition.stats=true;
> set hive.cbo.enable=true;
>
> to get max value of a column it is running full Map reduce on column ..
> what i want to use is max value stored in meta store ,but i am unable to
> catch these statistics .
>
> my table desc is
> load_inst_id int
> src_filename string
> server_date date
>
> my analyze query is
> analyze table mytable partition(server_date=’2013-11-30′) compute
> statistics for columns load_inst_id;
>
> i am always getting 0 as loadinstant id ,i have to turn off my
> hive.compute.query.using.stats to get correct result(through map reduce
> max(load_inst_id))
>
>
>
--
Nitin Pawar