You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Ian <li...@yahoo.com> on 2013/04/05 01:01:19 UTC

Partition performance

Hi,
 
I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods.
 
1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour). Use date and hour as partition keys. Add 3 years of directories to the table partitions. So there are 26280 partitions.
        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int);
        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16';
 
2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions.        CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02';
 
When doing a simple query like 
    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. 
 
I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy?
 
Thanks.

Re: Partition performance

Posted by Ian <li...@yahoo.com>.

There is only one map task because it's using the CombineHiveInputFormat (In my test cases, all files are very small). If I set hive.input.format to HiveInputFormat, then it has 336 map tasks in the first case. But the performance is even worse since there are too many map tasks and each one is only handling a small file.
 
It takes a lot of time before it actually submits the job. So maybe querying the metastore for partition info takes time?
  

________________________________
 From: Ramki Palle <ra...@gmail.com>
To: user@hive.apache.org; Ian <li...@yahoo.com> 
Sent: Friday, April 5, 2013 1:12 PM
Subject: Re: Partition performance
  


Can you tell how many map tasks are there in each scenario?

If my assumption is correct, you should have 336 in the first case and 14 in second case.



It looks like it is combing all small files in a folder and running as one map task for all 24 files in a folder, whereas it is running a separate task in these files are there in different partitions (folders).


You can try to reuse the JVM and see if the response time is similar.

Can you please try the following and let us know how long each strategy takes?


hive> set mapred.job.reuse.jvm.num.tasks = 24;


Run your  query that has more partitions and see if the response time is lower.


Regards,

Ramki.




On Fri, Apr 5, 2013 at 11:36 AM, Ian <li...@yahoo.com> wrote:

Thanks. This is just a test from my local box. So each file is only 1kb. I shared the query plans of these two tests at: 
>http://codetidy.com/paste/raw/5198
>http://codetidy.com/paste/raw/5199 
> 
>Also in the Hadoop log, there is this line for each partition:org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work list for file hdfs://localhost:8020/test1/2011/02/01/01 
>Does that mean each partition will become a map task?
>
>I'm still new in Hive, just wondering what are the common strategy for partitioning the hourly logs? I know we shouldn't have too many partitions but I'm wondering what's the reason behind it? If I run this on a real cluster, maybe it won't perform so differently?
>
>Thanks.  
> From: Dean Wampler <de...@thinkbiganalytics.com>
>To: user@hive.apache.org 
>Sent: Thursday, April 4, 2013 4:28 PM
>Subject: Re: Partition performance
> 
>
>
>Also, how big are the files in each directory? Are they roughly the size of one HDFS block or a multiple. Lots of small files will mean lots of mapper tasks will little to do.
>
>
>You can also compare the job tracker console output for each job. I bet the slow one has a lot of very short map and reduce tasks, while the faster one has fewer tasks that run longer. A rule of thumb is that any one task should take 20 seconds or more to amortize over the few seconds spent in start up per task. 
>
>
>In other words, if you think about what's happening at the HDFS and MR level, you can learn to predict how fast or slow things will run. Learning to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this. 
>
>
>dean
>
>
>On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <om...@apache.org> wrote:
>
>See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000.
>>
>>-- Owen
>>
>>
>>
>>On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ra...@gmail.com> wrote:
>>
>>Is it possible for you to send the explain plan of these two queries?
>>>
>>>Regards,
>>>Ramki.
>>>
>>>
>>>
>>>
>>>On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <Sa...@wizecommerce.com> wrote:
>>>
>>>The slow down is most possibly due to large number of partitions. 
>>>>I believe the Hive book authors tell us to be cautious with large number of partitions :-)  and I abide by that. 
>>>>
>>>> 
>>>>Users 
>>>>Please add your points of view and experiences 
>>>>
>>>> 
>>>>Thanks 
>>>>sanjay 
>>>>
>>>> From: Ian <li...@yahoo.com>
>>>>Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, Ian <li...@yahoo.com>
>>>>Date: Thursday, April 4, 2013 4:01 PM
>>>>To: "user@hive.apache.org" <us...@hive.apache.org>
>>>>Subject: Partition performance
>>>>
>>>>
>>>> 
>>>>Hi, 
>>>>
>>>>I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods. 
>>>>
>>>>1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour). Use date and hour as partition keys. Add 3 years of directories to the table partitions. So there are 26280 partitions. 
>>>>        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int); 
>>>>        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16'; 
>>>>  
>>>>2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions.         CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); 
>>>>        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02'; 
>>>>  
>>>>When doing a simple query like  
>>>>    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'  
>>>>Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.  
>>>>
>>>>I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy? 
>>>>  
>>>>Thanks.  
>>>>
>>>>   
>>>>
>>>>CONFIDENTIALITY NOTICE
>>>>======================
>>>>This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient,
 please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review
 and disclosure by the sender's Email System Administrator.
>>>>  
>>> 
>> 
>
>
>
>-- 
>Dean Wampler, Ph.D.
>thinkbiganalytics.com
>+1-312-339-1330 
>
>  
>
>

Re: Partition performance

Posted by Ramki Palle <ra...@gmail.com>.

Can you tell how many map tasks are there in each scenario?

If my assumption is correct, you should have 336 in the first case and 14
in second case.


It looks like it is combing all small files in a folder and running as one
map task for all 24 files in a folder, whereas it is running a separate
task in these files are there in different partitions (folders).

You can try to reuse the JVM and see if the response time is similar.

Can you please try the following and let us know how long each strategy
takes?

hive> set mapred.job.reuse.jvm.num.tasks = 24;

Run your  query that has more partitions and see if the response time is
lower.

Regards,
Ramki.


On Fri, Apr 5, 2013 at 11:36 AM, Ian <li...@yahoo.com> wrote:

> Thanks. This is just a test from my local box. So each file is only 1kb. I
> shared the query plans of these two tests at:
> http://codetidy.com/paste/raw/5198
> http://codetidy.com/paste/raw/5199
>
> Also in the Hadoop log, there is this line for each partition:
> org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work
> list for file hdfs://localhost:8020/test1/2011/02/01/01
> Does that mean each partition will become a map task?
>
> I'm still new in Hive, just wondering what are the common strategy for
> partitioning the hourly logs? I know we shouldn't have too many partitions
> but I'm wondering what's the reason behind it? If I run this on a real
> cluster, maybe it won't perform so differently?
>
> Thanks.
>
>   *From:* Dean Wampler <de...@thinkbiganalytics.com>
> *To:* user@hive.apache.org
> *Sent:* Thursday, April 4, 2013 4:28 PM
> *Subject:* Re: Partition performance
>
> Also, how big are the files in each directory? Are they roughly the size
> of one HDFS block or a multiple. Lots of small files will mean lots of
> mapper tasks will little to do.
>
> You can also compare the job tracker console output for each job. I bet
> the slow one has a lot of very short map and reduce tasks, while the faster
> one has fewer tasks that run longer. A rule of thumb is that any one task
> should take 20 seconds or more to amortize over the few seconds spent in
> start up per task.
>
> In other words, if you think about what's happening at the HDFS and MR
> level, you can learn to predict how fast or slow things will run. Learning
> to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.
>
> dean
>
> On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <om...@apache.org> wrote:
>
> See slide #9 from my Optimizing Hive Queries talk
> http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
> will improve it, but for now you are much better off with 1,000 partitions
> than 10,000.
>
> -- Owen
>
>
> On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ra...@gmail.com> wrote:
>
> Is it possible for you to send the explain plan of these two queries?
>
> Regards,
> Ramki.
>
>
> On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
> Sanjay.Subramanian@wizecommerce.com> wrote:
>
>  The slow down is most possibly due to large number of partitions.
> I believe the Hive book authors tell us to be cautious with large number
> of partitions :-)  and I abide by that.
>
>  Users
> Please add your points of view and experiences
>
>  Thanks
> sanjay
>
>   From: Ian <li...@yahoo.com>
> Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, Ian <
> liuhy0@yahoo.com>
> Date: Thursday, April 4, 2013 4:01 PM
> To: "user@hive.apache.org" <us...@hive.apache.org>
> Subject: Partition performance
>
>   Hi,
>
> I created 3 years of hourly log files (totally 26280 files), and use
> External Table with partition to query. I tried two partition methods.
>
> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per
> hour). Use date and hour as partition keys. Add 3 years of directories to
> the table partitions. So there are 26280 partitions.
>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
> string, hr int);
>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
> '/test1/2013/04/02/16';
>
> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per
> day, 24 files in each directory). Use date as partition key. Add 3 years of
> directories to the table partitions. So there are 1095 partitions.
>          CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
> string);
>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
> '/test2/2013/04/02';
>
> When doing a simple query like
>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
> '2013-02-14'
>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>
> I'm wondering why there is a big performance difference between these two?
> These two approaches have the same number of files, only the directory
> structure is different. So Hive is going to load the same amount of files.
> Why does the number of partitions have such big impact? Does that mean #2
> is a better partition strategy?
>
> Thanks.
>
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>
>
>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>
>

Re: Partition performance

Posted by Ian <li...@yahoo.com>.

Thanks. This is just a test from my local box. So each file is only 1kb. I shared the query plans of these two tests at:
http://codetidy.com/paste/raw/5198
http://codetidy.com/paste/raw/5199
 
Also in the Hadoop log, there is this line for each partition:org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work list for file hdfs://localhost:8020/test1/2011/02/01/01
Does that mean each partition will become a map task?
 
I'm still new in Hive, just wondering what are the common strategy for partitioning the hourly logs? I know we shouldn't have too many partitions but I'm wondering what's the reason behind it? If I run this on a real cluster, maybe it won't perform so differently?
 
Thanks.  

________________________________
 From: Dean Wampler <de...@thinkbiganalytics.com>
To: user@hive.apache.org 
Sent: Thursday, April 4, 2013 4:28 PM
Subject: Re: Partition performance
  

Also, how big are the files in each directory? Are they roughly the size of one HDFS block or a multiple. Lots of small files will mean lots of mapper tasks will little to do.

You can also compare the job tracker console output for each job. I bet the slow one has a lot of very short map and reduce tasks, while the faster one has fewer tasks that run longer. A rule of thumb is that any one task should take 20 seconds or more to amortize over the few seconds spent in start up per task. 

In other words, if you think about what's happening at the HDFS and MR level, you can learn to predict how fast or slow things will run. Learning to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this. 

dean


On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <om...@apache.org> wrote:

See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000.
>
>-- Owen
>
>
>
>On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ra...@gmail.com> wrote:
>
>Is it possible for you to send the explain plan of these two queries?
>>
>>Regards,
>>Ramki.
>>
>>
>>
>>
>>On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <Sa...@wizecommerce.com> wrote:
>>
>>The slow down is most possibly due to large number of partitions. 
>>>I believe the Hive book authors tell us to be cautious with large number of partitions :-)  and I abide by that. 
>>>
>>> 
>>>Users 
>>>Please add your points of view and experiences 
>>>
>>> 
>>>Thanks 
>>>sanjay 
>>>
>>> From: Ian <li...@yahoo.com>
>>>Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, Ian <li...@yahoo.com>
>>>Date: Thursday, April 4, 2013 4:01 PM
>>>To: "user@hive.apache.org" <us...@hive.apache.org>
>>>Subject: Partition performance
>>>
>>>
>>> 
>>>Hi, 
>>>
>>>I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods. 
>>>
>>>1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour). Use date and hour as partition keys. Add 3 years of directories to the table partitions. So there are 26280 partitions. 
>>>        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int); 
>>>        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16'; 
>>>  
>>>2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions.         CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); 
>>>        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02'; 
>>>  
>>>When doing a simple query like  
>>>    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'  
>>>Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.  
>>>
>>>I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy? 
>>>  
>>>Thanks.  
>>>
>>>   
>>>
>>>CONFIDENTIALITY NOTICE
>>>======================
>>>This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient,
 please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review
 and disclosure by the sender's Email System Administrator.
>>>  
>> 
> 


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

Re: Partition performance

Posted by Dean Wampler <de...@thinkbiganalytics.com>.

Also, how big are the files in each directory? Are they roughly the size of
one HDFS block or a multiple. Lots of small files will mean lots of mapper
tasks will little to do.

You can also compare the job tracker console output for each job. I bet the
slow one has a lot of very short map and reduce tasks, while the faster one
has fewer tasks that run longer. A rule of thumb is that any one task
should take 20 seconds or more to amortize over the few seconds spent in
start up per task.

In other words, if you think about what's happening at the HDFS and MR
level, you can learn to predict how fast or slow things will run. Learning
to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.

dean

On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <om...@apache.org> wrote:

> See slide #9 from my Optimizing Hive Queries talk
> http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
> will improve it, but for now you are much better off with 1,000 partitions
> than 10,000.
>
> -- Owen
>
>
> On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ra...@gmail.com> wrote:
>
>> Is it possible for you to send the explain plan of these two queries?
>>
>> Regards,
>> Ramki.
>>
>>
>> On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
>> Sanjay.Subramanian@wizecommerce.com> wrote:
>>
>>>  The slow down is most possibly due to large number of partitions.
>>> I believe the Hive book authors tell us to be cautious with large number
>>> of partitions :-)  and I abide by that.
>>>
>>>  Users
>>> Please add your points of view and experiences
>>>
>>>  Thanks
>>> sanjay
>>>
>>>   From: Ian <li...@yahoo.com>
>>> Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, Ian <
>>> liuhy0@yahoo.com>
>>> Date: Thursday, April 4, 2013 4:01 PM
>>> To: "user@hive.apache.org" <us...@hive.apache.org>
>>> Subject: Partition performance
>>>
>>>   Hi,
>>>
>>> I created 3 years of hourly log files (totally 26280 files), and use
>>> External Table with partition to query. I tried two partition methods.
>>>
>>> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory
>>> per hour). Use date and hour as partition keys. Add 3 years of directories
>>> to the table partitions. So there are 26280 partitions.
>>>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
>>> string, hr int);
>>>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16)
>>> LOCATION '/test1/2013/04/02/16';
>>>
>>> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory
>>> per day, 24 files in each directory). Use date as partition key. Add 3
>>> years of directories to the table partitions. So there are 1095 partitions.
>>>          CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY
>>> (dt string);
>>>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
>>> '/test2/2013/04/02';
>>>
>>> When doing a simple query like
>>>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
>>> '2013-02-14'
>>>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>>>
>>> I'm wondering why there is a big performance difference between these
>>> two? These two approaches have the same number of files, only the directory
>>> structure is different. So Hive is going to load the same amount of files.
>>> Why does the number of partitions have such big impact? Does that mean #2
>>> is a better partition strategy?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> CONFIDENTIALITY NOTICE
>>> ======================
>>> This email message and any attachments are for the exclusive use of the
>>> intended recipient(s) and may contain confidential and privileged
>>> information. Any unauthorized review, use, disclosure or distribution is
>>> prohibited. If you are not the intended recipient, please contact the
>>> sender by reply email and destroy all copies of the original message along
>>> with any attachments, from your computer system. If you are the intended
>>> recipient, please be advised that the content of this message is subject to
>>> access, review and disclosure by the sender's Email System Administrator.
>>>
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Partition performance

Posted by Dean Wampler <de...@gmail.com>.

How big were the files in each case in your experiment? Having lots of
small files will add Hadoop overhead.

Also, it would be useful to know the execution times of the map and reduce
tasks. The rule of thumb is that under 20 seconds each, or so, you're
paying a significant of the execution time in startup and shutdown overhead.

Of course, another factor is the number of tasks your cluster can run in
parallel. Scanning 20K partitions with a 1K MapReduce slot capacity over
the cluster will obviously take ~20 passes vs. ~1 pass for 1K partitions.

dean

On Tue, Jul 2, 2013 at 4:34 AM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  ...
>
> ** **
>
> *From: *Ian <li...@yahoo.com>
> *Reply-To: *"user@hive.apache.org" <us...@hive.apache.org>, Ian <
> liuhy0@yahoo.com>
> *Date: *Thursday, April 4, 2013 4:01 PM
> *To: *"user@hive.apache.org" <us...@hive.apache.org>
> *Subject: *Partition performance****
>
> ** **
>
> Hi,****
>
>  ****
>
> I created 3 years of hourly log files (totally 26280 files), and use
> External Table with partition to query. I tried two partition methods.****
>
>  ****
>
> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per
> hour). Use date and hour as partition keys. Add 3 years of directories to
> the table partitions. So there are 26280 partitions.****
>
>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
> string, hr int);****
>
>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
> '/test1/2013/04/02/16';****
>
>  ****
>
> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per
> day, 24 files in each directory). Use date as partition key. Add 3 years of
> directories to the table partitions. So there are 1095 partitions.****
>
>         CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
> string);****
>
>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
> '/test2/2013/04/02';****
>
>  ****
>
> When doing a simple query like ****
>
>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
> '2013-02-14'****
>
> Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. ****
>
>  ****
>
> I'm wondering why there is a big performance difference between these two?
> These two approaches have the same number of files, only the directory
> structure is different. So Hive is going to load the same amount of files.
> Why does the number of partitions have such big impact? Does that mean #2
> is a better partition strategy?****
>
>  ****
>
> Thanks.****
>
>  ****
>
>  ****
>
> ** **
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.*
> ***
>
> ** **
>
> ** **
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com

RE: Partition performance

Posted by Peter Marron <Pe...@trilliumsoftware.com>.

Sorry, just caught up with the last couple of day’s email and I feel that this question
has already been answered fairly comprehensively. Apologies.

Z

From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
Sent: 04 July 2013 08:37
To: user@hive.apache.org
Subject: RE: Partition performance

Hi,

Just to check that I understand this problem, my reading suggests that the overhead of
many partitions is currently unavoidable. Specifically this means that any query on a table that has, let’s say, 10,000 partitions
will be significantly slower (than on un-partitioned table with the “same” data) even if
the query explicitly specifies a single partition.
(I mean I _could_ actually do the experiments myself…)

Regards,

Z

From: Owen O'Malley [mailto:omalley@apache.org]
Sent: 02 July 2013 15:52
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Partition performance

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <Pe...@trilliumsoftware.com>> wrote:
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000?

The precise numbers don't matter. I wanted to give people a ballpark range that they should be looking at. Most tables at 1000 partitions won't cause big slow downs, but the cost scales with the number of partitions. By the time you are at 10,000 the cost is noticeable. I have one customer who has a table with 1.2 million partitions. That causes a lot of slow downs.

And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to be able to explain the situation.)

I think this is the one they will use: https://issues.apache.org/jira/browse/HIVE-4051

-- Owen

RE: Partition performance

Posted by Peter Marron <Pe...@trilliumsoftware.com>.

Hi,

Just to check that I understand this problem, my reading suggests that the overhead of
many partitions is currently unavoidable. Specifically this means that any query on a table that has, let’s say, 10,000 partitions
will be significantly slower (than on un-partitioned table with the “same” data) even if
the query explicitly specifies a single partition.
(I mean I _could_ actually do the experiments myself…)

Regards,

Z

From: Owen O'Malley [mailto:omalley@apache.org]
Sent: 02 July 2013 15:52
To: user@hive.apache.org
Subject: Re: Partition performance

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <Pe...@trilliumsoftware.com>> wrote:
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000?

The precise numbers don't matter. I wanted to give people a ballpark range that they should be looking at. Most tables at 1000 partitions won't cause big slow downs, but the cost scales with the number of partitions. By the time you are at 10,000 the cost is noticeable. I have one customer who has a table with 1.2 million partitions. That causes a lot of slow downs.

And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to be able to explain the situation.)

I think this is the one they will use: https://issues.apache.org/jira/browse/HIVE-4051

-- Owen

Re: Partition performance

Posted by Edward Capriolo <ed...@gmail.com>.

1) each partition object is a row in the metastore usually mysql, querying
large tables with many partitions has longer startup time as the hive query
planner has to fetch and process all of this meta-information. This is not
a distributed process. It is usually fast within a few seconds but for very
large partitions it can be slow.

2) hadoop's small files problem. <- google that. Small files end up being
much more overhead for a given map reduce job, generally the more
files/partitions the more map/reduce tasks. More map reduce tasks is more
overhead, more overhead is less throughput.

::SHAMELESS PLUG:: We discuss this in detail the book programming hive, in
the schema design section

On Wed, Jul 3, 2013 at 8:19 AM, David Morel <dm...@gmail.com> wrote:

> On 2 Jul 2013, at 16:51, Owen O'Malley wrote:
>
> > On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <
> > Peter.Marron@trilliumsoftware.com> wrote:
> >
> >> Hi Owen,****
> >>
> >> ** **
> >>
> >> I’m curious about this advice about partitioning. Is there some
> >> fundamental reason why Hive****
> >>
> >> is slow when the number of partitions is 10,000 rather than 1,000?
> >>
> >
> > The precise numbers don't matter. I wanted to give people a ballpark
> range
> > that they should be looking at. Most tables at 1000 partitions won't
> cause
> > big slow downs, but the cost scales with the number of partitions. By the
> > time you are at 10,000 the cost is noticeable. I have one customer who
> has
> > a table with 1.2 million partitions. That causes a lot of slow downs.
>
> That is still not really answering the question, which is: why is it slower
> to run a query on a heavily partitioned table than it is on the same number
> of files in a less heavily partitioned table.
>
> David
>

Re: Partition performance

Posted by Owen O'Malley <om...@apache.org>.

On Wed, Jul 3, 2013 at 5:19 AM, David Morel <dm...@gmail.com> wrote:

>
> That is still not really answering the question, which is: why is it slower
> to run a query on a heavily partitioned table than it is on the same number
> of files in a less heavily partitioned table.
>

According to Gopal's investigations in
https://issues.apache.org/jira/browse/HIVE-4051, each time Hive plans a
query, it does a query per a partition to the backing SQL database. That
would explain a lot of the latency for tables with large numbers of
partitions.

-- Owen

Re: Partition performance

Posted by David Morel <dm...@gmail.com>.

On 2 Jul 2013, at 16:51, Owen O'Malley wrote:

> On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <
> Peter.Marron@trilliumsoftware.com> wrote:
>
>> Hi Owen,****
>>
>> ** **
>>
>> I’m curious about this advice about partitioning. Is there some
>> fundamental reason why Hive****
>>
>> is slow when the number of partitions is 10,000 rather than 1,000?
>>
>
> The precise numbers don't matter. I wanted to give people a ballpark range
> that they should be looking at. Most tables at 1000 partitions won't cause
> big slow downs, but the cost scales with the number of partitions. By the
> time you are at 10,000 the cost is noticeable. I have one customer who has
> a table with 1.2 million partitions. That causes a lot of slow downs.

That is still not really answering the question, which is: why is it slower
to run a query on a heavily partitioned table than it is on the same number 
of files in a less heavily partitioned table.

David

Re: Partition performance

Posted by Owen O'Malley <om...@apache.org>.

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  Hi Owen,****
>
> ** **
>
> I’m curious about this advice about partitioning. Is there some
> fundamental reason why Hive****
>
> is slow when the number of partitions is 10,000 rather than 1,000?
>

The precise numbers don't matter. I wanted to give people a ballpark range
that they should be looking at. Most tables at 1000 partitions won't cause
big slow downs, but the cost scales with the number of partitions. By the
time you are at 10,000 the cost is noticeable. I have one customer who has
a table with 1.2 million partitions. That causes a lot of slow downs.

> And the improvements****
>
> that you mention are they going to be in version 12? Is there a JIRA
> raised so that I can track them?****
>
> (It’s not currently a problem for me but I can see that I am going to need
> to be able to explain the situation.)
>

I think this is the one they will use:
https://issues.apache.org/jira/browse/HIVE-4051

-- Owen

RE: Partition performance

Posted by Peter Marron <Pe...@trilliumsoftware.com>.

Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000? And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to be able to explain the situation.)

Warm regards,

Z

From: Owen O'Malley [mailto:omalley@apache.org]
Sent: 05 April 2013 00:26
To: user@hive.apache.org
Subject: Re: Partition performance

See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000.

-- Owen

On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ra...@gmail.com>> wrote:
Is it possible for you to send the explain plan of these two queries?
Regards,
Ramki.

On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <Sa...@wizecommerce.com>> wrote:
The slow down is most possibly due to large number of partitions.
I believe the Hive book authors tell us to be cautious with large number of partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay

From: Ian <li...@yahoo.com>>
Reply-To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>, Ian <li...@yahoo.com>>
Date: Thursday, April 4, 2013 4:01 PM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Subject: Partition performance

Hi,

I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods.

1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour). Use date and hour as partition keys. Add 3 years of directories to the table partitions. So there are 26280 partitions.
        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int);
        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16';

2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions.
        CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02';

When doing a simple query like
    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy?

Thanks.



CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Partition performance

Posted by Owen O'Malley <om...@apache.org>.

See slide #9 from my Optimizing Hive Queries talk
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
will improve it, but for now you are much better off with 1,000 partitions
than 10,000.

-- Owen


On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ra...@gmail.com> wrote:

> Is it possible for you to send the explain plan of these two queries?
>
> Regards,
> Ramki.
>
>
> On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
> Sanjay.Subramanian@wizecommerce.com> wrote:
>
>>  The slow down is most possibly due to large number of partitions.
>> I believe the Hive book authors tell us to be cautious with large number
>> of partitions :-)  and I abide by that.
>>
>>  Users
>> Please add your points of view and experiences
>>
>>  Thanks
>> sanjay
>>
>>   From: Ian <li...@yahoo.com>
>> Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, Ian <
>> liuhy0@yahoo.com>
>> Date: Thursday, April 4, 2013 4:01 PM
>> To: "user@hive.apache.org" <us...@hive.apache.org>
>> Subject: Partition performance
>>
>>   Hi,
>>
>> I created 3 years of hourly log files (totally 26280 files), and use
>> External Table with partition to query. I tried two partition methods.
>>
>> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory
>> per hour). Use date and hour as partition keys. Add 3 years of directories
>> to the table partitions. So there are 26280 partitions.
>>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
>> string, hr int);
>>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16)
>> LOCATION '/test1/2013/04/02/16';
>>
>> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory
>> per day, 24 files in each directory). Use date as partition key. Add 3
>> years of directories to the table partitions. So there are 1095 partitions.
>>          CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
>> string);
>>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
>> '/test2/2013/04/02';
>>
>> When doing a simple query like
>>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
>> '2013-02-14'
>>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>>
>> I'm wondering why there is a big performance difference between these
>> two? These two approaches have the same number of files, only the directory
>> structure is different. So Hive is going to load the same amount of files.
>> Why does the number of partitions have such big impact? Does that mean #2
>> is a better partition strategy?
>>
>> Thanks.
>>
>>
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.
>>
>
>

Re: Correct syntax for EXPLAIN DEPENDENCY

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

Thanks
I did upgrade but got stumped with this so reverted back
https://issues.cloudera.org/browse/DISTRO-461

Regards
sanjay

On 4/4/13 7:37 PM, "Jarek Jarcec Cecho" <ja...@apache.org> wrote:

>Hi Sanjay,
>you can upgrade to CDH4.2.0 that contains Hive 0.10.
>
>Jarcec
>
>On Fri, Apr 05, 2013 at 01:48:39AM +0000, Sanjay Subramanian wrote:
>> Ah its available only in 0.10.0 :-(
>> And I am still using 0.9.x from the CDH4.1.2 distribution
>>
>>
>> From: Sanjay Subramanian
>><sanjay.subramanian@wizecommerce.com<mailto:sanjay.subramanian@wizecommer
>>ce.com>>
>> Reply-To: "user@hive.apache.org<ma...@hive.apache.org>"
>><us...@hive.apache.org>>
>> Date: Thursday, April 4, 2013 6:40 PM
>> To: "user@hive.apache.org<ma...@hive.apache.org>"
>><us...@hive.apache.org>>
>> Subject: Correct syntax for EXPLAIN DEPENDENCY
>>
>> Hi
>> Whats the correct syntax for EXPLAIN DEPENDENCY ?
>>
>> Query
>> ======
>> /usr/lib/hive/bin/hive -e "explain dependency select * from
>>channel_market_lang where channelid > 29000"
>>
>> org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot
>>recognize input near 'plan' 'dependency' 'select' in statement
>>
>>         at
>>org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
>>         at
>>org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
>>         at
>>org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
>>         at
>>org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>>         at
>>org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java
>>:39)
>>         at
>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorI
>>mpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>
>>
>> I was referring to this doc..is there another doc ?
>>
>>https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualE
>>xplain-EXPLAINSyntax
>>
>> Thanks
>> sanjay
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>>intended recipient(s) and may contain confidential and privileged
>>information. Any unauthorized review, use, disclosure or distribution is
>>prohibited. If you are not the intended recipient, please contact the
>>sender by reply email and destroy all copies of the original message
>>along with any attachments, from your computer system. If you are the
>>intended recipient, please be advised that the content of this message
>>is subject to access, review and disclosure by the sender's Email System
>>Administrator.
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>>intended recipient(s) and may contain confidential and privileged
>>information. Any unauthorized review, use, disclosure or distribution is
>>prohibited. If you are not the intended recipient, please contact the
>>sender by reply email and destroy all copies of the original message
>>along with any attachments, from your computer system. If you are the
>>intended recipient, please be advised that the content of this message
>>is subject to access, review and disclosure by the sender's Email System
>>Administrator.


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Correct syntax for EXPLAIN DEPENDENCY

Posted by Jarek Jarcec Cecho <ja...@apache.org>.

Hi Sanjay,
you can upgrade to CDH4.2.0 that contains Hive 0.10.

Jarcec

On Fri, Apr 05, 2013 at 01:48:39AM +0000, Sanjay Subramanian wrote:
> Ah its available only in 0.10.0 :-(
> And I am still using 0.9.x from the CDH4.1.2 distribution
> 
> 
> From: Sanjay Subramanian <sa...@wizecommerce.com>>
> Reply-To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
> Date: Thursday, April 4, 2013 6:40 PM
> To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
> Subject: Correct syntax for EXPLAIN DEPENDENCY
> 
> Hi
> Whats the correct syntax for EXPLAIN DEPENDENCY ?
> 
> Query
> ======
> /usr/lib/hive/bin/hive -e "explain dependency select * from channel_market_lang where channelid > 29000"
> 
> org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input near 'plan' 'dependency' 'select' in statement
> 
>         at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
>         at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
>         at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> 
> 
> I was referring to this doc..is there another doc ?
> https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualExplain-EXPLAINSyntax
> 
> Thanks
> sanjay
> 
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
> 
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Correct syntax for EXPLAIN DEPENDENCY

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

Ah its available only in 0.10.0 :-(
And I am still using 0.9.x from the CDH4.1.2 distribution

From: Sanjay Subramanian <sa...@wizecommerce.com>>
Reply-To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Date: Thursday, April 4, 2013 6:40 PM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Subject: Correct syntax for EXPLAIN DEPENDENCY

Hi
Whats the correct syntax for EXPLAIN DEPENDENCY ?

Query
======
/usr/lib/hive/bin/hive -e "explain dependency select * from channel_market_lang where channelid > 29000"

org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input near 'plan' 'dependency' 'select' in statement

        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

I was referring to this doc..is there another doc ?
https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualExplain-EXPLAINSyntax

Thanks
sanjay

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Correct syntax for EXPLAIN DEPENDENCY

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

Hi
Whats the correct syntax for EXPLAIN DEPENDENCY ?

Query
======
/usr/lib/hive/bin/hive -e "explain dependency select * from channel_market_lang where channelid > 29000"

org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input near 'plan' 'dependency' 'select' in statement

        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)


I was referring to this doc..is there another doc ?
https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualExplain-EXPLAINSyntax

Thanks
sanjay

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Partition performance

Posted by Ramki Palle <ra...@gmail.com>.

Is it possible for you to send the explain plan of these two queries?

Regards,
Ramki.


On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  The slow down is most possibly due to large number of partitions.
> I believe the Hive book authors tell us to be cautious with large number
> of partitions :-)  and I abide by that.
>
>  Users
> Please add your points of view and experiences
>
>  Thanks
> sanjay
>
>   From: Ian <li...@yahoo.com>
> Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, Ian <
> liuhy0@yahoo.com>
> Date: Thursday, April 4, 2013 4:01 PM
> To: "user@hive.apache.org" <us...@hive.apache.org>
> Subject: Partition performance
>
>   Hi,
>
> I created 3 years of hourly log files (totally 26280 files), and use
> External Table with partition to query. I tried two partition methods.
>
> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per
> hour). Use date and hour as partition keys. Add 3 years of directories to
> the table partitions. So there are 26280 partitions.
>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
> string, hr int);
>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
> '/test1/2013/04/02/16';
>
> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per
> day, 24 files in each directory). Use date as partition key. Add 3 years of
> directories to the table partitions. So there are 1095 partitions.
>          CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
> string);
>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
> '/test2/2013/04/02';
>
> When doing a simple query like
>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
> '2013-02-14'
>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>
> I'm wondering why there is a big performance difference between these two?
> These two approaches have the same number of files, only the directory
> structure is different. So Hive is going to load the same amount of files.
> Why does the number of partitions have such big impact? Does that mean #2
> is a better partition strategy?
>
> Thanks.
>
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>

Re: Partition performance

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.

The slow down is most possibly due to large number of partitions.
I believe the Hive book authors tell us to be cautious with large number of partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay

From: Ian <li...@yahoo.com>>
Reply-To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>, Ian <li...@yahoo.com>>
Date: Thursday, April 4, 2013 4:01 PM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Subject: Partition performance

Hi,

I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods.

1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per hour). Use date and hour as partition keys. Add 3 years of directories to the table partitions. So there are 26280 partitions.
        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int);
        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16';

2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions.
        CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02';

When doing a simple query like
    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy?

Thanks.



CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.