Posted to user@hive.apache.org by rk vishu <ta...@gmail.com> on 2012/02/18 10:39:48 UTC

Re: better partitioning strategy in hive

> Hello All,
>
> We have a Hive table partitioned by date and hour (330 columns). We have 5
> years' worth of data in the table. Each hourly partition has around 800 MB,
> so 43,800 partitions in total, with one file per partition.
>
> When we run select count(*) from the table, Hive takes forever to submit
> the job. I waited for 20 minutes and killed it. If I run it over a single
> month, it takes a little time to submit the job, but at least Hive is able
> to get the work done.
>
> Questions:
> 1) First of all, why is Hive not able to even submit the job? Is it taking
> forever to query the list of partitions from the metastore? Fetching 43K
> records should not be a big deal at all?
> 2) In order to improve my situation, what are my options? I can think
> of changing the partition strategy to daily partitions instead of hourly.
> What would be the ideal partitioning strategy?
> 3) If we have one partition per day and 24 files under it (i.e. fewer
> partitions but the same number of files), will it improve anything, or will
> I have the same issue?
> 4) Are there any special input formats or tricks to handle this?
> 5) When I tried to insert into a different table by selecting a whole
> day's data, Hive generated 164 mappers with map-only jobs, hence creating
> many output files. How can I force Hive to create one output file instead
> of many? Setting mapred.reduce.tasks=1 does not even generate reduce tasks.
> What can I do to achieve this?
>
>
> -RK
>
>
>
>
>
>

Re: better partitioning strategy in hive

Posted by Ravikumar MAV <ra...@gmail.com>.
Thank you very much for the reply. It helps.

As you mentioned in one of the points, I tried daily partitions with Snappy
SequenceFile compression. That performed much better than any other option. I
am able to run moderately complex queries on 25 TB of data (uncompressed size)
and see the results in under 20 minutes.
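For reference, a daily-partitioned table stored as Snappy-compressed SequenceFiles can be declared roughly as below. This is only a sketch: the table and column names are made up for illustration, and the compression settings are the Hadoop/Hive properties of that era.

```sql
-- Hypothetical daily-partitioned table; names are illustrative.
CREATE TABLE events_daily (
  event_id BIGINT,
  event_time STRING
  -- ... remaining columns ...
)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

-- Enable Snappy block compression for job output before inserting.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

INSERT OVERWRITE TABLE events_daily PARTITION (dt='2012-02-18')
SELECT event_id, event_time FROM events_hourly WHERE dt='2012-02-18';
```

BLOCK-level compression is generally the right choice for SequenceFiles, since RECORD-level compression compresses each record independently and gets much worse ratios.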

Thanks and Regards
Ravi

On Fri, Mar 2, 2012 at 8:20 AM, Mark Grover <mg...@oanda.com> wrote:

> Sorry about the delayed response, RK.
>
> Here is what I think:
> 1) First of all, why is Hive not able to even submit the job? Is it taking
> forever to query the list of partitions from the metastore? Fetching 43K
> records should not be a big deal at all?
>
> --> Hive is possibly taking a long time to figure out which partitions it
> needs to query. I experienced the same problem when I had a lot of
> partitions (with relatively small files). I went back to having fewer
> partitions with larger files, and that fixed the problem. Finding the
> balance between how many partitions you want and how big you want each
> partition to be is tricky, but, in general, it's better to have fewer
> partitions. You want to be aware of the small-files problem. It has been
> discussed in many places. Some links are:
>
> http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html
>
> 2) In order to improve my situation, what are my options? I can think
> of changing the partition strategy to daily partitions instead of hourly.
> What would be the ideal partitioning strategy?
>
> --> I would say that's a good step forward.
>
> 3) If we have one partition per day and 24 files under it (i.e. fewer
> partitions but the same number of files), will it improve anything, or will
> I have the same issue?
>
> --> You probably wouldn't have the same issue; if you still do, it
> wouldn't be as bad. Since the number of partitions has been reduced by a
> factor of 24, Hive doesn't have to go through as many partitions.
> However, queries that look for data in a particular hour on a given
> day will be slower now that you don't have hour as a partition.
>
> 4) Are there any special input formats or tricks to handle this?
>
> --> This is a separate question. The format, SerDe, and compression you
> use for your data are part of the design but aren't necessarily linked to
> the problem in question.
>
> 5) When I tried to insert into a different table by selecting a whole
> day's data, Hive generated 164 mappers with map-only jobs, hence creating
> many output files. How can I force Hive to create one output file instead
> of many? Setting mapred.reduce.tasks=1 does not even generate reduce tasks.
> What can I do to achieve this?
>
> --> mapred.reduce.tasks wouldn't help because the job is map-only and has
> no reduce tasks. You should look into the hive.merge.* properties. Setting
> them in your hive-site.xml will do the trick. You can refer to this
> template (
> https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template)
> to see what properties exist.
>
> Good luck!
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgrover@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>


-- 
- Ravi MAV

Re: better partitioning strategy in hive

Posted by Mark Grover <mg...@oanda.com>.
Sorry about the delayed response, RK.

Here is what I think:
1) First of all, why is Hive not able to even submit the job? Is it taking forever to query the list of partitions from the metastore? Fetching 43K records should not be a big deal at all?

--> Hive is possibly taking a long time to figure out which partitions it needs to query. I experienced the same problem when I had a lot of partitions (with relatively small files). I went back to having fewer partitions with larger files, and that fixed the problem. Finding the balance between how many partitions you want and how big you want each partition to be is tricky, but, in general, it's better to have fewer partitions. You want to be aware of the small-files problem. It has been discussed in many places. Some links are:
http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html

2) In order to improve my situation, what are my options? I can think of changing the partition strategy to daily partitions instead of hourly. What would be the ideal partitioning strategy?

--> I would say that's a good step forward.

3) If we have one partition per day and 24 files under it (i.e. fewer partitions but the same number of files), will it improve anything, or will I have the same issue?

--> You probably wouldn't have the same issue; if you still do, it wouldn't be as bad. Since the number of partitions has been reduced by a factor of 24, Hive doesn't have to go through as many partitions. However, queries that look for data in a particular hour on a given day will be slower now that you don't have hour as a partition.
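One common way to keep hour-level queries workable under daily partitioning is to keep the hour as an ordinary column in the table and filter on it; partition pruning still narrows the scan to one day's worth of files. A sketch, with made-up table and column names:

```sql
-- 'dt' is the daily partition column; 'hr' is a plain (non-partition)
-- column kept in the data. Pruning limits the scan to the one dt
-- partition; the hr predicate is then applied to that day's rows.
SELECT count(*)
FROM events_daily
WHERE dt = '2012-02-18'
  AND hr = 5;
```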

4) Are there any special input formats or tricks to handle this?

--> This is a separate question. The format, SerDe, and compression you use for your data are part of the design but aren't necessarily linked to the problem in question.

5) When I tried to insert into a different table by selecting a whole day's data, Hive generated 164 mappers with map-only jobs, hence creating many output files. How can I force Hive to create one output file instead of many? Setting mapred.reduce.tasks=1 does not even generate reduce tasks. What can I do to achieve this?

--> mapred.reduce.tasks wouldn't help because the job is map-only and has no reduce tasks. You should look into the hive.merge.* properties. Setting them in your hive-site.xml will do the trick. You can refer to this template (https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template) to see what properties exist.
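As a sketch, the merge-related properties look something like the following in hive-site.xml. The property names are the hive.merge.* family from Hive's default template; the size values here are illustrative, so check the template for the actual defaults in your version:

```xml
<!-- Merge the small output files of map-only jobs into larger files. -->
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>

<!-- Do the same for the output of map-reduce jobs. -->
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>

<!-- Target size of each merged file, in bytes (illustrative value). -->
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>

<!-- Trigger a merge job when the average output file size is below
     this threshold, in bytes (illustrative value). -->
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>
```

With these enabled, Hive launches an extra merge stage after the map-only insert, so a whole day's output can land in a few large files rather than one file per mapper.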

Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgrover@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "rk vishu" <ta...@gmail.com>
To: cdh-user@cloudera.org, common-user@hadoop.apache.org, user@hive.apache.org
Sent: Saturday, February 18, 2012 4:39:48 AM
Subject: Re: better partitioning strategy in hive




