Posted to user@hive.apache.org by VJ Anand <vj...@sankia.com> on 2016/12/15 19:54:59 UTC

Re: Hive/TEZ/Parquet

I have a partitioned external Hive table, stored as Parquet files. The
table is partitioned by year/month/day/hour/minute. I have two top-level
year directories (the data spans two years), and the total number of
records is 50 million. My cluster is 5 nodes, each with 8 cores and 64GB
of RAM - 40 cores and 320GB in total. I am running Hive with Tez as the
execution engine, with each container set to 4GB and 1 vcore.
Additionally, I set both the Tez min and max input split sizes to the
same value, 36MB.
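
For reference, assuming the standard Hive-on-Tez property names, the
settings were along these lines (a sketch, not an exact transcript):

  set hive.tez.container.size=4096;    -- 4GB per container
  set hive.tez.cpu.vcores=1;           -- 1 vcore per container
  set tez.grouping.min-size=37748736;  -- 36MB minimum split size
  set tez.grouping.max-size=37748736;  -- 36MB maximum split size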

When I submit the query "select count(*) from table", I see that it
allocates 43 map tasks and 1 reducer task.

I am seeing that it takes more than 1 hour to complete. Any thoughts on
what could be the issue, or on an approach that could improve the
performance?

Thanks
VJ

Re: Hive/TEZ/Parquet

Posted by Gopal Vijayaraghavan <go...@apache.org>.
 
> Actually, we don't have that many partitions - there are a lot of gaps, both in days and in event times.

Your partition description sounded a lot like one of the FAQs from Mithun's talks, which is why I asked:

http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches/24

> But I would like to understand what you mean by "time spent might partly be query planning with a million partitions". I presume this is in producing the physical plan? Does it spend time allocating groups of partition directories to each map task?

Yes, the physical planner is a significant overhead: every map task gets a list of all partitions and matches each read against that list (partition schemas can evolve, so this is closer to O(n^2)), and split generation is bottlenecked by the total number of files involved in the operation (a slow O(n) operation is still slow).

The overhead of maintaining partitions is fairly high, and if the basic file statistics are missing, query planning will do something like a "du -sh" on each partition.
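
A quick way to see how much metadata the planner has to chew through (table name is just an example):

  show partitions events;               -- every entry here is matched during planning
  explain select count(*) from events;  -- the time this takes is pure planning cost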

Also, if you have >100k partitions, disabling the metadata optimizer (hive.optimize.metadataonly=false) would be a good thing, since that codepath is single-threaded, while a compute-heavy full-table scan is much faster due to parallelism - it might burn more CPU, but it would come back in less than an hour.
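
For example, at the session level (hypothetical table name again):

  set hive.optimize.metadataonly=false;
  -- the count now runs as a parallel full scan instead of going
  -- through the single-threaded metadata-only codepath
  select count(*) from events;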

You might want to opt for daily partitions, and also run the stats-gathering ops "analyze table <table> compute statistics partialscan;" and "analyze table <table> compute statistics for columns;" to speed up further queries.
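
For a partitioned table, that would look something like this (table name is hypothetical, and partialscan support depends on your file format and Hive version, so YMMV):

  analyze table events partition (year, month, day) compute statistics partialscan;
  analyze table events partition (year, month, day) compute statistics for columns;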

At least in my experience with ORC + valid stats, a query like "select count(*) from table" should take <500 milliseconds.
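
That is because, with valid basic stats, the count is answered straight from the metastore without scanning anything, assuming the stats-based rewrite is on (hypothetical table name):

  set hive.compute.query.using.stats=true;
  select count(*) from events;  -- served from table/partition stats, no scan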

Cheers,
Gopal




Re: Hive/TEZ/Parquet

Posted by VJ Anand <vj...@sankia.com>.
Thanks Gopal for the feedback.

Actually, we don't have that many partitions - there are a lot of gaps,
both in days and in event times. But I would like to understand what you
mean by "time spent might partly be query planning with a million
partitions". I presume this is in producing the physical plan? Does it
spend time allocating groups of partition directories to each map task?
Can you elaborate, or point me to any material to better understand this?

Thanks
VJ

On Thu, Dec 15, 2016 at 1:03 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> > The partition is by year/month/day/hour/minute. I have two top-level
> > year directories (the data spans two years), and the total number of
> > records is 50 million.
>
> That's about a million partitions (two years of minutes is ~1.05M), with roughly 50 rows in each of them?
>
> > I am seeing that it takes more than 1 hour to complete. Any thoughts on
> > what could be the issue, or on an approach that could improve the
> > performance?
>
> Looks like you have over-partitioned your data massively - the 1 hour
> might be partly query planning with a million partitions, and the rest
> might be file-count-related overheads.
>
> At least in the case of ORC, I recommend that partitions contain at least
> 1GB of data and that, if you really need to query down to finer levels,
> you use bloom filters (PARQUET-41 is not fixed yet, so YMMV) + sorted
> ordering.
>
> http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4
>
> Cheers,
> Gopal
>
>
>


-- 
*VJ Anand*
*Founder *
*Sankia*
vjanand@sankia.com
925-640-1340
www.sankia.com


Re: Hive/TEZ/Parquet

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> The partition is by year/month/day/hour/minute. I have two top-level year directories (the data spans two years), and the total number of records is 50 million.

That's about a million partitions (two years of minutes is ~1.05M), with roughly 50 rows in each of them?

> I am seeing that it takes more than 1 hour to complete. Any thoughts on what could be the issue, or on an approach that could improve the performance?

Looks like you have over-partitioned your data massively - the 1 hour might be partly query planning with a million partitions, and the rest might be file-count-related overheads.

At least in the case of ORC, I recommend that partitions contain at least 1GB of data and that, if you really need to query down to finer levels, you use bloom filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering.

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4
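
A minimal sketch of that layout for ORC (table and column names are hypothetical):

  create table events_orc (event_id string, payload string)
  partitioned by (ds string)
  stored as orc
  tblproperties ("orc.bloom.filter.columns"="event_id");

  -- keep each daily partition sorted so the bloom filters and min/max
  -- indexes stay selective
  set hive.exec.dynamic.partition.mode=nonstrict;
  insert overwrite table events_orc partition (ds)
  select event_id, payload, ds from events
  sort by event_id;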

Cheers,
Gopal