You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Stephen Boesch <ja...@gmail.com> on 2015/02/24 07:26:02 UTC

Select distinct on partitioned column requires reading all the files?

When querying a hive table according to a partitioning column, it would be
logical that a simple

select count(distinct partitioned_column_name) from my_partitioned_table

would complete almost instantaneously.

But we are seeing that both hive and impala are unable to execute this
query properly: they just read the entire table!

What do we need to do to ensure the above command executes rapidly?

Re: Select distinct on partitioned column requires reading all the files?

Posted by Stephen Boesch <ja...@gmail.com>.
Great thanks.  Is this a server-side-only /requires restart parameter?

2015-02-23 22:36 GMT-08:00 Gopal Vijayaraghavan <go...@apache.org>:

> Hi,
>
> Are you sure you have
>
> hive.optimize.metadataonly=true ?
>
> I’m not saying it will complete instantaneously (possibly even be very
> slow, due to the lack of a temp-table optimization of that), but it won’t
> read any part of the actual table.
>
> Cheers,
> Gopal
>
> From: Stephen Boesch <ja...@gmail.com>
> Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
> Date: Monday, February 23, 2015 at 10:26 PM
> To: "user@hive.apache.org" <us...@hive.apache.org>
> Subject: Select distinct on partitioned column requires reading all the
> files?
>
>
> When querying a hive table according to a partitioning column, it would be
> logical that a simple
>
> select count(distinct partitioned_column_name) from my_partitioned_table
>
> would complete almost instantaneously.
>
> But we are seeing that both hive and impala are unable to execute this
> query properly: they just read the entire table!
>
> What do we need to do to ensure the above command executes rapidly?
>

Re: Select distinct on partitioned column requires reading all the files?

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,

Are you sure you have

hive.optimize.metadataonly=true ?

I¹m not saying it will complete instantaneously (possibly even be very slow,
due to the lack of a temp-table optimization of that), but it won¹t read any
part of the actual table.

Cheers,
Gopal

From:  Stephen Boesch <ja...@gmail.com>
Reply-To:  "user@hive.apache.org" <us...@hive.apache.org>
Date:  Monday, February 23, 2015 at 10:26 PM
To:  "user@hive.apache.org" <us...@hive.apache.org>
Subject:  Select distinct on partitioned column requires reading all the
files?


When querying a hive table according to a partitioning column, it would be
logical that a simple
select count(distinct partitioned_column_name) from my_partitioned_table
would complete almost instantaneously.
But we are seeing that both hive and impala are unable to execute this query
properly: they just read the entire table!
What do we need to do to ensure the above command executes rapidly?