You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Deepak A <de...@gmail.com> on 2009/07/16 12:25:39 UTC

Selecting data based on the clustered columns

Hi,
I have the following table in Hive
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
INTO 256 BUCKETS;

Since the data is hash partitioned based on the 'UserId' column, buckets
were created based on the hash value of 'UserId'.

Now, when I issue a Select query to fetch all the posts by a particular
'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan
only the bucket to which 'UserId' is hashed to?. But, when I run this query,
I could see all the buckets being searched for the UserId.

Moreover, I see that's there is a way to sample the table based on the
buckets. Why can't hive automatically figure out the bucket to which UserId
is hashed to and search only in that bucket?

Can someone clarify me on this?

Thanks,
Deepak

Re: Selecting data based on the clustered columns

Posted by Deepak A <de...@gmail.com>.

That would be great.!Thanks a lot.

-Deepak

On Fri, Jul 17, 2009 at 11:46 AM, Prasad Chakka <pc...@facebook.com>wrote:

>  Yeah, we know of this optimization and and will add it as we start
> optimizing filter queries using indexes.
>
> ------------------------------
> *From: *Namit Jain <nj...@facebook.com>
> *Reply-To: *<hi...@hadoop.apache.org>
> *Date: *Thu, 16 Jul 2009 22:42:25 -0700
> *To: *<hi...@hadoop.apache.org>
> *Subject: *Re: Selecting data based on the clustered columns
>
>
> I am not sure if they are handling this. Let me talk to Prasad offline and
> get back to you.
>
>
>
> On 7/16/09 9:49 PM, "Deepak A" <de...@gmail.com> wrote:
>
> Hi Namit,
>
> I checked JIRA for any existing tickets on this and figured out that there
> are plans to support indexing on queries. This is being discussed at
> https://issues.apache.org/jira/browse/HIVE-417
>
> Can you please check if what we are discussing makes sense in this content
> or if it is orthogonal to this.
>
> -Deepak
>
> On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <de...@gmail.com> wrote:
>
> Hi Namit,
>
> Thanks a lot on the update.
> Will do that for sure.
>
> -Deepak
>
>
> On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Right now, bucketing information is not used in a lot of places – it is
> only used in sampling.
> For eg:
>
> If your query was:
>
> Select .. From Posts(tablesample 1 out of 256) a;
>
> Then only the first bucket will be scanned.
>
> Your query can be optimized, but currently it is not. Can you file a jira
> on that ?
> It will help us prioritize this.
>
>
>
> -namit
>
>
>
> On 7/16/09 3:25 AM, "Deepak A" <de...@gmail.com> wrote:
>
> Hi,
>
> I have the following table in Hive
> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
> INTO 256 BUCKETS;
>
> Since the data is hash partitioned based on the 'UserId' column, buckets
> were created based on the hash value of 'UserId'.
>
> Now, when I issue a Select query to fetch all the posts by a particular
> 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan
> only the bucket to which 'UserId' is hashed to?. But, when I run this query,
> I could see all the buckets being searched for the UserId.
>
> Moreover, I see that's there is a way to sample the table based on the
> buckets. Why can't hive automatically figure out the bucket to which UserId
> is hashed to and search only in that bucket?
>
> Can someone clarify me on this?
>
> Thanks,
> Deepak
>
>
>
>
>
>

Re: Selecting data based on the clustered columns

Posted by Prasad Chakka <pc...@facebook.com>.

Yeah, we know of this optimization and and will add it as we start optimizing filter queries using indexes.

________________________________
From: Namit Jain <nj...@facebook.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Thu, 16 Jul 2009 22:42:25 -0700
To: <hi...@hadoop.apache.org>
Subject: Re: Selecting data based on the clustered columns

I am not sure if they are handling this. Let me talk to Prasad offline and get back to you.

On 7/16/09 9:49 PM, "Deepak A" <de...@gmail.com> wrote:

Hi Namit,

I checked JIRA for any existing tickets on this and figured out that there are plans to support indexing on queries. This is being discussed at https://issues.apache.org/jira/browse/HIVE-417

Can you please check if what we are discussing makes sense in this content or if it is orthogonal to this.

-Deepak

On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <de...@gmail.com> wrote:
Hi Namit,

Thanks a lot on the update.
Will do that for sure.

-Deepak

On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <nj...@facebook.com> wrote:
Right now, bucketing information is not used in a lot of places - it is only used in sampling.
For eg:

If your query was:

Select .. From Posts(tablesample 1 out of 256) a;

Then only the first bucket will be scanned.

Your query can be optimized, but currently it is not. Can you file a jira on that ?
It will help us prioritize this.

-namit

On 7/16/09 3:25 AM, "Deepak A" <de...@gmail.com> wrote:

Hi,

I have the following table in Hive
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate) INTO 256 BUCKETS;

Since the data is hash partitioned based on the 'UserId' column, buckets were created based on the hash value of 'UserId'.

Now, when I issue a Select query to fetch all the posts by a particular 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan only the bucket to which 'UserId' is hashed to?. But, when I run this query, I could see all the buckets being searched for the UserId.

Moreover, I see that's there is a way to sample the table based on the buckets. Why can't hive automatically figure out the bucket to which UserId is hashed to and search only in that bucket?

Can someone clarify me on this?

Thanks,
Deepak

Re: Selecting data based on the clustered columns

Posted by Namit Jain <nj...@facebook.com>.

I am not sure if they are handling this. Let me talk to Prasad offline and get back to you.

On 7/16/09 9:49 PM, "Deepak A" <de...@gmail.com> wrote:

Hi Namit,

I checked JIRA for any existing tickets on this and figured out that there are plans to support indexing on queries. This is being discussed at https://issues.apache.org/jira/browse/HIVE-417

Can you please check if what we are discussing makes sense in this content or if it is orthogonal to this.

-Deepak

On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <de...@gmail.com> wrote:
Hi Namit,

Thanks a lot on the update.
Will do that for sure.

-Deepak

On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <nj...@facebook.com> wrote:
Right now, bucketing information is not used in a lot of places - it is only used in sampling.
For eg:

If your query was:

Select .. From Posts(tablesample 1 out of 256) a;

Then only the first bucket will be scanned.

Your query can be optimized, but currently it is not. Can you file a jira on that ?
It will help us prioritize this.

-namit

On 7/16/09 3:25 AM, "Deepak A" <de...@gmail.com> wrote:

Hi,

I have the following table in Hive
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate) INTO 256 BUCKETS;

Since the data is hash partitioned based on the 'UserId' column, buckets were created based on the hash value of 'UserId'.

Now, when I issue a Select query to fetch all the posts by a particular 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan only the bucket to which 'UserId' is hashed to?. But, when I run this query, I could see all the buckets being searched for the UserId.

Moreover, I see that's there is a way to sample the table based on the buckets. Why can't hive automatically figure out the bucket to which UserId is hashed to and search only in that bucket?

Can someone clarify me on this?

Thanks,
Deepak

Re: Selecting data based on the clustered columns

Posted by Deepak A <de...@gmail.com>.

Hi Namit,
I checked JIRA for any existing tickets on this and figured out that there
are plans to support indexing on queries. This is being discussed at
https://issues.apache.org/jira/browse/HIVE-417

Can you please check if what we are discussing makes sense in this content
or if it is orthogonal to this.

-Deepak

On Thu, Jul 16, 2009 at 10:26 PM, Deepak A <de...@gmail.com> wrote:

> Hi Namit,
> Thanks a lot on the update.
> Will do that for sure.
>
> -Deepak
>
>
> On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <nj...@facebook.com> wrote:
>
>>  Right now, bucketing information is not used in a lot of places – it is
>> only used in sampling.
>> For eg:
>>
>> If your query was:
>>
>> Select .. From Posts(tablesample 1 out of 256) a;
>>
>> Then only the first bucket will be scanned.
>>
>> Your query can be optimized, but currently it is not. Can you file a jira
>> on that ?
>> It will help us prioritize this.
>>
>>
>>
>> -namit
>>
>>
>>
>> On 7/16/09 3:25 AM, "Deepak A" <de...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have the following table in Hive
>> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY
>> (PostDate) INTO 256 BUCKETS;
>>
>> Since the data is hash partitioned based on the 'UserId' column, buckets
>> were created based on the hash value of 'UserId'.
>>
>> Now, when I issue a Select query to fetch all the posts by a particular
>> 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan
>> only the bucket to which 'UserId' is hashed to?. But, when I run this query,
>> I could see all the buckets being searched for the UserId.
>>
>> Moreover, I see that's there is a way to sample the table based on the
>> buckets. Why can't hive automatically figure out the bucket to which UserId
>> is hashed to and search only in that bucket?
>>
>> Can someone clarify me on this?
>>
>> Thanks,
>> Deepak
>>
>>
>

Re: Selecting data based on the clustered columns

Posted by Deepak A <de...@gmail.com>.

Hi Namit,
Thanks a lot on the update.
Will do that for sure.

-Deepak

On Thu, Jul 16, 2009 at 7:49 PM, Namit Jain <nj...@facebook.com> wrote:

>  Right now, bucketing information is not used in a lot of places – it is
> only used in sampling.
> For eg:
>
> If your query was:
>
> Select .. From Posts(tablesample 1 out of 256) a;
>
> Then only the first bucket will be scanned.
>
> Your query can be optimized, but currently it is not. Can you file a jira
> on that ?
> It will help us prioritize this.
>
>
>
> -namit
>
>
>
> On 7/16/09 3:25 AM, "Deepak A" <de...@gmail.com> wrote:
>
> Hi,
>
> I have the following table in Hive
> Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate)
> INTO 256 BUCKETS;
>
> Since the data is hash partitioned based on the 'UserId' column, buckets
> were created based on the hash value of 'UserId'.
>
> Now, when I issue a Select query to fetch all the posts by a particular
> 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan
> only the bucket to which 'UserId' is hashed to?. But, when I run this query,
> I could see all the buckets being searched for the UserId.
>
> Moreover, I see that's there is a way to sample the table based on the
> buckets. Why can't hive automatically figure out the bucket to which UserId
> is hashed to and search only in that bucket?
>
> Can someone clarify me on this?
>
> Thanks,
> Deepak
>
>

Re: Selecting data based on the clustered columns

Posted by Namit Jain <nj...@facebook.com>.

Right now, bucketing information is not used in a lot of places - it is only used in sampling.
For eg:

If your query was:

Select .. From Posts(tablesample 1 out of 256) a;

Then only the first bucket will be scanned.

Your query can be optimized, but currently it is not. Can you file a jira on that ?
It will help us prioritize this.



-namit


On 7/16/09 3:25 AM, "Deepak A" <de...@gmail.com> wrote:

Hi,

I have the following table in Hive
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate) INTO 256 BUCKETS;

Since the data is hash partitioned based on the 'UserId' column, buckets were created based on the hash value of 'UserId'.

Now, when I issue a Select query to fetch all the posts by a particular 'UserId ' (say, Select count(Id) from Posts where UserId=1), does it scan only the bucket to which 'UserId' is hashed to?. But, when I run this query, I could see all the buckets being searched for the UserId.

Moreover, I see that's there is a way to sample the table based on the buckets. Why can't hive automatically figure out the bucket to which UserId is hashed to and search only in that bucket?

Can someone clarify me on this?

Thanks,
Deepak