You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Robin Verlangen <ro...@us2.nl> on 2012/09/20 13:36:17 UTC

Hive ignoring buckets when using dynamic where

Hi there,

We're working on some queries that use buckets to improve performance with
like 1000x. However we ran into a problem. When we use a fixed hardcoded
date it works fine:

SELECT * FROM standard_feed WHERE bdate='2012-09-01'
*Starts a job with 6 mappers, 2 reducers*

When we use it dynamically:
SELECT * FROM standard_feed WHERE bdate=to_date(unix_timestamp())
*Starts a job with 1000 mappers, 2 reducers*
*
*
What's the problem here? The result of the to_date of the current timestamp
should be equal to a normal fixed date? Does anyone have a solution?

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.

Re: Hive ignoring buckets when using dynamic where

Posted by Robin Verlangen <ro...@us2.nl>.

@Jamie:

I was trying this too with a view, like:
DROP VIEW IF EXISTS standard_today; CREATE VIEW standard_today AS SELECT *
FROM standard_feed WHERE bdate='2012-09-21';

However when I run select * FROM standard_today it starts iterating over
all data again.

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/20 Jamie Olson <ja...@gmail.com>

> This might be somewhat off-topic, but is the query optimizer smart enough
> to conditions like 'date2 = date_sub(date1,1)' execute efficiently?  I
> would certainly hope so when date1 is a constant, but what if this is a
> join condition?  If date is a partition or clustered column, will hive do
> the right thing?
>
> Jamie Olson
>
>
>
> On Thu, Sep 20, 2012 at 10:27 AM, Bejoy KS <be...@yahoo.com> wrote:
>
>> Hi Robin
>>
>> AFAIK, one of the possible solutions is is to take the computation of
>> current date out of hive, may be to a shell script (you can have the
>> computation current - x days aslo from within a shell script i guess ).
>> Once you have identified the required date, store it in a variable and
>> substitute the same in your hive query.
>>
>> Some links for your reference
>> http://hive.apache.org/docs/r0.9.0/language_manual/var_substitution.html
>>
>> http://kickstarthadoop.blogspot.in/2011/10/include-values-during-execution-time-in.html
>>
>>
>> Regards,
>> Bejoy KS
>>
>>   ------------------------------
>> *From:* Robin Verlangen <ro...@us2.nl>
>> *To:* user@hive.apache.org; Bejoy KS <be...@yahoo.com>
>> *Sent:* Thursday, September 20, 2012 7:30 PM
>> *Subject:* Re: Hive ignoring buckets when using dynamic where
>>
>> Hi Bejoy,
>>
>> Thank you for your reply. Is there any way to fix my problem? I want to
>> have a query that has a dynamic range, from now (and in some cases now - x
>> days until now).
>>
>> Best regards,
>>
>> Robin Verlangen
>> *Software engineer*
>> *
>> *
>> W http://www.robinverlangen.nl
>> E robin@us2.nl
>>
>> Disclaimer: The information contained in this message and attachments is
>> intended solely for the attention and use of the named addressee and may be
>> confidential. If you are not the intended recipient, you are reminded that
>> the information remains the property of the sender. You must not use,
>> disclose, distribute, copy, print or rely on this e-mail. If you have
>> received this message in error, please contact the sender immediately and
>> irrevocably delete this message and any copies.
>>
>>
>>
>> 2012/9/20 Bejoy KS <be...@yahoo.com>
>>
>> Hi Robin
>>
>> The result of 'bdate=to_date(unix_timestamp())' is evaluated during the
>> runtime of the query. But the data that a query should process is
>> determined initially before executing the map reduce jobs. That is the
>> reason the query is running over whole data set.
>>
>> When you provide 'bdate='2012-09-01'' the hive parser knows initially
>> itself what data which all partitions should be taken into account. So this
>> query runs on only the required partitions and not on whole data.
>>
>> To add on , it is not the buckets considered here on where clause but the
>> partitions.
>>
>> Regards,
>> Bejoy KS
>>
>>   ------------------------------
>> *From:* Robin Verlangen <ro...@us2.nl>
>> *To:* user@hive.apache.org
>> *Sent:* Thursday, September 20, 2012 5:06 PM
>> *Subject:* Hive ignoring buckets when using dynamic where
>>
>> Hi there,
>>
>> We're working on some queries that use buckets to improve performance
>> with like 1000x. However we ran into a problem. When we use a fixed
>> hardcoded date it works fine:
>>
>>  SELECT * FROM standard_feed WHERE bdate='2012-09-01'
>> *Starts a job with 6 mappers, 2 reducers*
>>
>> When we use it dynamically:
>> SELECT * FROM standard_feed WHERE bdate=to_date(unix_timestamp())
>> *Starts a job with 1000 mappers, 2 reducers*
>> *
>> *
>> What's the problem here? The result of the to_date of the current
>> timestamp should be equal to a normal fixed date? Does anyone have a
>> solution?
>>
>> Best regards,
>>
>> Robin Verlangen
>> *Software engineer*
>> *
>> *
>> W http://www.robinverlangen.nl
>> E robin@us2.nl
>>
>> Disclaimer: The information contained in this message and attachments is
>> intended solely for the attention and use of the named addressee and may be
>> confidential. If you are not the intended recipient, you are reminded that
>> the information remains the property of the sender. You must not use,
>> disclose, distribute, copy, print or rely on this e-mail. If you have
>> received this message in error, please contact the sender immediately and
>> irrevocably delete this message and any copies.
>>
>>
>>
>>
>>
>>
>>
>

Re: Hive ignoring buckets when using dynamic where

Posted by Jamie Olson <ja...@gmail.com>.

This might be somewhat off-topic, but is the query optimizer smart enough
to conditions like 'date2 = date_sub(date1,1)' execute efficiently?  I
would certainly hope so when date1 is a constant, but what if this is a
join condition?  If date is a partition or clustered column, will hive do
the right thing?

Jamie Olson


On Thu, Sep 20, 2012 at 10:27 AM, Bejoy KS <be...@yahoo.com> wrote:

> Hi Robin
>
> AFAIK, one of the possible solutions is is to take the computation of
> current date out of hive, may be to a shell script (you can have the
> computation current - x days aslo from within a shell script i guess ).
> Once you have identified the required date, store it in a variable and
> substitute the same in your hive query.
>
> Some links for your reference
> http://hive.apache.org/docs/r0.9.0/language_manual/var_substitution.html
>
> http://kickstarthadoop.blogspot.in/2011/10/include-values-during-execution-time-in.html
>
>
> Regards,
> Bejoy KS
>
>   ------------------------------
> *From:* Robin Verlangen <ro...@us2.nl>
> *To:* user@hive.apache.org; Bejoy KS <be...@yahoo.com>
> *Sent:* Thursday, September 20, 2012 7:30 PM
> *Subject:* Re: Hive ignoring buckets when using dynamic where
>
> Hi Bejoy,
>
> Thank you for your reply. Is there any way to fix my problem? I want to
> have a query that has a dynamic range, from now (and in some cases now - x
> days until now).
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E robin@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
>
> 2012/9/20 Bejoy KS <be...@yahoo.com>
>
> Hi Robin
>
> The result of 'bdate=to_date(unix_timestamp())' is evaluated during the
> runtime of the query. But the data that a query should process is
> determined initially before executing the map reduce jobs. That is the
> reason the query is running over whole data set.
>
> When you provide 'bdate='2012-09-01'' the hive parser knows initially
> itself what data which all partitions should be taken into account. So this
> query runs on only the required partitions and not on whole data.
>
> To add on , it is not the buckets considered here on where clause but the
> partitions.
>
> Regards,
> Bejoy KS
>
>   ------------------------------
> *From:* Robin Verlangen <ro...@us2.nl>
> *To:* user@hive.apache.org
> *Sent:* Thursday, September 20, 2012 5:06 PM
> *Subject:* Hive ignoring buckets when using dynamic where
>
> Hi there,
>
> We're working on some queries that use buckets to improve performance with
> like 1000x. However we ran into a problem. When we use a fixed hardcoded
> date it works fine:
>
>  SELECT * FROM standard_feed WHERE bdate='2012-09-01'
> *Starts a job with 6 mappers, 2 reducers*
>
> When we use it dynamically:
> SELECT * FROM standard_feed WHERE bdate=to_date(unix_timestamp())
> *Starts a job with 1000 mappers, 2 reducers*
> *
> *
> What's the problem here? The result of the to_date of the current
> timestamp should be equal to a normal fixed date? Does anyone have a
> solution?
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E robin@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
>
>
>
>
>

Re: Hive ignoring buckets when using dynamic where

Posted by Bejoy KS <be...@yahoo.com>.

Hi Robin

AFAIK, one of the possible solutions is is to take the computation of current date out of hive, may be to a shell script (you can have the computation current - x days aslo from within a shell script i guess ). Once you have identified the required date, store it in a variable and substitute the same in your hive query.  

Some links for your reference
http://hive.apache.org/docs/r0.9.0/language_manual/var_substitution.html 
http://kickstarthadoop.blogspot.in/2011/10/include-values-during-execution-time-in.html


 

Regards,
Bejoy KS


________________________________
 From: Robin Verlangen <ro...@us2.nl>
To: user@hive.apache.org; Bejoy KS <be...@yahoo.com> 
Sent: Thursday, September 20, 2012 7:30 PM
Subject: Re: Hive ignoring buckets when using dynamic where
 

Hi Bejoy,

Thank you for your reply. Is there any way to fix my problem? I want to have a query that has a dynamic range, from now (and in some cases now - x days until now).

Best regards, 


Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.



2012/9/20 Bejoy KS <be...@yahoo.com>

Hi Robin
>
>
>The result of 'bdate=to_date(unix_timestamp())' is evaluated during the runtime of the query. But the data that a query should process is determined initially before executing the map reduce jobs. That is the reason the query is running over whole data set.
>
>
>When you provide 'bdate='2012-09-01'' the hive parser knows initially itself what data which all partitions should be taken into account. So this query runs on only the required partitions and not on whole data.
>
>
>To add on , it is not the buckets considered here on where clause but the partitions.  
> 
>Regards,
>Bejoy KS
>
>
>
>________________________________
> From: Robin Verlangen <ro...@us2.nl>
>To: user@hive.apache.org 
>Sent: Thursday, September 20, 2012 5:06 PM
>Subject: Hive ignoring buckets when using dynamic where
> 
>
>
>Hi there,
>
>
>We're working on some queries that use buckets to improve performance with like 1000x. However we ran into a problem. When we use a fixed hardcoded date it works fine:
>
>
>SELECT * FROM standard_feed WHERE bdate='2012-09-01'
>Starts a job with 6 mappers, 2 reducers
>
>
>When we use it dynamically:
>SELECT * FROM standard_feed WHERE bdate=to_date(unix_timestamp())
>Starts a job with 1000 mappers, 2 reducers
>
>
>What's the problem here? The result of the to_date of the current timestamp should be equal to a normal fixed date? Does anyone have a solution?
>
>Best regards, 
>
>
>
>Robin Verlangen
>Software engineer
>
>
>W http://www.robinverlangen.nl
>E robin@us2.nl
>
>
>Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
>
>
>

Re: Hive ignoring buckets when using dynamic where

Posted by Robin Verlangen <ro...@us2.nl>.

Hi Bejoy,

Thank you for your reply. Is there any way to fix my problem? I want to
have a query that has a dynamic range, from now (and in some cases now - x
days until now).

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/20 Bejoy KS <be...@yahoo.com>

> Hi Robin
>
> The result of 'bdate=to_date(unix_timestamp())' is evaluated during the
> runtime of the query. But the data that a query should process is
> determined initially before executing the map reduce jobs. That is the
> reason the query is running over whole data set.
>
> When you provide 'bdate='2012-09-01'' the hive parser knows initially
> itself what data which all partitions should be taken into account. So this
> query runs on only the required partitions and not on whole data.
>
> To add on , it is not the buckets considered here on where clause but the
> partitions.
>
> Regards,
> Bejoy KS
>
>   ------------------------------
> *From:* Robin Verlangen <ro...@us2.nl>
> *To:* user@hive.apache.org
> *Sent:* Thursday, September 20, 2012 5:06 PM
> *Subject:* Hive ignoring buckets when using dynamic where
>
> Hi there,
>
> We're working on some queries that use buckets to improve performance with
> like 1000x. However we ran into a problem. When we use a fixed hardcoded
> date it works fine:
>
> SELECT * FROM standard_feed WHERE bdate='2012-09-01'
> *Starts a job with 6 mappers, 2 reducers*
>
> When we use it dynamically:
> SELECT * FROM standard_feed WHERE bdate=to_date(unix_timestamp())
> *Starts a job with 1000 mappers, 2 reducers*
> *
> *
> What's the problem here? The result of the to_date of the current
> timestamp should be equal to a normal fixed date? Does anyone have a
> solution?
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E robin@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
>
>

Re: Hive ignoring buckets when using dynamic where

Posted by Bejoy KS <be...@yahoo.com>.

Hi Robin

The result of 'bdate=to_date(unix_timestamp())' is evaluated during the runtime of the query. But the data that a query should process is determined initially before executing the map reduce jobs. That is the reason the query is running over whole data set.

When you provide 'bdate='2012-09-01'' the hive parser knows initially itself what data which all partitions should be taken into account. So this query runs on only the required partitions and not on whole data.

To add on , it is not the buckets considered here on where clause but the partitions.  
 
Regards,
Bejoy KS


________________________________
 From: Robin Verlangen <ro...@us2.nl>
To: user@hive.apache.org 
Sent: Thursday, September 20, 2012 5:06 PM
Subject: Hive ignoring buckets when using dynamic where
 

Hi there,

We're working on some queries that use buckets to improve performance with like 1000x. However we ran into a problem. When we use a fixed hardcoded date it works fine:

SELECT * FROM standard_feed WHERE bdate='2012-09-01'
Starts a job with 6 mappers, 2 reducers

When we use it dynamically:
SELECT * FROM standard_feed WHERE bdate=to_date(unix_timestamp())
Starts a job with 1000 mappers, 2 reducers

What's the problem here? The result of the to_date of the current timestamp should be equal to a normal fixed date? Does anyone have a solution?

Best regards, 


Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.