You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by md...@orange.com on 2012/01/26 12:23:32 UTC

Last value for a column

Dear all,
I am struggling with a Hive query where I am trying to get the last value for a column.

Let say I have a table T with three columns: user_id, time, colour and I want to know for each user_id what is its last colour value.

At the moment I am using the following (naïve) query:

SELECT T1.user_id, T1.lastTime, T2.colour FROM (
              SELECT user_id, max(Time) AS lastTime FROM T GROUP BY user_id ) T1
             JOIN T T2 ON (T1.user_id=T2.user_id AND T2.time=T1.lastTime)

I am not happy with this query of course since I have to read the table T twice.

So I have three questions:

1)      Is there a way to re-write this query in a more efficient way ?



2)      Wouldn't that be better to  stream the output of a distributed by on user_id into a python script and print out only one row per user_id  (assuming  sorted by user_id and time) ?





3)      Couldn't I create the table T with cluster by on user_id and sorted by time at the first place and then use this structure to extract the rows I am interested in ?


Many thanks for your help,
Michael

_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this message was modified, changed or falsified.
Thank you.

Re: Last value for a column

Posted by Igor Tatarinov <ig...@decide.com>.

I don't think there is a better way to implement your query using the
standard SQL/Hive.

A python reducer (or a java UDF) is the way to go.

I don't think clustering would help since there is no way to specify what
you want in HiveQL alone.

igor
decide.com

On Thu, Jan 26, 2012 at 3:23 AM, <md...@orange.com> wrote:

> Dear all,****
>
> I am struggling with a Hive query where I am trying to get the last value
> for a column.****
>
> ** **
>
> Let say I have a table T with three columns: user_id, time, colour and I
> want to know for each user_id what is its last colour value.****
>
> ** **
>
> At the moment I am using the following (naïve) query:****
>
> ** **
>
> SELECT T1.user_id, T1.lastTime, T2.colour FROM (****
>
>               SELECT user_id, max(Time) AS lastTime FROM T GROUP BY
> user_id ) T1****
>
>              JOIN T T2 ON (T1.user_id=T2.user_id AND T2.time=T1.lastTime)*
> ***
>
>                                        ****
>
> I am not happy with this query of course since I have to read the table T
> twice. ****
>
> ** **
>
> So I have three questions:****
>
> **1)      **Is there a way to re-write this query in a more efficient way
> ?****
>
> ** **
>
> **2)      **Wouldn’t that be better to  stream the output of a
> distributed by on user_id into a python script and print out only one row
> per user_id  (assuming  sorted by user_id and time) ?****
>
> ** **
>
> ** **
>
> **3)      **Couldn’t I create the table T with cluster by on user_id and
> sorted by time at the first place and then use this structure to extract
> the rows I am interested in ?****
>
> ** **
>
> ** **
>
> Many thanks for your help,****
>
> Michael****
>
> _________________________________________________________________________________________________________________________
>
> Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
> France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci
>
> This message and its attachments may contain confidential or privileged information that may be protected by law;
> they should not be distributed, used or copied without authorization.
> If you have received this email in error, please notify the sender and delete this message and its attachments.
> As emails may be altered, France Telecom - Orange shall not be liable if this message was modified, changed or falsified.
> Thank you.
>
>

RE: Last value for a column

Posted by Steven Wong <sw...@netflix.com>.

Other than writing a custom UDAF or TRANSFORM script, a somewhat ugly way is something like:

SELECT user_id, split(max(concat(time, '_', colour)), '_')[1]
FROM T
GROUP BY user_id

From: mdefoinplatel.ext@orange.com [mailto:mdefoinplatel.ext@orange.com]
Sent: Thursday, January 26, 2012 3:24 AM
To: user@hive.apache.org
Subject: Last value for a column

Dear all,
I am struggling with a Hive query where I am trying to get the last value for a column.

Let say I have a table T with three columns: user_id, time, colour and I want to know for each user_id what is its last colour value.

At the moment I am using the following (naïve) query:

SELECT T1.user_id, T1.lastTime, T2.colour FROM (
              SELECT user_id, max(Time) AS lastTime FROM T GROUP BY user_id ) T1
             JOIN T T2 ON (T1.user_id=T2.user_id AND T2.time=T1.lastTime)

I am not happy with this query of course since I have to read the table T twice.

So I have three questions:

1)      Is there a way to re-write this query in a more efficient way ?

2)      Wouldn't that be better to  stream the output of a distributed by on user_id into a python script and print out only one row per user_id  (assuming  sorted by user_id and time) ?

3)      Couldn't I create the table T with cluster by on user_id and sorted by time at the first place and then use this structure to extract the rows I am interested in ?

Many thanks for your help,
Michael

_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc

pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler

a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,

France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci

This message and its attachments may contain confidential or privileged information that may be protected by law;

they should not be distributed, used or copied without authorization.

If you have received this email in error, please notify the sender and delete this message and its attachments.

As emails may be altered, France Telecom - Orange shall not be liable if this message was modified, changed or falsified.

Thank you.

Re: Last value for a column

Posted by ameet chaubal <am...@yahoo.com>.

Just recently, a new way of doing windowing functionality was posted at:
https://github.com/hbutani/SQLWindowing 

This is quite comprehensive and includes about 16 functions.
This is an approach to solve HIVE-896 which is the issue about Lag/Lead etc functions.
There is a detailed document about the approach and code.

 
Sincerely, Ameet


________________________________
 From: "mdefoinplatel.ext@orange.com" <md...@orange.com>
To: "user@hive.apache.org" <us...@hive.apache.org> 
Sent: Thursday, January 26, 2012 6:23 AM
Subject: Last value for a column
 

Dear all,
I am struggling with a Hive query where I am trying to get the last value for a column.
 
Let say I have a table T with three columns: user_id, time, colour and I want to know for each user_id what is its last colour value.
 
At the moment I am using the following (naïve) query:
 
SELECT T1.user_id, T1.lastTime, T2.colour FROM (
              SELECT user_id, max(Time) AS lastTime FROM T GROUP BY user_id ) T1
             JOIN T T2 ON (T1.user_id=T2.user_id AND T2.time=T1.lastTime)
                                       
I am not happy with this query of course since I have to read the table T twice. 
 
So I have three questions:
1)      Is there a way to re-write this query in a more efficient way ?
 
2)      Wouldn’t that be better to  stream the output of a distributed by on user_id into a python script and print out only one row per user_id  (assuming  sorted by user_id and time) ?
 
 
3)      Couldn’t I create the table T with cluster by on user_id and sorted by time at the first place and then use this structure to extract the rows I am interested in ?
 
 
Many thanks for your help,
Michael
_________________________________________________________________________________________________________________________ Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this message was modified, changed or falsified.
Thank you.