You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by comptech geeky <co...@gmail.com> on 2012/07/19 21:59:55 UTC

Something wrong with my query to get TOP 3?

This is the below data in my Table1


BID       PID                       TIME
--------------+-------------------------+--------------------------------
1345653   330760137950       2012-07-09 21:42:29
1345653   330760137950       2012-07-09 21:43:29
1345653   330760137950       2012-07-09 21:40:29
1345653   330760137950       2012-07-09 21:41:29
1345653   110909316904       2012-07-09 21:29:06
1345653   221065796761       2012-07-09 19:31:48

So If I need to clarify the above scenario- I have data in above table like
this-
For USER *`1345653` *I have this PID `*330760137950` *four times but with
different timestamps in red color. So I need the output something like this-

Output that I need:-

*1345653    330760137950       2012-07-09 21:43:29 *
*1345653    330760137950       2012-07-09 21:42:29 *
*1345653    330760137950       2012-07-09 21:41:29*
1345653    110909316904       2012-07-09 21:29:06
1345653    221065796761       2012-07-09 19:31:48

So Basically If BID and PID are same but with different timestamps, then I
need TOP 3 sorted with TIME in descending order

And for this I created rank UDF (User Defined Function). And I wrote the
below query but its not working for me. Can anyone help me on this?
*
*
*
*
*SELECT buyer_id, item_id, created_time*
*    FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) =
'2012-07-09' AND rank(buyer_id) < 3*
*    DISTRIBUTE BY buyer_id*
*    SORT BY buyer_id, created_time desc*

Re: Something wrong with my query to get TOP 3?

Posted by Jasper Knulst <ja...@incentro.com>.
Hi,

I more or less had the same problem and finally got it down by introducing
a second subquery. This will guarantee that the rank function is invoked
on the reduce phase and that the rank results are properly sorted.


I guess something like this:

*SELECT bid, pid, rank FROM *
      *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
*
  *
           ( SELECT bid, pid, time FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09' *
*            DISTRIBUTE BY bid,pid*
*            SORT BY bid,pid, time desc) A
*
*
      ) B
WHERE rank < 3;*

The A. query is carried out at the map phase and makes sure the record
arrive in the proper order at the reducer
The B. query takes care of applying the rank function (aliased as rank)
The outer most query (a second MR job) takes care of filtering the TOP-3

Maybe this will not run at once but hope you get the idea for the proper
set up. I modelled it after a query that I had to run, which finally
returned the proper TOP-10 for me.


Jasper



2012/7/20 comptech geeky <co...@gmail.com>

> Can anyone help me with this? I have tried other options by tweaking the
> query also. I am not able to achieve my expected output.
>
>
>
> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <co...@gmail.com>wrote:
>
>> Modified Query that I wrote and its not working as expected output is.
>>
>> *
>> *
>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>> *FROM (*
>> *    SELECT bid, pid, time*
>> *    FROM table1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>> '2012-07-09'*
>> *    DISTRIBUTE BY bid,pid,time*
>> *    SORT BY bid, time desc*
>> *) a*
>> *WHERE rank(bid) < 3;*
>>
>>
>>
>>
>>
>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> I wrote this query after modifying it-
>>>
>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>> UNIX_TIMESTAMP(created_time)*
>>> *FROM (*
>>> *    SELECT buyer_id, item_id, created_time*
>>> *    FROM testingtable1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>> int))) = '2012-07-09'*
>>> *    DISTRIBUTE BY buyer_id,item_id*
>>> *    SORT BY buyer_id, created_time desc*
>>> *) a*
>>> *WHERE rank(buyer_id) < 3;*
>>>
>>> And the output I got is which is sligtly wrong as it is missing two rows-
>>>
>>> *1345653    330760137950       2012-07-09 21:43:29*
>>> *1345653    330760137950       2012-07-09 21:42:29*
>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>
>>> These two rows are missing-
>>>
>>> 1345653    110909316904       2012-07-09 21:29:06
>>> 1345653    221065796761       2012-07-09 19:31:48
>>>
>>>  So full output should be like this-
>>>
>>> *1345653    330760137950       2012-07-09 21:43:29*
>>> *1345653    330760137950       2012-07-09 21:42:29*
>>> *1345653    330760137950       2012-07-09 21:41:29*
>>> 1345653    110909316904       2012-07-09 21:29:06
>>> 1345653    221065796761       2012-07-09 19:31:48
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> Can you show me the exact query that I need to do for this particular
>>>> problem consideing my scenario? It will be of great help to me. As I am new
>>>> to HiveQL.
>>>>
>>>> I need TOP 3 for those if BID and PID gets matched but with different
>>>> timestamp.
>>>>
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>> philip.j.tromans@gmail.com> wrote:
>>>>
>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>
>>>>> Phil.
>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This is the below data in my Table1
>>>>>>
>>>>>>
>>>>>> BID       PID                       TIME
>>>>>>
>>>>>> --------------+-------------------------+--------------------------------
>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>> table like this-
>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>>>>> with different timestamps in red color. So I need the output something like
>>>>>> this-
>>>>>>
>>>>>> Output that I need:-
>>>>>>
>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>>
>>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>>> *
>>>>>> *
>>>>>> *
>>>>>> *
>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>> *    FROM table1*
>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>> int))) = '2012-07-09'*
>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
Can anyone help me with this? I have tried other options by tweaking the
query also. I am not able to achieve my expected output.



On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <co...@gmail.com>wrote:

> Modified Query that I wrote and its not working as expected output is.
>
> *
> *
> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
> *FROM (*
> *    SELECT bid, pid, time*
> *    FROM table1*
> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
> '2012-07-09'*
> *    DISTRIBUTE BY bid,pid,time*
> *    SORT BY bid, time desc*
> *) a*
> *WHERE rank(bid) < 3;*
>
>
>
>
>
> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <co...@gmail.com>wrote:
>
>> I wrote this query after modifying it-
>>
>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>> UNIX_TIMESTAMP(created_time)*
>> *FROM (*
>> *    SELECT buyer_id, item_id, created_time*
>> *    FROM testingtable1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>> int))) = '2012-07-09'*
>> *    DISTRIBUTE BY buyer_id,item_id*
>> *    SORT BY buyer_id, created_time desc*
>> *) a*
>> *WHERE rank(buyer_id) < 3;*
>>
>> And the output I got is which is sligtly wrong as it is missing two rows-
>>
>> *1345653    330760137950       2012-07-09 21:43:29*
>> *1345653    330760137950       2012-07-09 21:42:29*
>> *1345653    330760137950       2012-07-09 21:41:29*
>>
>> These two rows are missing-
>>
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>>  So full output should be like this-
>>
>> *1345653    330760137950       2012-07-09 21:43:29*
>> *1345653    330760137950       2012-07-09 21:42:29*
>> *1345653    330760137950       2012-07-09 21:41:29*
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>>
>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> Can you show me the exact query that I need to do for this particular
>>> problem consideing my scenario? It will be of great help to me. As I am new
>>> to HiveQL.
>>>
>>> I need TOP 3 for those if BID and PID gets matched but with different
>>> timestamp.
>>>
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>> philip.j.tromans@gmail.com> wrote:
>>>
>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>
>>>> Phil.
>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>> wrote:
>>>>
>>>>> This is the below data in my Table1
>>>>>
>>>>>
>>>>> BID       PID                       TIME
>>>>>
>>>>> --------------+-------------------------+--------------------------------
>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>
>>>>> So If I need to clarify the above scenario- I have data in above table
>>>>> like this-
>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>>>> with different timestamps in red color. So I need the output something like
>>>>> this-
>>>>>
>>>>> Output that I need:-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>
>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>> *
>>>>> *
>>>>> *
>>>>> *
>>>>> *SELECT buyer_id, item_id, created_time*
>>>>> *    FROM table1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>> int))) = '2012-07-09'*
>>>>> *    DISTRIBUTE BY buyer_id*
>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>
>>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by Jasper Knulst <ja...@incentro.com>.
Hi,

I more or less had the same problem and finally got it down by introducing
a second subquery. This will guarantee that the rank function is invoked
on the reduce phase and that the rank results are properly sorted.


I guess something like this:

*SELECT bid, pid, rank FROM *
      *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
*
  *
           ( SELECT bid, pid, time FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09' *
*            DISTRIBUTE BY bid,pid*
*            SORT BY bid,pid, time desc) A
*
*
      ) B
WHERE rank < 3;*

The A. query is carried out at the map phase and makes sure the record
arrive in the proper order at the reducer
The B. query takes care of applying the rank function (aliased as rank)
The outer most query (a second MR job) takes care of filtering the TOP-3

Maybe this will not run at once but hope you get the idea for the proper
set up. I modelled it after a query that I had to run, which finally
returned the proper TOP-10 for me.





Jasper

>
>
>
> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <co...@gmail.com>wrote:
>
>> I wrote this query after modifying it-
>>
>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>> UNIX_TIMESTAMP(created_time)*
>> *FROM (*
>> *    SELECT buyer_id, item_id, created_time*
>> *    FROM testingtable1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>> int))) = '2012-07-09'*
>> *    DISTRIBUTE BY buyer_id,item_id*
>> *    SORT BY buyer_id, created_time desc*
>> *) a*
>> *WHERE rank(buyer_id) < 3;*
>>
>> And the output I got is which is sligtly wrong as it is missing two rows-
>>
>> *1345653    330760137950       2012-07-09 21:43:29*
>> *1345653    330760137950       2012-07-09 21:42:29*
>> *1345653    330760137950       2012-07-09 21:41:29*
>>
>> These two rows are missing-
>>
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>>  So full output should be like this-
>>
>> *1345653    330760137950       2012-07-09 21:43:29*
>> *1345653    330760137950       2012-07-09 21:42:29*
>> *1345653    330760137950       2012-07-09 21:41:29*
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>>
>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> Can you show me the exact query that I need to do for this particular
>>> problem consideing my scenario? It will be of great help to me. As I am new
>>> to HiveQL.
>>>
>>> I need TOP 3 for those if BID and PID gets matched but with different
>>> timestamp.
>>>
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>> philip.j.tromans@gmail.com> wrote:
>>>
>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>
>>>> Phil.
>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>> wrote:
>>>>
>>>>> This is the below data in my Table1
>>>>>
>>>>>
>>>>> BID       PID                       TIME
>>>>>
>>>>> --------------+-------------------------+--------------------------------
>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>
>>>>> So If I need to clarify the above scenario- I have data in above table
>>>>> like this-
>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>>>> with different timestamps in red color. So I need the output something like
>>>>> this-
>>>>>
>>>>> Output that I need:-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>
>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>> *
>>>>> *
>>>>> *
>>>>> *
>>>>> *SELECT buyer_id, item_id, created_time*
>>>>> *    FROM table1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>> int))) = '2012-07-09'*
>>>>> *    DISTRIBUTE BY buyer_id*
>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>
>>>>>
>>>
>>
>


--

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
Thanks Jasper for replying back. I have mentioned my use case in my first
email. And also I have already wrote the HiveQL query with the rank
function working but it is not giving me the exact output that I am
supposed to get from the query.




On Thu, Jul 19, 2012 at 3:53 PM, Jasper Knulst
<ja...@incentro.com>wrote:

> I am not really aware of your use case.
> Play around with it. At least the rank function is now properly applied.
>
>
> Maybe, remove pid from the DISTRIBUTE and de SORT clauses ??
>
> Jasper
>
>
>
>
> 2012/7/20 comptech geeky <co...@gmail.com>
>
>> Hi Igor,
>>
>> I am not sure what I have to remove from Distribute By as in distribute
>> by we have bid, pid and you said remove bid and time from distribute by and
>> it doesn't have time
>>
>> *SELECT bid, pid, rank FROM *
>>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
>> FROM
>> *
>>   *
>>            ( SELECT bid, pid, time FROM table1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>> '2012-07-09' *
>> *            DISTRIBUTE BY bid,pid*
>> *            SORT BY bid,pid, time desc) A
>> *
>> *
>>       ) B
>> *
>> *WHERE rank < 3;*
>>
>>
>> And also I tried running the above query as it is. I am not getting
>> expected output instead of that I am getting output like this which is
>> wrong If you compare my expected output with the below output-
>>
>>  *1345653 110909316904    0*
>> *1345653 221065796761    1*
>> *1345653 330760137950    2*
>>
>>
>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com> wrote:
>>
>>> Remove pid,time from DISTRIBUTE BY.
>>>
>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> Modified Query that I wrote and its not working as expected output is.
>>>>
>>>> *
>>>> *
>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>>> *FROM (*
>>>> *    SELECT bid, pid, time*
>>>> *    FROM table1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>> '2012-07-09'*
>>>> *    DISTRIBUTE BY bid,pid,time*
>>>> *    SORT BY bid, time desc*
>>>> *) a*
>>>> *WHERE rank(bid) < 3;*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <
>>>> comptechgeeky@gmail.com> wrote:
>>>>
>>>>> I wrote this query after modifying it-
>>>>>
>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>>> UNIX_TIMESTAMP(created_time)*
>>>>> *FROM (*
>>>>> *    SELECT buyer_id, item_id, created_time*
>>>>> *    FROM testingtable1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>> int))) = '2012-07-09'*
>>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>>> *    SORT BY buyer_id, created_time desc*
>>>>> *) a*
>>>>> *WHERE rank(buyer_id) < 3;*
>>>>>
>>>>> And the output I got is which is sligtly wrong as it is missing two
>>>>> rows-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>
>>>>> These two rows are missing-
>>>>>
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>>  So full output should be like this-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>>> comptechgeeky@gmail.com> wrote:
>>>>>
>>>>>> Can you show me the exact query that I need to do for this particular
>>>>>> problem consideing my scenario? It will be of great help to me. As I am new
>>>>>> to HiveQL.
>>>>>>
>>>>>> I need TOP 3 for those if BID and PID gets matched but with different
>>>>>> timestamp.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>>
>>>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>>>
>>>>>>> Phil.
>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This is the below data in my Table1
>>>>>>>>
>>>>>>>>
>>>>>>>> BID       PID                       TIME
>>>>>>>>
>>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>>
>>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>>> table like this-
>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>>> but with different timestamps in red color. So I need the output something
>>>>>>>> like this-
>>>>>>>>
>>>>>>>> Output that I need:-
>>>>>>>>
>>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>>
>>>>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>>>>
>>>>>>>> And for this I created rank UDF (User Defined Function). And I
>>>>>>>> wrote the below query but its not working for me. Can anyone help me on
>>>>>>>> this?
>>>>>>>> *
>>>>>>>> *
>>>>>>>> *
>>>>>>>> *
>>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>>> *    FROM table1*
>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>>>> int))) = '2012-07-09'*
>>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
>
> Met vriendelijke groet,
>
> *Jasper Knulst*
> Consultant *|* Incentro Business Intelligence
>
> Gildeweg 5b, Nootdorp
> The Netherlands, 2632 BD
>
> *E:* jasper.knulst@incentro.com
> *T:* +31157640750
> *M: *+31619667511
> *W:* www.incentro.com
> [image: Logo Incentro]<http://www.incentro.com/nl/article/incentro-de-beste-werkgever-van-nederland>
>

Re: Something wrong with my query to get TOP 3?

Posted by Jasper Knulst <ja...@incentro.com>.
I am not really aware of your use case.
Play around with it. At least the rank function is now properly applied.


Maybe, remove pid from the DISTRIBUTE and de SORT clauses ??

Jasper



2012/7/20 comptech geeky <co...@gmail.com>

> Hi Igor,
>
> I am not sure what I have to remove from Distribute By as in distribute by
> we have bid, pid and you said remove bid and time from distribute by and it
> doesn't have time
>
> *SELECT bid, pid, rank FROM *
>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
> *
>   *
>            ( SELECT bid, pid, time FROM table1*
> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
> '2012-07-09' *
> *            DISTRIBUTE BY bid,pid*
> *            SORT BY bid,pid, time desc) A
> *
> *
>       ) B
> *
> *WHERE rank < 3;*
>
>
> And also I tried running the above query as it is. I am not getting
> expected output instead of that I am getting output like this which is
> wrong If you compare my expected output with the below output-
>
> *1345653 110909316904    0*
> *1345653 221065796761    1*
> *1345653 330760137950    2*
>
>
> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com> wrote:
>
>> Remove pid,time from DISTRIBUTE BY.
>>
>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> Modified Query that I wrote and its not working as expected output is.
>>>
>>> *
>>> *
>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>> *FROM (*
>>> *    SELECT bid, pid, time*
>>> *    FROM table1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>> '2012-07-09'*
>>> *    DISTRIBUTE BY bid,pid,time*
>>> *    SORT BY bid, time desc*
>>> *) a*
>>> *WHERE rank(bid) < 3;*
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> I wrote this query after modifying it-
>>>>
>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>> UNIX_TIMESTAMP(created_time)*
>>>> *FROM (*
>>>> *    SELECT buyer_id, item_id, created_time*
>>>> *    FROM testingtable1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>> int))) = '2012-07-09'*
>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>> *    SORT BY buyer_id, created_time desc*
>>>> *) a*
>>>> *WHERE rank(buyer_id) < 3;*
>>>>
>>>> And the output I got is which is sligtly wrong as it is missing two
>>>> rows-
>>>>
>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>
>>>> These two rows are missing-
>>>>
>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>
>>>>  So full output should be like this-
>>>>
>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>> comptechgeeky@gmail.com> wrote:
>>>>
>>>>> Can you show me the exact query that I need to do for this particular
>>>>> problem consideing my scenario? It will be of great help to me. As I am new
>>>>> to HiveQL.
>>>>>
>>>>> I need TOP 3 for those if BID and PID gets matched but with different
>>>>> timestamp.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>
>>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>>
>>>>>> Phil.
>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is the below data in my Table1
>>>>>>>
>>>>>>>
>>>>>>> BID       PID                       TIME
>>>>>>>
>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>> table like this-
>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>> but with different timestamps in red color. So I need the output something
>>>>>>> like this-
>>>>>>>
>>>>>>> Output that I need:-
>>>>>>>
>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>>>
>>>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>>>> *
>>>>>>> *
>>>>>>> *
>>>>>>> *
>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>> *    FROM table1*
>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>>> int))) = '2012-07-09'*
>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

Met vriendelijke groet,

*Jasper Knulst*
Consultant *|* Incentro Business Intelligence

Gildeweg 5b, Nootdorp
The Netherlands, 2632 BD

*E:* jasper.knulst@incentro.com
*T:* +31157640750
*M: *+31619667511
*W:* www.incentro.com
[image: Logo Incentro]<http://www.incentro.com/nl/article/incentro-de-beste-werkgever-van-nederland>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
I wrote this query few minutes back-

*select bid, pid, time from (*
*    select bid, pid, time, rank() over (partition by bid, pid order by
time desc) as k *
*    from table1 ) as x *
*where k  <=3*
*order by bid, pid, time desc*


Do you think this query will work with my Rank function that I provided
below?




On Thu, Jul 19, 2012 at 4:05 PM, comptech geeky <co...@gmail.com>wrote:

> Hi Igor,
>
> I am new to HiveQL world. Don't know that much basically. Currently I have
> my Rank UDF function like this-
>
> *public final class Rank extends UDF{*
> *    private int  counter;*
> *    private String last_key;*
> *    public int evaluate(final String key){*
> *  if ( !key.equalsIgnoreCase(this.last_key) ) {*
> *     this.counter = 0;*
> *     this.last_key = key;*
> *  }*
> *  return this.counter++;*
> *    }*
> *}*
> *
> *
> And I tried that query after removing pid from distribute by and sort by
> clause, but I got the below output which is wrong again-
>
> *1345653 330760137950    0*
> *1345653 330760137950    1*
> *1345653 330760137950    2*
>
> But I need output something like this-
>
> *1345653    330760137950       2012-07-09 21:43:29*
> *1345653    330760137950       2012-07-09 21:42:29*
> *1345653    330760137950       2012-07-09 21:41:29*
> 1345653    110909316904       2012-07-09 21:29:06
> 1345653    221065796761       2012-07-09 19:31:48
>
>
> Any help will be appreciated.
>
>
>
>
> On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov <ig...@decide.com> wrote:
>
>> Actually, never mind. Looks like you need to partition by both bid and
>> pid. In that case, your problem is that rank() has to handle a combined
>> bid+pid key. So first you need to create a combined key, partition by that
>> key and pass it to your rank() function (assuming rank() knows to reset on
>> a new key). You can cast bid and pid to string and concatenate them with a
>> separator (bid_pid) to get a single partitioning key. Hope this makes sense.
>>
>> On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <ig...@decide.com> wrote:
>>
>>> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT
>>> clauses.
>>> Your very first query was correct except for the nested subquery part.
>>> (You don't need a double-nested subquery.)
>>>
>>> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> Hi Igor,
>>>>
>>>> I am not sure what I have to remove from Distribute By as in distribute
>>>> by we have bid, pid and you said remove bid and time from distribute by and
>>>> it doesn't have time
>>>>
>>>> *SELECT bid, pid, rank FROM *
>>>>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
>>>> FROM
>>>> *
>>>>   *
>>>>            ( SELECT bid, pid, time FROM table1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>> '2012-07-09' *
>>>> *            DISTRIBUTE BY bid,pid*
>>>> *            SORT BY bid,pid, time desc) A
>>>> *
>>>> *
>>>>       ) B
>>>> *
>>>> *WHERE rank < 3;*
>>>>
>>>>
>>>> And also I tried running the above query as it is. I am not getting
>>>> expected output instead of that I am getting output like this which is
>>>> wrong If you compare my expected output with the below output-
>>>>
>>>>  *1345653 110909316904    0*
>>>> *1345653 221065796761    1*
>>>> *1345653 330760137950    2*
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com>wrote:
>>>>
>>>>> Remove pid,time from DISTRIBUTE BY.
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <
>>>>> comptechgeeky@gmail.com> wrote:
>>>>>
>>>>>> Modified Query that I wrote and its not working as expected output is.
>>>>>>
>>>>>> *
>>>>>> *
>>>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>>>>> *FROM (*
>>>>>> *    SELECT bid, pid, time*
>>>>>> *    FROM table1*
>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>>>> '2012-07-09'*
>>>>>> *    DISTRIBUTE BY bid,pid,time*
>>>>>> *    SORT BY bid, time desc*
>>>>>> *) a*
>>>>>> *WHERE rank(bid) < 3;*
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <
>>>>>> comptechgeeky@gmail.com> wrote:
>>>>>>
>>>>>>> I wrote this query after modifying it-
>>>>>>>
>>>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>>>>> UNIX_TIMESTAMP(created_time)*
>>>>>>> *FROM (*
>>>>>>> *    SELECT buyer_id, item_id, created_time*
>>>>>>> *    FROM testingtable1*
>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>>> int))) = '2012-07-09'*
>>>>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>> *) a*
>>>>>>> *WHERE rank(buyer_id) < 3;*
>>>>>>>
>>>>>>> And the output I got is which is sligtly wrong as it is missing two
>>>>>>> rows-
>>>>>>>
>>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>
>>>>>>> These two rows are missing-
>>>>>>>
>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>>  So full output should be like this-
>>>>>>>
>>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>>>>> comptechgeeky@gmail.com> wrote:
>>>>>>>
>>>>>>>> Can you show me the exact query that I need to do for this
>>>>>>>> particular problem consideing my scenario? It will be of great help to me.
>>>>>>>> As I am new to HiveQL.
>>>>>>>>
>>>>>>>> I need TOP 3 for those if BID and PID gets matched but with
>>>>>>>> different timestamp.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Your rank() is being evaluated map side. Put your distribute by
>>>>>>>>> and sort by in an inner query, and then evaluate your rank() in an outer
>>>>>>>>> query.
>>>>>>>>>
>>>>>>>>> Phil.
>>>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This is the below data in my Table1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BID       PID                       TIME
>>>>>>>>>>
>>>>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>>>>
>>>>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>>>>> table like this-
>>>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>>>>> but with different timestamps in red color. So I need the output something
>>>>>>>>>> like this-
>>>>>>>>>>
>>>>>>>>>> Output that I need:-
>>>>>>>>>>
>>>>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>>>>
>>>>>>>>>> So Basically If BID and PID are same but with different
>>>>>>>>>> timestamps, then I need TOP 3 sorted with TIME in descending order
>>>>>>>>>>
>>>>>>>>>> And for this I created rank UDF (User Defined Function). And I
>>>>>>>>>> wrote the below query but its not working for me. Can anyone help me on
>>>>>>>>>> this?
>>>>>>>>>> *
>>>>>>>>>> *
>>>>>>>>>> *
>>>>>>>>>> *
>>>>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>>>>> *    FROM table1*
>>>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time)
>>>>>>>>>> as int))) = '2012-07-09'*
>>>>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
Hi Igor,

I am new to HiveQL world. Don't know that much basically. Currently I have
my Rank UDF function like this-

*public final class Rank extends UDF{*
*    private int  counter;*
*    private String last_key;*
*    public int evaluate(final String key){*
*  if ( !key.equalsIgnoreCase(this.last_key) ) {*
*     this.counter = 0;*
*     this.last_key = key;*
*  }*
*  return this.counter++;*
*    }*
*}*
*
*
And I tried that query after removing pid from distribute by and sort by
clause, but I got the below output which is wrong again-

*1345653 330760137950    0*
*1345653 330760137950    1*
*1345653 330760137950    2*

But I need output something like this-

*1345653    330760137950       2012-07-09 21:43:29*
*1345653    330760137950       2012-07-09 21:42:29*
*1345653    330760137950       2012-07-09 21:41:29*
1345653    110909316904       2012-07-09 21:29:06
1345653    221065796761       2012-07-09 19:31:48


Any help will be appreciated.




On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov <ig...@decide.com> wrote:

> Actually, never mind. Looks like you need to partition by both bid and
> pid. In that case, your problem is that rank() has to handle a combined
> bid+pid key. So first you need to create a combined key, partition by that
> key and pass it to your rank() function (assuming rank() knows to reset on
> a new key). You can cast bid and pid to string and concatenate them with a
> separator (bid_pid) to get a single partitioning key. Hope this makes sense.
>
> On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <ig...@decide.com> wrote:
>
>> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
>> Your very first query was correct except for the nested subquery part.
>> (You don't need a double-nested subquery.)
>>
>> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> Hi Igor,
>>>
>>> I am not sure what I have to remove from Distribute By as in distribute
>>> by we have bid, pid and you said remove bid and time from distribute by and
>>> it doesn't have time
>>>
>>> *SELECT bid, pid, rank FROM *
>>>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
>>> FROM
>>> *
>>>   *
>>>            ( SELECT bid, pid, time FROM table1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>> '2012-07-09' *
>>> *            DISTRIBUTE BY bid,pid*
>>> *            SORT BY bid,pid, time desc) A
>>> *
>>> *
>>>       ) B
>>> *
>>> *WHERE rank < 3;*
>>>
>>>
>>> And also I tried running the above query as it is. I am not getting
>>> expected output instead of that I am getting output like this which is
>>> wrong If you compare my expected output with the below output-
>>>
>>>  *1345653 110909316904    0*
>>> *1345653 221065796761    1*
>>> *1345653 330760137950    2*
>>>
>>>
>>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com> wrote:
>>>
>>>> Remove pid,time from DISTRIBUTE BY.
>>>>
>>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <
>>>> comptechgeeky@gmail.com> wrote:
>>>>
>>>>> Modified Query that I wrote and its not working as expected output is.
>>>>>
>>>>> *
>>>>> *
>>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>>>> *FROM (*
>>>>> *    SELECT bid, pid, time*
>>>>> *    FROM table1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>>> '2012-07-09'*
>>>>> *    DISTRIBUTE BY bid,pid,time*
>>>>> *    SORT BY bid, time desc*
>>>>> *) a*
>>>>> *WHERE rank(bid) < 3;*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <
>>>>> comptechgeeky@gmail.com> wrote:
>>>>>
>>>>>> I wrote this query after modifying it-
>>>>>>
>>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>>>> UNIX_TIMESTAMP(created_time)*
>>>>>> *FROM (*
>>>>>> *    SELECT buyer_id, item_id, created_time*
>>>>>> *    FROM testingtable1*
>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>> int))) = '2012-07-09'*
>>>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>> *) a*
>>>>>> *WHERE rank(buyer_id) < 3;*
>>>>>>
>>>>>> And the output I got is which is sligtly wrong as it is missing two
>>>>>> rows-
>>>>>>
>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>
>>>>>> These two rows are missing-
>>>>>>
>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>>  So full output should be like this-
>>>>>>
>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>>>> comptechgeeky@gmail.com> wrote:
>>>>>>
>>>>>>> Can you show me the exact query that I need to do for this
>>>>>>> particular problem consideing my scenario? It will be of great help to me.
>>>>>>> As I am new to HiveQL.
>>>>>>>
>>>>>>> I need TOP 3 for those if BID and PID gets matched but with
>>>>>>> different timestamp.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>>>
>>>>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>>>>
>>>>>>>> Phil.
>>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is the below data in my Table1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BID       PID                       TIME
>>>>>>>>>
>>>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>>>
>>>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>>>> table like this-
>>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>>>> but with different timestamps in red color. So I need the output something
>>>>>>>>> like this-
>>>>>>>>>
>>>>>>>>> Output that I need:-
>>>>>>>>>
>>>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>>>
>>>>>>>>> So Basically If BID and PID are same but with different
>>>>>>>>> timestamps, then I need TOP 3 sorted with TIME in descending order
>>>>>>>>>
>>>>>>>>> And for this I created rank UDF (User Defined Function). And I
>>>>>>>>> wrote the below query but its not working for me. Can anyone help me on
>>>>>>>>> this?
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>>>> *    FROM table1*
>>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time)
>>>>>>>>> as int))) = '2012-07-09'*
>>>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by Igor Tatarinov <ig...@decide.com>.
Actually, never mind. Looks like you need to partition by both bid and pid.
In that case, your problem is that rank() has to handle a combined bid+pid
key. So first you need to create a combined key, partition by that key and
pass it to your rank() function (assuming rank() knows to reset on a new
key). You can cast bid and pid to string and concatenate them with a
separator (bid_pid) to get a single partitioning key. Hope this makes sense.

On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <ig...@decide.com> wrote:

> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
> Your very first query was correct except for the nested subquery part.
> (You don't need a double-nested subquery.)
>
> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <co...@gmail.com>wrote:
>
>> Hi Igor,
>>
>> I am not sure what I have to remove from Distribute By as in distribute
>> by we have bid, pid and you said remove bid and time from distribute by and
>> it doesn't have time
>>
>> *SELECT bid, pid, rank FROM *
>>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
>> FROM
>> *
>>   *
>>            ( SELECT bid, pid, time FROM table1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>> '2012-07-09' *
>> *            DISTRIBUTE BY bid,pid*
>> *            SORT BY bid,pid, time desc) A
>> *
>> *
>>       ) B
>> *
>> *WHERE rank < 3;*
>>
>>
>> And also I tried running the above query as it is. I am not getting
>> expected output instead of that I am getting output like this which is
>> wrong If you compare my expected output with the below output-
>>
>>  *1345653 110909316904    0*
>> *1345653 221065796761    1*
>> *1345653 330760137950    2*
>>
>>
>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com> wrote:
>>
>>> Remove pid,time from DISTRIBUTE BY.
>>>
>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> Modified Query that I wrote and its not working as expected output is.
>>>>
>>>> *
>>>> *
>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>>> *FROM (*
>>>> *    SELECT bid, pid, time*
>>>> *    FROM table1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>> '2012-07-09'*
>>>> *    DISTRIBUTE BY bid,pid,time*
>>>> *    SORT BY bid, time desc*
>>>> *) a*
>>>> *WHERE rank(bid) < 3;*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <
>>>> comptechgeeky@gmail.com> wrote:
>>>>
>>>>> I wrote this query after modifying it-
>>>>>
>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>>> UNIX_TIMESTAMP(created_time)*
>>>>> *FROM (*
>>>>> *    SELECT buyer_id, item_id, created_time*
>>>>> *    FROM testingtable1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>> int))) = '2012-07-09'*
>>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>>> *    SORT BY buyer_id, created_time desc*
>>>>> *) a*
>>>>> *WHERE rank(buyer_id) < 3;*
>>>>>
>>>>> And the output I got is which is sligtly wrong as it is missing two
>>>>> rows-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>
>>>>> These two rows are missing-
>>>>>
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>>  So full output should be like this-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>>> comptechgeeky@gmail.com> wrote:
>>>>>
>>>>>> Can you show me the exact query that I need to do for this particular
>>>>>> problem consideing my scenario? It will be of great help to me. As I am new
>>>>>> to HiveQL.
>>>>>>
>>>>>> I need TOP 3 for those if BID and PID gets matched but with different
>>>>>> timestamp.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>>
>>>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>>>
>>>>>>> Phil.
>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This is the below data in my Table1
>>>>>>>>
>>>>>>>>
>>>>>>>> BID       PID                       TIME
>>>>>>>>
>>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>>
>>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>>> table like this-
>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>>> but with different timestamps in red color. So I need the output something
>>>>>>>> like this-
>>>>>>>>
>>>>>>>> Output that I need:-
>>>>>>>>
>>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>>
>>>>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>>>>
>>>>>>>> And for this I created rank UDF (User Defined Function). And I
>>>>>>>> wrote the below query but its not working for me. Can anyone help me on
>>>>>>>> this?
>>>>>>>> *
>>>>>>>> *
>>>>>>>> *
>>>>>>>> *
>>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>>> *    FROM table1*
>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>>>> int))) = '2012-07-09'*
>>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by Igor Tatarinov <ig...@decide.com>.
Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
Your very first query was correct except for the nested subquery part. (You
don't need a double-nested subquery.)

On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <co...@gmail.com>wrote:

> Hi Igor,
>
> I am not sure what I have to remove from Distribute By as in distribute by
> we have bid, pid and you said remove bid and time from distribute by and it
> doesn't have time
>
> *SELECT bid, pid, rank FROM *
>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
> *
>   *
>            ( SELECT bid, pid, time FROM table1*
> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
> '2012-07-09' *
> *            DISTRIBUTE BY bid,pid*
> *            SORT BY bid,pid, time desc) A
> *
> *
>       ) B
> *
> *WHERE rank < 3;*
>
>
> And also I tried running the above query as it is. I am not getting
> expected output instead of that I am getting output like this which is
> wrong If you compare my expected output with the below output-
>
> *1345653 110909316904    0*
> *1345653 221065796761    1*
> *1345653 330760137950    2*
>
>
> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com> wrote:
>
>> Remove pid,time from DISTRIBUTE BY.
>>
>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> Modified Query that I wrote and its not working as expected output is.
>>>
>>> *
>>> *
>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>> *FROM (*
>>> *    SELECT bid, pid, time*
>>> *    FROM table1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>> '2012-07-09'*
>>> *    DISTRIBUTE BY bid,pid,time*
>>> *    SORT BY bid, time desc*
>>> *) a*
>>> *WHERE rank(bid) < 3;*
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> I wrote this query after modifying it-
>>>>
>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>> UNIX_TIMESTAMP(created_time)*
>>>> *FROM (*
>>>> *    SELECT buyer_id, item_id, created_time*
>>>> *    FROM testingtable1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>> int))) = '2012-07-09'*
>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>> *    SORT BY buyer_id, created_time desc*
>>>> *) a*
>>>> *WHERE rank(buyer_id) < 3;*
>>>>
>>>> And the output I got is which is sligtly wrong as it is missing two
>>>> rows-
>>>>
>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>
>>>> These two rows are missing-
>>>>
>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>
>>>>  So full output should be like this-
>>>>
>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>> comptechgeeky@gmail.com> wrote:
>>>>
>>>>> Can you show me the exact query that I need to do for this particular
>>>>> problem consideing my scenario? It will be of great help to me. As I am new
>>>>> to HiveQL.
>>>>>
>>>>> I need TOP 3 for those if BID and PID gets matched but with different
>>>>> timestamp.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>
>>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>>
>>>>>> Phil.
>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is the below data in my Table1
>>>>>>>
>>>>>>>
>>>>>>> BID       PID                       TIME
>>>>>>>
>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>> table like this-
>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>> but with different timestamps in red color. So I need the output something
>>>>>>> like this-
>>>>>>>
>>>>>>> Output that I need:-
>>>>>>>
>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>>>
>>>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>>>> *
>>>>>>> *
>>>>>>> *
>>>>>>> *
>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>> *    FROM table1*
>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>>> int))) = '2012-07-09'*
>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
Hi Igor,

I am not sure what I have to remove from Distribute By as in distribute by
we have bid, pid and you said remove bid and time from distribute by and it
doesn't have time

*SELECT bid, pid, rank FROM *
      *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
*
  *
           ( SELECT bid, pid, time FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09' *
*            DISTRIBUTE BY bid,pid*
*            SORT BY bid,pid, time desc) A
*
*
      ) B
*
*WHERE rank < 3;*


And also I tried running the above query as it is. I am not getting
expected output instead of that I am getting output like this which is
wrong If you compare my expected output with the below output-

*1345653 110909316904    0*
*1345653 221065796761    1*
*1345653 330760137950    2*


On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <ig...@decide.com> wrote:

> Remove pid,time from DISTRIBUTE BY.
>
> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <co...@gmail.com>wrote:
>
>> Modified Query that I wrote and its not working as expected output is.
>>
>> *
>> *
>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>> *FROM (*
>> *    SELECT bid, pid, time*
>> *    FROM table1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>> '2012-07-09'*
>> *    DISTRIBUTE BY bid,pid,time*
>> *    SORT BY bid, time desc*
>> *) a*
>> *WHERE rank(bid) < 3;*
>>
>>
>>
>>
>>
>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> I wrote this query after modifying it-
>>>
>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>> UNIX_TIMESTAMP(created_time)*
>>> *FROM (*
>>> *    SELECT buyer_id, item_id, created_time*
>>> *    FROM testingtable1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>> int))) = '2012-07-09'*
>>> *    DISTRIBUTE BY buyer_id,item_id*
>>> *    SORT BY buyer_id, created_time desc*
>>> *) a*
>>> *WHERE rank(buyer_id) < 3;*
>>>
>>> And the output I got is which is sligtly wrong as it is missing two rows-
>>>
>>> *1345653    330760137950       2012-07-09 21:43:29*
>>> *1345653    330760137950       2012-07-09 21:42:29*
>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>
>>> These two rows are missing-
>>>
>>> 1345653    110909316904       2012-07-09 21:29:06
>>> 1345653    221065796761       2012-07-09 19:31:48
>>>
>>>  So full output should be like this-
>>>
>>> *1345653    330760137950       2012-07-09 21:43:29*
>>> *1345653    330760137950       2012-07-09 21:42:29*
>>> *1345653    330760137950       2012-07-09 21:41:29*
>>> 1345653    110909316904       2012-07-09 21:29:06
>>> 1345653    221065796761       2012-07-09 19:31:48
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <comptechgeeky@gmail.com
>>> > wrote:
>>>
>>>> Can you show me the exact query that I need to do for this particular
>>>> problem consideing my scenario? It will be of great help to me. As I am new
>>>> to HiveQL.
>>>>
>>>> I need TOP 3 for those if BID and PID gets matched but with different
>>>> timestamp.
>>>>
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>> philip.j.tromans@gmail.com> wrote:
>>>>
>>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>>
>>>>> Phil.
>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This is the below data in my Table1
>>>>>>
>>>>>>
>>>>>> BID       PID                       TIME
>>>>>>
>>>>>> --------------+-------------------------+--------------------------------
>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>> table like this-
>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>>>>> with different timestamps in red color. So I need the output something like
>>>>>> this-
>>>>>>
>>>>>> Output that I need:-
>>>>>>
>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>>
>>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>>> *
>>>>>> *
>>>>>> *
>>>>>> *
>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>> *    FROM table1*
>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>> int))) = '2012-07-09'*
>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by Igor Tatarinov <ig...@decide.com>.
Remove pid,time from DISTRIBUTE BY.

On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <co...@gmail.com>wrote:

> Modified Query that I wrote and its not working as expected output is.
>
> *
> *
> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
> *FROM (*
> *    SELECT bid, pid, time*
> *    FROM table1*
> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
> '2012-07-09'*
> *    DISTRIBUTE BY bid,pid,time*
> *    SORT BY bid, time desc*
> *) a*
> *WHERE rank(bid) < 3;*
>
>
>
>
>
> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <co...@gmail.com>wrote:
>
>> I wrote this query after modifying it-
>>
>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>> UNIX_TIMESTAMP(created_time)*
>> *FROM (*
>> *    SELECT buyer_id, item_id, created_time*
>> *    FROM testingtable1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>> int))) = '2012-07-09'*
>> *    DISTRIBUTE BY buyer_id,item_id*
>> *    SORT BY buyer_id, created_time desc*
>> *) a*
>> *WHERE rank(buyer_id) < 3;*
>>
>> And the output I got is which is sligtly wrong as it is missing two rows-
>>
>> *1345653    330760137950       2012-07-09 21:43:29*
>> *1345653    330760137950       2012-07-09 21:42:29*
>> *1345653    330760137950       2012-07-09 21:41:29*
>>
>> These two rows are missing-
>>
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>>  So full output should be like this-
>>
>> *1345653    330760137950       2012-07-09 21:43:29*
>> *1345653    330760137950       2012-07-09 21:42:29*
>> *1345653    330760137950       2012-07-09 21:41:29*
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>>
>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <co...@gmail.com>wrote:
>>
>>> Can you show me the exact query that I need to do for this particular
>>> problem consideing my scenario? It will be of great help to me. As I am new
>>> to HiveQL.
>>>
>>> I need TOP 3 for those if BID and PID gets matched but with different
>>> timestamp.
>>>
>>>
>>>
>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>> philip.j.tromans@gmail.com> wrote:
>>>
>>>> Your rank() is being evaluated map side. Put your distribute by and
>>>> sort by in an inner query, and then evaluate your rank() in an outer query.
>>>>
>>>> Phil.
>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>>> wrote:
>>>>
>>>>> This is the below data in my Table1
>>>>>
>>>>>
>>>>> BID       PID                       TIME
>>>>>
>>>>> --------------+-------------------------+--------------------------------
>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>
>>>>> So If I need to clarify the above scenario- I have data in above table
>>>>> like this-
>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>>>> with different timestamps in red color. So I need the output something like
>>>>> this-
>>>>>
>>>>> Output that I need:-
>>>>>
>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>
>>>>> So Basically If BID and PID are same but with different timestamps,
>>>>> then I need TOP 3 sorted with TIME in descending order
>>>>>
>>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>>> the below query but its not working for me. Can anyone help me on this?
>>>>> *
>>>>> *
>>>>> *
>>>>> *
>>>>> *SELECT buyer_id, item_id, created_time*
>>>>> *    FROM table1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>> int))) = '2012-07-09'*
>>>>> *    DISTRIBUTE BY buyer_id*
>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>
>>>>>
>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
Modified Query that I wrote and its not working as expected output is.

*
*
*SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
*FROM (*
*    SELECT bid, pid, time*
*    FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09'*
*    DISTRIBUTE BY bid,pid,time*
*    SORT BY bid, time desc*
*) a*
*WHERE rank(bid) < 3;*





On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <co...@gmail.com>wrote:

> I wrote this query after modifying it-
>
> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
> UNIX_TIMESTAMP(created_time)*
> *FROM (*
> *    SELECT buyer_id, item_id, created_time*
> *    FROM testingtable1*
> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int)))
> = '2012-07-09'*
> *    DISTRIBUTE BY buyer_id,item_id*
> *    SORT BY buyer_id, created_time desc*
> *) a*
> *WHERE rank(buyer_id) < 3;*
>
> And the output I got is which is sligtly wrong as it is missing two rows-
>
> *1345653    330760137950       2012-07-09 21:43:29*
> *1345653    330760137950       2012-07-09 21:42:29*
> *1345653    330760137950       2012-07-09 21:41:29*
>
> These two rows are missing-
>
> 1345653    110909316904       2012-07-09 21:29:06
> 1345653    221065796761       2012-07-09 19:31:48
>
> So full output should be like this-
>
> *1345653    330760137950       2012-07-09 21:43:29*
> *1345653    330760137950       2012-07-09 21:42:29*
> *1345653    330760137950       2012-07-09 21:41:29*
> 1345653    110909316904       2012-07-09 21:29:06
> 1345653    221065796761       2012-07-09 19:31:48
>
>
> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <co...@gmail.com>wrote:
>
>> Can you show me the exact query that I need to do for this particular
>> problem consideing my scenario? It will be of great help to me. As I am new
>> to HiveQL.
>>
>> I need TOP 3 for those if BID and PID gets matched but with different
>> timestamp.
>>
>>
>>
>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>> philip.j.tromans@gmail.com> wrote:
>>
>>> Your rank() is being evaluated map side. Put your distribute by and sort
>>> by in an inner query, and then evaluate your rank() in an outer query.
>>>
>>> Phil.
>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>>> wrote:
>>>
>>>> This is the below data in my Table1
>>>>
>>>>
>>>> BID       PID                       TIME
>>>>
>>>> --------------+-------------------------+--------------------------------
>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>
>>>> So If I need to clarify the above scenario- I have data in above table
>>>> like this-
>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>>> with different timestamps in red color. So I need the output something like
>>>> this-
>>>>
>>>> Output that I need:-
>>>>
>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>
>>>> So Basically If BID and PID are same but with different timestamps,
>>>> then I need TOP 3 sorted with TIME in descending order
>>>>
>>>> And for this I created rank UDF (User Defined Function). And I wrote
>>>> the below query but its not working for me. Can anyone help me on this?
>>>> *
>>>> *
>>>> *
>>>> *
>>>> *SELECT buyer_id, item_id, created_time*
>>>> *    FROM table1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>> int))) = '2012-07-09'*
>>>> *    DISTRIBUTE BY buyer_id*
>>>> *    SORT BY buyer_id, created_time desc*
>>>>
>>>>
>>
>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
I wrote this query after modifying it-

*SELECT buyer_id, item_id, rank(buyer_id), created_time,
UNIX_TIMESTAMP(created_time)*
*FROM (*
*    SELECT buyer_id, item_id, created_time*
*    FROM testingtable1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) =
'2012-07-09'*
*    DISTRIBUTE BY buyer_id,item_id*
*    SORT BY buyer_id, created_time desc*
*) a*
*WHERE rank(buyer_id) < 3;*

And the output I got is which is sligtly wrong as it is missing two rows-

*1345653    330760137950       2012-07-09 21:43:29*
*1345653    330760137950       2012-07-09 21:42:29*
*1345653    330760137950       2012-07-09 21:41:29*

These two rows are missing-

1345653    110909316904       2012-07-09 21:29:06
1345653    221065796761       2012-07-09 19:31:48

So full output should be like this-

*1345653    330760137950       2012-07-09 21:43:29*
*1345653    330760137950       2012-07-09 21:42:29*
*1345653    330760137950       2012-07-09 21:41:29*
1345653    110909316904       2012-07-09 21:29:06
1345653    221065796761       2012-07-09 19:31:48


On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <co...@gmail.com>wrote:

> Can you show me the exact query that I need to do for this particular
> problem consideing my scenario? It will be of great help to me. As I am new
> to HiveQL.
>
> I need TOP 3 for those if BID and PID gets matched but with different
> timestamp.
>
>
>
> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
> philip.j.tromans@gmail.com> wrote:
>
>> Your rank() is being evaluated map side. Put your distribute by and sort
>> by in an inner query, and then evaluate your rank() in an outer query.
>>
>> Phil.
>> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com>
>> wrote:
>>
>>> This is the below data in my Table1
>>>
>>>
>>> BID       PID                       TIME
>>> --------------+-------------------------+--------------------------------
>>> 1345653   330760137950       2012-07-09 21:42:29
>>> 1345653   330760137950       2012-07-09 21:43:29
>>> 1345653   330760137950       2012-07-09 21:40:29
>>> 1345653   330760137950       2012-07-09 21:41:29
>>> 1345653   110909316904       2012-07-09 21:29:06
>>> 1345653   221065796761       2012-07-09 19:31:48
>>>
>>> So If I need to clarify the above scenario- I have data in above table
>>> like this-
>>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>>> with different timestamps in red color. So I need the output something like
>>> this-
>>>
>>> Output that I need:-
>>>
>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>> *1345653    330760137950       2012-07-09 21:41:29*
>>> 1345653    110909316904       2012-07-09 21:29:06
>>> 1345653    221065796761       2012-07-09 19:31:48
>>>
>>> So Basically If BID and PID are same but with different timestamps, then
>>> I need TOP 3 sorted with TIME in descending order
>>>
>>> And for this I created rank UDF (User Defined Function). And I wrote the
>>> below query but its not working for me. Can anyone help me on this?
>>> *
>>> *
>>> *
>>> *
>>> *SELECT buyer_id, item_id, created_time*
>>> *    FROM table1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>> int))) = '2012-07-09'*
>>> *    DISTRIBUTE BY buyer_id*
>>> *    SORT BY buyer_id, created_time desc*
>>>
>>>
>

Re: Something wrong with my query to get TOP 3?

Posted by comptech geeky <co...@gmail.com>.
Can you show me the exact query that I need to do for this particular
problem consideing my scenario? It will be of great help to me. As I am new
to HiveQL.

I need TOP 3 for those if BID and PID gets matched but with different
timestamp.

-Raihan Jamal

On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans
<ph...@gmail.com>wrote:

> Your rank() is being evaluated map side. Put your distribute by and sort
> by in an inner query, and then evaluate your rank() in an outer query.
>
> Phil.
> On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com> wrote:
>
>> This is the below data in my Table1
>>
>>
>> BID       PID                       TIME
>> --------------+-------------------------+--------------------------------
>> 1345653   330760137950       2012-07-09 21:42:29
>> 1345653   330760137950       2012-07-09 21:43:29
>> 1345653   330760137950       2012-07-09 21:40:29
>> 1345653   330760137950       2012-07-09 21:41:29
>> 1345653   110909316904       2012-07-09 21:29:06
>> 1345653   221065796761       2012-07-09 19:31:48
>>
>> So If I need to clarify the above scenario- I have data in above table
>> like this-
>> For USER *`1345653` *I have this PID `*330760137950` *four times but
>> with different timestamps in red color. So I need the output something like
>> this-
>>
>> Output that I need:-
>>
>> *1345653    330760137950       2012-07-09 21:43:29 *
>> *1345653    330760137950       2012-07-09 21:42:29 *
>> *1345653    330760137950       2012-07-09 21:41:29*
>> 1345653    110909316904       2012-07-09 21:29:06
>> 1345653    221065796761       2012-07-09 19:31:48
>>
>> So Basically If BID and PID are same but with different timestamps, then
>> I need TOP 3 sorted with TIME in descending order
>>
>> And for this I created rank UDF (User Defined Function). And I wrote the
>> below query but its not working for me. Can anyone help me on this?
>> *
>> *
>> *
>> *
>> *SELECT buyer_id, item_id, created_time*
>> *    FROM table1*
>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>> int))) = '2012-07-09'*
>> *    DISTRIBUTE BY buyer_id*
>> *    SORT BY buyer_id, created_time desc*
>>
>>

Re: Something wrong with my query to get TOP 3?

Posted by Philip Tromans <ph...@gmail.com>.
Your rank() is being evaluated map side. Put your distribute by and sort by
in an inner query, and then evaluate your rank() in an outer query.

Phil.
On Jul 19, 2012 9:00 PM, "comptech geeky" <co...@gmail.com> wrote:

> This is the below data in my Table1
>
>
> BID       PID                       TIME
> --------------+-------------------------+--------------------------------
> 1345653   330760137950       2012-07-09 21:42:29
> 1345653   330760137950       2012-07-09 21:43:29
> 1345653   330760137950       2012-07-09 21:40:29
> 1345653   330760137950       2012-07-09 21:41:29
> 1345653   110909316904       2012-07-09 21:29:06
> 1345653   221065796761       2012-07-09 19:31:48
>
> So If I need to clarify the above scenario- I have data in above table
> like this-
> For USER *`1345653` *I have this PID `*330760137950` *four times but with
> different timestamps in red color. So I need the output something like this-
>
> Output that I need:-
>
> *1345653    330760137950       2012-07-09 21:43:29 *
> *1345653    330760137950       2012-07-09 21:42:29 *
> *1345653    330760137950       2012-07-09 21:41:29*
> 1345653    110909316904       2012-07-09 21:29:06
> 1345653    221065796761       2012-07-09 19:31:48
>
> So Basically If BID and PID are same but with different timestamps, then I
> need TOP 3 sorted with TIME in descending order
>
> And for this I created rank UDF (User Defined Function). And I wrote the
> below query but its not working for me. Can anyone help me on this?
> *
> *
> *
> *
> *SELECT buyer_id, item_id, created_time*
> *    FROM table1*
> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int)))
> = '2012-07-09' AND rank(buyer_id) < 3*
> *    DISTRIBUTE BY buyer_id*
> *    SORT BY buyer_id, created_time desc*
>
>