You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Raihan Jamal <ja...@gmail.com> on 2012/07/10 04:30:53 UTC

Find TOP 10 using HiveQL

When I run this query,

SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;


I always get error as-

*FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
expression*

Is there any way around to use TOP 10 or something similar that will work
in HiveQL?



*Raihan Jamal*

Re: Find TOP 10 using HiveQL

Posted by Wouter de Bie <wo...@spotify.com>.
You could use TRANSFORM with a simple awk script:

TRANSFORM(a, b, c, d)
  USING "/usr/bin/awk '
  {if($1!=c){c=$1; a=0}; if(a<20){print $0; a++}}'"


This will create a top 20 for each group.  

--Wouter de Bie
Team Lead Analytics Infrastructure, Spotify
wouter@spotify.com (mailto:wouter@spotify.com)
+46 72 018 0777


On Tuesday, July 10, 2012 at 9:15 AM, Raihan Jamal wrote:

> I am trying that solution. Currently I am running my query to see what result I am getting back with UDF.
>  
>  
>  
>  
> Raihan Jamal  
>  
>  
> On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar <nitinpawar432@gmail.com (mailto:nitinpawar432@gmail.com)> wrote:
> > i thought you managed to solve this with rank??
> >  
> >  
> > On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <jamalraihan@gmail.com (mailto:jamalraihan@gmail.com)> wrote:
> > > Problem with that approach is, with LIMIT 10, If I am putting after desc, then it will get only 10 rows irrespective of BUYER_ID. But I need specifically for each BUYER_ID 10 latest rows.
> > >  
> > >  
> > >  
> > >  
> > > Raihan Jamal
> > >  
> > >  
> > >  
> > > On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <abhishektiwari.btech@gmail.com (mailto:abhishektiwari.btech@gmail.com)> wrote:
> > > > Raihan,
> > > >  
> > > > Andes suggests you use 'limit' after 'desc' ie Hive with sort your query results in descending order and then chop off any more than first 10 records.  
> > > > However, Hive will still run the full scan on data since it has to find the sorted list of records for you.   
> > > >  
> > > >  
> > > > On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <jamalraihan@gmail.com (mailto:jamalraihan@gmail.com)> wrote:
> > > > > This is my below requirement. I need-  Find `TOP 10` data for each `BUYER_ID, So I cannot use LIMIT 10 here in this case.
> > > > >  
> > > > > This is the below table
> > > > >  
> > > > >     CREATE TABLE IF NOT EXISTS TestingTable1  
> > > > >     (  
> > > > >     BUYER_ID BIGINT,
> > > > >     ITEM_ID BIGINT,  
> > > > >     CREATED_TIME STRING
> > > > >     )
> > > > >  
> > > > > And this is the below data in the above table-  
> > > > >  
> > > > >     BUYER_ID    |    ITEM_ID   | CREATED_TIME
> > > > >     ------------+------------------+-----------------------
> > > > >     1015826235      220003038067        2012-07-09 19:40:21,
> > > > >     1015826235      300003861266        2012-07-09 18:19:59,
> > > > >     1015826235      140002997245        2012-07-09 09:23:17,
> > > > >     1015826235      210002448035        2012-07-09 22:21:11,
> > > > >     1015826235      260003553381        2012-07-09 07:09:56,
> > > > >     1015826235      260003553382        2012-07-09 19:40:39,
> > > > >     1015826235      260003553383        2012-07-09 06:58:47,
> > > > >     1015826235      260003553384        2012-07-09 07:28:47,
> > > > >     1015826235      260003553385        2012-07-09 08:48:47,
> > > > >     1015826235      260003553386        2012-07-09 06:38:47,
> > > > >     1015826235      260003553387        2012-07-09 05:38:47,
> > > > >     1015826235      260003553388        2012-07-09 04:55:47,
> > > > >     1015826235      260003553389        2012-07-09 06:54:37,
> > > > >     34512201        597245693           2012-07-09 16:20:21,
> > > > >     34512201        8071787728          2012-07-09 15:19:59,
> > > > >     34512201        5868222883 (tel:5868222883)          2012-07-09 08:23:17,
> > > > >     34512201        2412180494          2012-07-09 22:21:11,
> > > > >     34512201        2422054205          2012-07-09 06:09:56,
> > > > >     34512201        1875744030          2012-07-09 19:40:39,
> > > > >     34512201        5639158173 (tel:5639158173)          2012-07-09 06:58:47,
> > > > >     34512201        5656232360          2012-07-09 07:28:47,  
> > > > >     34512201        959188449 2012-07-09 08:48:47,
> > > > >     34512201        4645350592          2012-07-09 06:38:47,
> > > > >     34512201        5657320532 2012-07-09 05:38:47,
> > > > >     34512201        290419656539 2012-07-09 04:55:47,
> > > > >  
> > > > > If you see the above data in the table, there are only two UNIQUE `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`. I need only 10 latest record basis on the time for that given day for each `BUYER_ID`.  
> > > > >  
> > > > > So for this `BUYER_ID` - `34512201` I need 10 latest record basis on `CREATED_TIME` for that given day only, it means for today's date I need 10 latest record for each `BUYER_ID`.
> > > > >  
> > > > > And each `BUYER_ID` can have any day's data. But I am specifically interested for day before  today's data(means yesterday's date always) by checking at the `CREATED_TIME`
> > > > >  
> > > > > **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**  
> > > > >  
> > > > > Sample Output.
> > > > >  
> > > > >     BUYER_ID    |    ITEM_ID   | CREATED_TIME  
> > > > >     ------------+------------------+-----------------------
> > > > >     34512201        2412180494          2012-07-09 22:21:11
> > > > >     34512201        1875744030          2012-07-09 19:40:39
> > > > >     34512201        597245693           2012-07-09 16:20:21
> > > > >     34512201        8071787728          2012-07-09 15:19:59
> > > > >     34512201        959188449 2012-07-09 08:48:47
> > > > >     34512201        5868222883 (tel:5868222883)          2012-07-09 08:23:17
> > > > >     34512201        5656232360          2012-07-09 07:28:47  
> > > > >     34512201        5639158173 (tel:5639158173)          2012-07-09 06:58:47
> > > > >     34512201        4645350592          2012-07-09 06:38:47
> > > > >     34512201        2422054205          2012-07-09 06:09:56
> > > > >     1015826235      210002448035        2012-07-09 22:21:11
> > > > >     1015826235      260003553382        2012-07-09 19:40:39
> > > > >     1015826235      220003038067        2012-07-09 19:40:21
> > > > >     1015826235      300003861266        2012-07-09 18:19:59
> > > > >     1015826235      140002997245        2012-07-09 09:23:17
> > > > >     1015826235      260003553385        2012-07-09 08:48:47
> > > > >     1015826235      260003553384        2012-07-09 07:28:47
> > > > >     1015826235      260003553381        2012-07-09 07:09:56
> > > > >     1015826235      260003553383        2012-07-09 06:58:47
> > > > >     1015826235      260003553389        2012-07-09 06:54:37
> > > > >  
> > > > >  
> > > > > Raihan Jamal
> > > > >  
> > > > >  
> > > > >  
> > > > > On Mon, Jul 9, 2012 at 7:56 PM, Andes <ylyy-1985@163.com (mailto:ylyy-1985@163.com)> wrote:
> > > > > > hello, you can use "desc" and "limit 10" to filter the top 10.  
> > > > > >   
> > > > > > 2012-07-10
> > > > > >  
> > > > > > Best Regards
> > > > > > Andes
> > > > > >   
> > > > > >  
> > > > > >  
> > > > > > 发件人:Raihan Jamal
> > > > > > 发送时间:2012-07-10 10 (tel:2012-07-10%C2%A010):31
> > > > > > 主题:Find TOP 10 using HiveQL
> > > > > > 收件人:"user"<user@hive.apache.org (mailto:user@hive.apache.org)>
> > > > > > 抄送:
> > > > > >   
> > > > > > When I run this query,
> > > > > >  
> > > > > > SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;
> > > > > >  
> > > > > >  
> > > > > > I always get error as-  
> > > > > >  
> > > > > > FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select expression  
> > > > > >  
> > > > > > Is there any way around to use TOP 10 or something similar that will work in HiveQL?
> > > > > >  
> > > > > >  
> > > > > > Raihan Jamal
> > > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > >  
> >  
> >  
> >  
> > --  
> > Nitin Pawar
> >  
>  


Re: Find TOP 10 using HiveQL

Posted by Raihan Jamal <ja...@gmail.com>.
I am trying that solution. Currently I am running my query to see what
result I am getting back with UDF.




*Raihan Jamal*



On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar <ni...@gmail.com>wrote:

> i thought you managed to solve this with rank??
>
>
> On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Problem with that approach is, with LIMIT 10, If I am putting after desc,
>> then it will get only 10 rows irrespective of BUYER_ID. But I need
>> specifically for each BUYER_ID 10 latest rows.
>>
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <
>> abhishektiwari.btech@gmail.com> wrote:
>>
>>> Raihan,
>>>
>>> Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
>>> results in descending order and then chop off any more than first 10
>>> records.
>>> However, Hive will still run the full scan on data since it has to find
>>> the sorted list of records for you.
>>>
>>>
>>> On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>
>>>> This is my below requirement. I need-  *Find `TOP 10` data for each
>>>> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>>>>
>>>> This is the below table
>>>>
>>>>     CREATE TABLE IF NOT EXISTS TestingTable1
>>>>     (
>>>>     BUYER_ID BIGINT,
>>>>     ITEM_ID BIGINT,
>>>>     CREATED_TIME STRING
>>>>     )
>>>>
>>>> And this is the below data in the above table-
>>>>
>>>>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>>>>     ------------+------------------+-----------------------
>>>>     1015826235      220003038067        2012-07-09 19:40:21,
>>>>     1015826235      300003861266        2012-07-09 18:19:59,
>>>>     1015826235      140002997245        2012-07-09 09:23:17,
>>>>     1015826235      210002448035        2012-07-09 22:21:11,
>>>>     1015826235      260003553381        2012-07-09 07:09:56,
>>>>     1015826235      260003553382        2012-07-09 19:40:39,
>>>>     1015826235      260003553383        2012-07-09 06:58:47,
>>>>     1015826235      260003553384        2012-07-09 07:28:47,
>>>>     1015826235      260003553385        2012-07-09 08:48:47,
>>>>     1015826235      260003553386        2012-07-09 06:38:47,
>>>>     1015826235      260003553387        2012-07-09 05:38:47,
>>>>     1015826235      260003553388        2012-07-09 04:55:47,
>>>>     1015826235      260003553389        2012-07-09 06:54:37,
>>>>     34512201        597245693           2012-07-09 16:20:21,
>>>>     34512201        8071787728          2012-07-09 15:19:59,
>>>>     34512201        5868222883          2012-07-09 08:23:17,
>>>>     34512201        2412180494          2012-07-09 22:21:11,
>>>>     34512201        2422054205          2012-07-09 06:09:56,
>>>>     34512201        1875744030          2012-07-09 19:40:39,
>>>>     34512201        5639158173          2012-07-09 06:58:47,
>>>>     34512201        5656232360          2012-07-09 07:28:47,
>>>>     34512201        959188449 2012-07-09 08:48:47,
>>>>     34512201        4645350592          2012-07-09 06:38:47,
>>>>     34512201        5657320532 2012-07-09 05:38:47,
>>>>     34512201        290419656539 2012-07-09 04:55:47,
>>>>
>>>> If you see the above data in the table, there are only two UNIQUE
>>>> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
>>>> I need only 10 latest record basis on the time for that given day for each
>>>> `BUYER_ID`.
>>>>
>>>> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
>>>> `CREATED_TIME` for that given day only, it means for today's date I need 10
>>>> latest record for each `BUYER_ID`.
>>>>
>>>> And each `BUYER_ID` can have any day's data. But I am specifically
>>>> interested for day before  today's data(means yesterday's date always) by
>>>> checking at the `CREATED_TIME`
>>>>
>>>> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>>>>
>>>> Sample Output.
>>>>
>>>>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>>>>     ------------+------------------+-----------------------
>>>>     34512201        2412180494          2012-07-09 22:21:11
>>>>     34512201        1875744030          2012-07-09 19:40:39
>>>>     34512201        597245693           2012-07-09 16:20:21
>>>>     34512201        8071787728          2012-07-09 15:19:59
>>>>     34512201        959188449 2012-07-09 08:48:47
>>>>     34512201        5868222883          2012-07-09 08:23:17
>>>>     34512201        5656232360          2012-07-09 07:28:47
>>>>     34512201        5639158173          2012-07-09 06:58:47
>>>>     34512201        4645350592          2012-07-09 06:38:47
>>>>     34512201        2422054205          2012-07-09 06:09:56
>>>>     1015826235      210002448035        2012-07-09 22:21:11
>>>>     1015826235      260003553382        2012-07-09 19:40:39
>>>>     1015826235      220003038067        2012-07-09 19:40:21
>>>>     1015826235      300003861266        2012-07-09 18:19:59
>>>>     1015826235      140002997245        2012-07-09 09:23:17
>>>>     1015826235      260003553385        2012-07-09 08:48:47
>>>>     1015826235      260003553384        2012-07-09 07:28:47
>>>>     1015826235      260003553381        2012-07-09 07:09:56
>>>>     1015826235      260003553383        2012-07-09 06:58:47
>>>>     1015826235      260003553389        2012-07-09 06:54:37
>>>>
>>>>
>>>> *Raihan Jamal*
>>>>
>>>>
>>>>
>>>> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>>>>
>>>>> **
>>>>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>>>>
>>>>> 2012-07-10
>>>>>  ------------------------------
>>>>>  **
>>>>> Best Regards
>>>>> Andes
>>>>>
>>>>> **
>>>>>  ------------------------------
>>>>>  *发件人:*Raihan Jamal
>>>>> *发送时间:*2012-07-10 10:31
>>>>> *主题:*Find TOP 10 using HiveQL
>>>>> *收件人:*"user"<us...@hive.apache.org>
>>>>> *抄送:*
>>>>>
>>>>>  When I run this query,
>>>>>
>>>>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>>>>> DESC;
>>>>>
>>>>>
>>>>> I always get error as-
>>>>>
>>>>>  *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in
>>>>> select expression*
>>>>>
>>>>> Is there any way around to use TOP 10 or something similar that will
>>>>> work in HiveQL?
>>>>>
>>>>>
>>>>>
>>>>> *Raihan Jamal*
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Nitin Pawar
>
>

Re: Find TOP 10 using HiveQL

Posted by Nitin Pawar <ni...@gmail.com>.
i thought you managed to solve this with rank??

On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <ja...@gmail.com>wrote:

> Problem with that approach is, with LIMIT 10, If I am putting after desc,
> then it will get only 10 rows irrespective of BUYER_ID. But I need
> specifically for each BUYER_ID 10 latest rows.
>
>
>
>
> *Raihan Jamal*
>
>
>
> On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <
> abhishektiwari.btech@gmail.com> wrote:
>
>> Raihan,
>>
>> Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
>> results in descending order and then chop off any more than first 10
>> records.
>> However, Hive will still run the full scan on data since it has to find
>> the sorted list of records for you.
>>
>>
>> On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> This is my below requirement. I need-  *Find `TOP 10` data for each
>>> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>>>
>>> This is the below table
>>>
>>>     CREATE TABLE IF NOT EXISTS TestingTable1
>>>     (
>>>     BUYER_ID BIGINT,
>>>     ITEM_ID BIGINT,
>>>     CREATED_TIME STRING
>>>     )
>>>
>>> And this is the below data in the above table-
>>>
>>>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>>>     ------------+------------------+-----------------------
>>>     1015826235      220003038067        2012-07-09 19:40:21,
>>>     1015826235      300003861266        2012-07-09 18:19:59,
>>>     1015826235      140002997245        2012-07-09 09:23:17,
>>>     1015826235      210002448035        2012-07-09 22:21:11,
>>>     1015826235      260003553381        2012-07-09 07:09:56,
>>>     1015826235      260003553382        2012-07-09 19:40:39,
>>>     1015826235      260003553383        2012-07-09 06:58:47,
>>>     1015826235      260003553384        2012-07-09 07:28:47,
>>>     1015826235      260003553385        2012-07-09 08:48:47,
>>>     1015826235      260003553386        2012-07-09 06:38:47,
>>>     1015826235      260003553387        2012-07-09 05:38:47,
>>>     1015826235      260003553388        2012-07-09 04:55:47,
>>>     1015826235      260003553389        2012-07-09 06:54:37,
>>>     34512201        597245693           2012-07-09 16:20:21,
>>>     34512201        8071787728          2012-07-09 15:19:59,
>>>     34512201        5868222883          2012-07-09 08:23:17,
>>>     34512201        2412180494          2012-07-09 22:21:11,
>>>     34512201        2422054205          2012-07-09 06:09:56,
>>>     34512201        1875744030          2012-07-09 19:40:39,
>>>     34512201        5639158173          2012-07-09 06:58:47,
>>>     34512201        5656232360          2012-07-09 07:28:47,
>>>     34512201        959188449 2012-07-09 08:48:47,
>>>     34512201        4645350592          2012-07-09 06:38:47,
>>>     34512201        5657320532 2012-07-09 05:38:47,
>>>     34512201        290419656539 2012-07-09 04:55:47,
>>>
>>> If you see the above data in the table, there are only two UNIQUE
>>> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
>>> I need only 10 latest record basis on the time for that given day for each
>>> `BUYER_ID`.
>>>
>>> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
>>> `CREATED_TIME` for that given day only, it means for today's date I need 10
>>> latest record for each `BUYER_ID`.
>>>
>>> And each `BUYER_ID` can have any day's data. But I am specifically
>>> interested for day before  today's data(means yesterday's date always) by
>>> checking at the `CREATED_TIME`
>>>
>>> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>>>
>>> Sample Output.
>>>
>>>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>>>     ------------+------------------+-----------------------
>>>     34512201        2412180494          2012-07-09 22:21:11
>>>     34512201        1875744030          2012-07-09 19:40:39
>>>     34512201        597245693           2012-07-09 16:20:21
>>>     34512201        8071787728          2012-07-09 15:19:59
>>>     34512201        959188449 2012-07-09 08:48:47
>>>     34512201        5868222883          2012-07-09 08:23:17
>>>     34512201        5656232360          2012-07-09 07:28:47
>>>     34512201        5639158173          2012-07-09 06:58:47
>>>     34512201        4645350592          2012-07-09 06:38:47
>>>     34512201        2422054205          2012-07-09 06:09:56
>>>     1015826235      210002448035        2012-07-09 22:21:11
>>>     1015826235      260003553382        2012-07-09 19:40:39
>>>     1015826235      220003038067        2012-07-09 19:40:21
>>>     1015826235      300003861266        2012-07-09 18:19:59
>>>     1015826235      140002997245        2012-07-09 09:23:17
>>>     1015826235      260003553385        2012-07-09 08:48:47
>>>     1015826235      260003553384        2012-07-09 07:28:47
>>>     1015826235      260003553381        2012-07-09 07:09:56
>>>     1015826235      260003553383        2012-07-09 06:58:47
>>>     1015826235      260003553389        2012-07-09 06:54:37
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>>>
>>>> **
>>>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>>>
>>>> 2012-07-10
>>>>  ------------------------------
>>>>  **
>>>> Best Regards
>>>> Andes
>>>>
>>>> **
>>>>  ------------------------------
>>>>  *发件人:*Raihan Jamal
>>>> *发送时间:*2012-07-10 10:31
>>>> *主题:*Find TOP 10 using HiveQL
>>>> *收件人:*"user"<us...@hive.apache.org>
>>>> *抄送:*
>>>>
>>>>  When I run this query,
>>>>
>>>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>>>> DESC;
>>>>
>>>>
>>>> I always get error as-
>>>>
>>>>  *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
>>>> expression*
>>>>
>>>> Is there any way around to use TOP 10 or something similar that will
>>>> work in HiveQL?
>>>>
>>>>
>>>>
>>>> *Raihan Jamal*
>>>>
>>>>
>>>
>>
>>
>>
>>
>>
>


-- 
Nitin Pawar

Re: Find TOP 10 using HiveQL

Posted by Raihan Jamal <ja...@gmail.com>.
Problem with that approach is, with LIMIT 10, If I am putting after desc,
then it will get only 10 rows irrespective of BUYER_ID. But I need
specifically for each BUYER_ID 10 latest rows.




*Raihan Jamal*



On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <
abhishektiwari.btech@gmail.com> wrote:

> Raihan,
>
> Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
> results in descending order and then chop off any more than first 10
> records.
> However, Hive will still run the full scan on data since it has to find
> the sorted list of records for you.
>
>
> On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> This is my below requirement. I need-  *Find `TOP 10` data for each
>> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>>
>> This is the below table
>>
>>     CREATE TABLE IF NOT EXISTS TestingTable1
>>     (
>>     BUYER_ID BIGINT,
>>     ITEM_ID BIGINT,
>>     CREATED_TIME STRING
>>     )
>>
>> And this is the below data in the above table-
>>
>>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>>     ------------+------------------+-----------------------
>>     1015826235      220003038067        2012-07-09 19:40:21,
>>     1015826235      300003861266        2012-07-09 18:19:59,
>>     1015826235      140002997245        2012-07-09 09:23:17,
>>     1015826235      210002448035        2012-07-09 22:21:11,
>>     1015826235      260003553381        2012-07-09 07:09:56,
>>     1015826235      260003553382        2012-07-09 19:40:39,
>>     1015826235      260003553383        2012-07-09 06:58:47,
>>     1015826235      260003553384        2012-07-09 07:28:47,
>>     1015826235      260003553385        2012-07-09 08:48:47,
>>     1015826235      260003553386        2012-07-09 06:38:47,
>>     1015826235      260003553387        2012-07-09 05:38:47,
>>     1015826235      260003553388        2012-07-09 04:55:47,
>>     1015826235      260003553389        2012-07-09 06:54:37,
>>     34512201        597245693           2012-07-09 16:20:21,
>>     34512201        8071787728          2012-07-09 15:19:59,
>>     34512201        5868222883          2012-07-09 08:23:17,
>>     34512201        2412180494          2012-07-09 22:21:11,
>>     34512201        2422054205          2012-07-09 06:09:56,
>>     34512201        1875744030          2012-07-09 19:40:39,
>>     34512201        5639158173          2012-07-09 06:58:47,
>>     34512201        5656232360          2012-07-09 07:28:47,
>>     34512201        959188449 2012-07-09 08:48:47,
>>     34512201        4645350592          2012-07-09 06:38:47,
>>     34512201        5657320532 2012-07-09 05:38:47,
>>     34512201        290419656539 2012-07-09 04:55:47,
>>
>> If you see the above data in the table, there are only two UNIQUE
>> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
>> I need only 10 latest record basis on the time for that given day for each
>> `BUYER_ID`.
>>
>> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
>> `CREATED_TIME` for that given day only, it means for today's date I need 10
>> latest record for each `BUYER_ID`.
>>
>> And each `BUYER_ID` can have any day's data. But I am specifically
>> interested for day before  today's data(means yesterday's date always) by
>> checking at the `CREATED_TIME`
>>
>> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>>
>> Sample Output.
>>
>>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>>     ------------+------------------+-----------------------
>>     34512201        2412180494          2012-07-09 22:21:11
>>     34512201        1875744030          2012-07-09 19:40:39
>>     34512201        597245693           2012-07-09 16:20:21
>>     34512201        8071787728          2012-07-09 15:19:59
>>     34512201        959188449 2012-07-09 08:48:47
>>     34512201        5868222883          2012-07-09 08:23:17
>>     34512201        5656232360          2012-07-09 07:28:47
>>     34512201        5639158173          2012-07-09 06:58:47
>>     34512201        4645350592          2012-07-09 06:38:47
>>     34512201        2422054205          2012-07-09 06:09:56
>>     1015826235      210002448035        2012-07-09 22:21:11
>>     1015826235      260003553382        2012-07-09 19:40:39
>>     1015826235      220003038067        2012-07-09 19:40:21
>>     1015826235      300003861266        2012-07-09 18:19:59
>>     1015826235      140002997245        2012-07-09 09:23:17
>>     1015826235      260003553385        2012-07-09 08:48:47
>>     1015826235      260003553384        2012-07-09 07:28:47
>>     1015826235      260003553381        2012-07-09 07:09:56
>>     1015826235      260003553383        2012-07-09 06:58:47
>>     1015826235      260003553389        2012-07-09 06:54:37
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>>
>>> **
>>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>>
>>> 2012-07-10
>>>  ------------------------------
>>>  **
>>> Best Regards
>>> Andes
>>>
>>> **
>>>  ------------------------------
>>>  *发件人:*Raihan Jamal
>>> *发送时间:*2012-07-10 10:31
>>> *主题:*Find TOP 10 using HiveQL
>>> *收件人:*"user"<us...@hive.apache.org>
>>> *抄送:*
>>>
>>>  When I run this query,
>>>
>>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>>> DESC;
>>>
>>>
>>> I always get error as-
>>>
>>>  *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
>>> expression*
>>>
>>> Is there any way around to use TOP 10 or something similar that will
>>> work in HiveQL?
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>
>
>
>
>
>

Re: Find TOP 10 using HiveQL

Posted by Abhishek Tiwari <ab...@gmail.com>.
Raihan,

Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
results in descending order and then chop off any more than first 10
records.
However, Hive will still run the full scan on data since it has to find the
sorted list of records for you.

On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com> wrote:

> This is my below requirement. I need-  *Find `TOP 10` data for each
> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>
> This is the below table
>
>     CREATE TABLE IF NOT EXISTS TestingTable1
>     (
>     BUYER_ID BIGINT,
>     ITEM_ID BIGINT,
>     CREATED_TIME STRING
>     )
>
> And this is the below data in the above table-
>
>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>     ------------+------------------+-----------------------
>     1015826235      220003038067        2012-07-09 19:40:21,
>     1015826235      300003861266        2012-07-09 18:19:59,
>     1015826235      140002997245        2012-07-09 09:23:17,
>     1015826235      210002448035        2012-07-09 22:21:11,
>     1015826235      260003553381        2012-07-09 07:09:56,
>     1015826235      260003553382        2012-07-09 19:40:39,
>     1015826235      260003553383        2012-07-09 06:58:47,
>     1015826235      260003553384        2012-07-09 07:28:47,
>     1015826235      260003553385        2012-07-09 08:48:47,
>     1015826235      260003553386        2012-07-09 06:38:47,
>     1015826235      260003553387        2012-07-09 05:38:47,
>     1015826235      260003553388        2012-07-09 04:55:47,
>     1015826235      260003553389        2012-07-09 06:54:37,
>     34512201        597245693           2012-07-09 16:20:21,
>     34512201        8071787728          2012-07-09 15:19:59,
>     34512201        5868222883          2012-07-09 08:23:17,
>     34512201        2412180494          2012-07-09 22:21:11,
>     34512201        2422054205          2012-07-09 06:09:56,
>     34512201        1875744030          2012-07-09 19:40:39,
>     34512201        5639158173          2012-07-09 06:58:47,
>     34512201        5656232360          2012-07-09 07:28:47,
>     34512201        959188449 2012-07-09 08:48:47,
>     34512201        4645350592          2012-07-09 06:38:47,
>     34512201        5657320532 2012-07-09 05:38:47,
>     34512201        290419656539 2012-07-09 04:55:47,
>
> If you see the above data in the table, there are only two UNIQUE
> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
> I need only 10 latest record basis on the time for that given day for each
> `BUYER_ID`.
>
> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
> `CREATED_TIME` for that given day only, it means for today's date I need 10
> latest record for each `BUYER_ID`.
>
> And each `BUYER_ID` can have any day's data. But I am specifically
> interested for day before  today's data(means yesterday's date always) by
> checking at the `CREATED_TIME`
>
> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>
> Sample Output.
>
>     BUYER_ID    |    ITEM_ID   | CREATED_TIME
>     ------------+------------------+-----------------------
>     34512201        2412180494          2012-07-09 22:21:11
>     34512201        1875744030          2012-07-09 19:40:39
>     34512201        597245693           2012-07-09 16:20:21
>     34512201        8071787728          2012-07-09 15:19:59
>     34512201        959188449 2012-07-09 08:48:47
>     34512201        5868222883          2012-07-09 08:23:17
>     34512201        5656232360          2012-07-09 07:28:47
>     34512201        5639158173          2012-07-09 06:58:47
>     34512201        4645350592          2012-07-09 06:38:47
>     34512201        2422054205          2012-07-09 06:09:56
>     1015826235      210002448035        2012-07-09 22:21:11
>     1015826235      260003553382        2012-07-09 19:40:39
>     1015826235      220003038067        2012-07-09 19:40:21
>     1015826235      300003861266        2012-07-09 18:19:59
>     1015826235      140002997245        2012-07-09 09:23:17
>     1015826235      260003553385        2012-07-09 08:48:47
>     1015826235      260003553384        2012-07-09 07:28:47
>     1015826235      260003553381        2012-07-09 07:09:56
>     1015826235      260003553383        2012-07-09 06:58:47
>     1015826235      260003553389        2012-07-09 06:54:37
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>
>> **
>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>
>> 2012-07-10
>>  ------------------------------
>>  **
>> Best Regards
>> Andes
>>
>> **
>>  ------------------------------
>>  *发件人:*Raihan Jamal
>> *发送时间:*2012-07-10 10:31
>> *主题:*Find TOP 10 using HiveQL
>> *收件人:*"user"<us...@hive.apache.org>
>> *抄送:*
>>
>>  When I run this query,
>>
>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>> DESC;
>>
>>
>> I always get error as-
>>
>>  *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
>> expression*
>>
>> Is there any way around to use TOP 10 or something similar that will work
>> in HiveQL?
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>

Re: Find TOP 10 using HiveQL

Posted by Raihan Jamal <ja...@gmail.com>.
This is my below requirement. I need-  *Find `TOP 10` data for each
`BUYER_ID, *So I cannot use LIMIT 10 here in this case.

This is the below table

    CREATE TABLE IF NOT EXISTS TestingTable1
    (
    BUYER_ID BIGINT,
    ITEM_ID BIGINT,
    CREATED_TIME STRING
    )

And this is the below data in the above table-

    BUYER_ID    |    ITEM_ID   | CREATED_TIME
    ------------+------------------+-----------------------
    1015826235      220003038067        2012-07-09 19:40:21,
    1015826235      300003861266        2012-07-09 18:19:59,
    1015826235      140002997245        2012-07-09 09:23:17,
    1015826235      210002448035        2012-07-09 22:21:11,
    1015826235      260003553381        2012-07-09 07:09:56,
    1015826235      260003553382        2012-07-09 19:40:39,
    1015826235      260003553383        2012-07-09 06:58:47,
    1015826235      260003553384        2012-07-09 07:28:47,
    1015826235      260003553385        2012-07-09 08:48:47,
    1015826235      260003553386        2012-07-09 06:38:47,
    1015826235      260003553387        2012-07-09 05:38:47,
    1015826235      260003553388        2012-07-09 04:55:47,
    1015826235      260003553389        2012-07-09 06:54:37,
    34512201        597245693           2012-07-09 16:20:21,
    34512201        8071787728          2012-07-09 15:19:59,
    34512201        5868222883          2012-07-09 08:23:17,
    34512201        2412180494          2012-07-09 22:21:11,
    34512201        2422054205          2012-07-09 06:09:56,
    34512201        1875744030          2012-07-09 19:40:39,
    34512201        5639158173          2012-07-09 06:58:47,
    34512201        5656232360          2012-07-09 07:28:47,
    34512201        959188449 2012-07-09 08:48:47,
    34512201        4645350592          2012-07-09 06:38:47,
    34512201        5657320532 2012-07-09 05:38:47,
    34512201        290419656539 2012-07-09 04:55:47,

If you see the above data in the table, there are only two UNIQUE
`BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
I need only 10 latest record basis on the time for that given day for each
`BUYER_ID`.

So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
`CREATED_TIME` for that given day only, it means for today's date I need 10
latest record for each `BUYER_ID`.

And each `BUYER_ID` can have any day's data. But I am specifically
interested for day before  today's data(means yesterday's date always) by
checking at the `CREATED_TIME`

**Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**

Sample Output.

    BUYER_ID    |    ITEM_ID   | CREATED_TIME
    ------------+------------------+-----------------------
    34512201        2412180494          2012-07-09 22:21:11
    34512201        1875744030          2012-07-09 19:40:39
    34512201        597245693           2012-07-09 16:20:21
    34512201        8071787728          2012-07-09 15:19:59
    34512201        959188449 2012-07-09 08:48:47
    34512201        5868222883          2012-07-09 08:23:17
    34512201        5656232360          2012-07-09 07:28:47
    34512201        5639158173          2012-07-09 06:58:47
    34512201        4645350592          2012-07-09 06:38:47
    34512201        2422054205          2012-07-09 06:09:56
    1015826235      210002448035        2012-07-09 22:21:11
    1015826235      260003553382        2012-07-09 19:40:39
    1015826235      220003038067        2012-07-09 19:40:21
    1015826235      300003861266        2012-07-09 18:19:59
    1015826235      140002997245        2012-07-09 09:23:17
    1015826235      260003553385        2012-07-09 08:48:47
    1015826235      260003553384        2012-07-09 07:28:47
    1015826235      260003553381        2012-07-09 07:09:56
    1015826235      260003553383        2012-07-09 06:58:47
    1015826235      260003553389        2012-07-09 06:54:37


*Raihan Jamal*



On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:

> **
> hello, you can use "desc" and "limit 10" to filter the top 10.
>
> 2012-07-10
>  ------------------------------
>  **
> Best Regards
> Andes
>
> **
>  ------------------------------
>  *发件人:*Raihan Jamal
> *发送时间:*2012-07-10 10:31
> *主题:*Find TOP 10 using HiveQL
> *收件人:*"user"<us...@hive.apache.org>
> *抄送:*
>
>  When I run this query,
>
> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
> DESC;
>
>
> I always get error as-
>
>  *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
> expression*
>
> Is there any way around to use TOP 10 or something similar that will work
> in HiveQL?
>
>
>
> *Raihan Jamal*
>
>

回复: Find TOP 10 using HiveQL

Posted by Andes <yl...@163.com>.
hello, you can use "desc" and "limit 10" to filter the top 10.

2012-07-10



Best Regards
Andes




发件人:Raihan Jamal
发送时间:2012-07-10 10:31
主题:Find TOP 10 using HiveQL
收件人:"user"<us...@hive.apache.org>
抄送:

When I run this query,


SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;





I always get error as-


FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select expression


Is there any way around to use TOP 10 or something similar that will work in HiveQL?



Raihan Jamal