You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Raihan Jamal <ja...@gmail.com> on 2012/07/10 04:30:53 UTC
Find TOP 10 using HiveQL
When I run this query,
SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;
I always get error as-
*FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
expression*
Is there any way around to use TOP 10 or something similar that will work
in HiveQL?
*Raihan Jamal*
Re: Find TOP 10 using HiveQL
Posted by Wouter de Bie <wo...@spotify.com>.
You could use TRANSFORM with a simple awk script:
TRANSFORM(a, b, c, d)
USING "/usr/bin/awk '
{if($1!=c){c=$1; a=0}; if(a<20){print $0; a++}}'"
This will create a top 20 for each group.
--Wouter de Bie
Team Lead Analytics Infrastructure, Spotify
wouter@spotify.com (mailto:wouter@spotify.com)
+46 72 018 0777
On Tuesday, July 10, 2012 at 9:15 AM, Raihan Jamal wrote:
> I am trying that solution. Currently I am running my query to see what result I am getting back with UDF.
>
>
>
>
> Raihan Jamal
>
>
> On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar <nitinpawar432@gmail.com (mailto:nitinpawar432@gmail.com)> wrote:
> > i thought you managed to solve this with rank??
> >
> >
> > On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <jamalraihan@gmail.com (mailto:jamalraihan@gmail.com)> wrote:
> > > Problem with that approach is, with LIMIT 10, If I am putting after desc, then it will get only 10 rows irrespective of BUYER_ID. But I need specifically for each BUYER_ID 10 latest rows.
> > >
> > >
> > >
> > >
> > > Raihan Jamal
> > >
> > >
> > >
> > > On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <abhishektiwari.btech@gmail.com (mailto:abhishektiwari.btech@gmail.com)> wrote:
> > > > Raihan,
> > > >
> > > > Andes suggests you use 'limit' after 'desc' ie Hive with sort your query results in descending order and then chop off any more than first 10 records.
> > > > However, Hive will still run the full scan on data since it has to find the sorted list of records for you.
> > > >
> > > >
> > > > On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <jamalraihan@gmail.com (mailto:jamalraihan@gmail.com)> wrote:
> > > > > This is my below requirement. I need- Find `TOP 10` data for each `BUYER_ID, So I cannot use LIMIT 10 here in this case.
> > > > >
> > > > > This is the below table
> > > > >
> > > > > CREATE TABLE IF NOT EXISTS TestingTable1
> > > > > (
> > > > > BUYER_ID BIGINT,
> > > > > ITEM_ID BIGINT,
> > > > > CREATED_TIME STRING
> > > > > )
> > > > >
> > > > > And this is the below data in the above table-
> > > > >
> > > > > BUYER_ID | ITEM_ID | CREATED_TIME
> > > > > ------------+------------------+-----------------------
> > > > > 1015826235 220003038067 2012-07-09 19:40:21,
> > > > > 1015826235 300003861266 2012-07-09 18:19:59,
> > > > > 1015826235 140002997245 2012-07-09 09:23:17,
> > > > > 1015826235 210002448035 2012-07-09 22:21:11,
> > > > > 1015826235 260003553381 2012-07-09 07:09:56,
> > > > > 1015826235 260003553382 2012-07-09 19:40:39,
> > > > > 1015826235 260003553383 2012-07-09 06:58:47,
> > > > > 1015826235 260003553384 2012-07-09 07:28:47,
> > > > > 1015826235 260003553385 2012-07-09 08:48:47,
> > > > > 1015826235 260003553386 2012-07-09 06:38:47,
> > > > > 1015826235 260003553387 2012-07-09 05:38:47,
> > > > > 1015826235 260003553388 2012-07-09 04:55:47,
> > > > > 1015826235 260003553389 2012-07-09 06:54:37,
> > > > > 34512201 597245693 2012-07-09 16:20:21,
> > > > > 34512201 8071787728 2012-07-09 15:19:59,
> > > > > 34512201 5868222883 (tel:5868222883) 2012-07-09 08:23:17,
> > > > > 34512201 2412180494 2012-07-09 22:21:11,
> > > > > 34512201 2422054205 2012-07-09 06:09:56,
> > > > > 34512201 1875744030 2012-07-09 19:40:39,
> > > > > 34512201 5639158173 (tel:5639158173) 2012-07-09 06:58:47,
> > > > > 34512201 5656232360 2012-07-09 07:28:47,
> > > > > 34512201 959188449 2012-07-09 08:48:47,
> > > > > 34512201 4645350592 2012-07-09 06:38:47,
> > > > > 34512201 5657320532 2012-07-09 05:38:47,
> > > > > 34512201 290419656539 2012-07-09 04:55:47,
> > > > >
> > > > > If you see the above data in the table, there are only two UNIQUE `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`. I need only 10 latest record basis on the time for that given day for each `BUYER_ID`.
> > > > >
> > > > > So for this `BUYER_ID` - `34512201` I need 10 latest record basis on `CREATED_TIME` for that given day only, it means for today's date I need 10 latest record for each `BUYER_ID`.
> > > > >
> > > > > And each `BUYER_ID` can have any day's data. But I am specifically interested for day before today's data(means yesterday's date always) by checking at the `CREATED_TIME`
> > > > >
> > > > > **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
> > > > >
> > > > > Sample Output.
> > > > >
> > > > > BUYER_ID | ITEM_ID | CREATED_TIME
> > > > > ------------+------------------+-----------------------
> > > > > 34512201 2412180494 2012-07-09 22:21:11
> > > > > 34512201 1875744030 2012-07-09 19:40:39
> > > > > 34512201 597245693 2012-07-09 16:20:21
> > > > > 34512201 8071787728 2012-07-09 15:19:59
> > > > > 34512201 959188449 2012-07-09 08:48:47
> > > > > 34512201 5868222883 (tel:5868222883) 2012-07-09 08:23:17
> > > > > 34512201 5656232360 2012-07-09 07:28:47
> > > > > 34512201 5639158173 (tel:5639158173) 2012-07-09 06:58:47
> > > > > 34512201 4645350592 2012-07-09 06:38:47
> > > > > 34512201 2422054205 2012-07-09 06:09:56
> > > > > 1015826235 210002448035 2012-07-09 22:21:11
> > > > > 1015826235 260003553382 2012-07-09 19:40:39
> > > > > 1015826235 220003038067 2012-07-09 19:40:21
> > > > > 1015826235 300003861266 2012-07-09 18:19:59
> > > > > 1015826235 140002997245 2012-07-09 09:23:17
> > > > > 1015826235 260003553385 2012-07-09 08:48:47
> > > > > 1015826235 260003553384 2012-07-09 07:28:47
> > > > > 1015826235 260003553381 2012-07-09 07:09:56
> > > > > 1015826235 260003553383 2012-07-09 06:58:47
> > > > > 1015826235 260003553389 2012-07-09 06:54:37
> > > > >
> > > > >
> > > > > Raihan Jamal
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 9, 2012 at 7:56 PM, Andes <ylyy-1985@163.com (mailto:ylyy-1985@163.com)> wrote:
> > > > > > hello, you can use "desc" and "limit 10" to filter the top 10.
> > > > > >
> > > > > > 2012-07-10
> > > > > >
> > > > > > Best Regards
> > > > > > Andes
> > > > > >
> > > > > >
> > > > > >
> > > > > > 发件人:Raihan Jamal
> > > > > > 发送时间:2012-07-10 10 (tel:2012-07-10%C2%A010):31
> > > > > > 主题:Find TOP 10 using HiveQL
> > > > > > 收件人:"user"<user@hive.apache.org (mailto:user@hive.apache.org)>
> > > > > > 抄送:
> > > > > >
> > > > > > When I run this query,
> > > > > >
> > > > > > SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;
> > > > > >
> > > > > >
> > > > > > I always get error as-
> > > > > >
> > > > > > FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select expression
> > > > > >
> > > > > > Is there any way around to use TOP 10 or something similar that will work in HiveQL?
> > > > > >
> > > > > >
> > > > > > Raihan Jamal
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Nitin Pawar
> >
>
Re: Find TOP 10 using HiveQL
Posted by Raihan Jamal <ja...@gmail.com>.
I am trying that solution. Currently I am running my query to see what
result I am getting back with UDF.
*Raihan Jamal*
On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar <ni...@gmail.com>wrote:
> i thought you managed to solve this with rank??
>
>
> On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> Problem with that approach is, with LIMIT 10, If I am putting after desc,
>> then it will get only 10 rows irrespective of BUYER_ID. But I need
>> specifically for each BUYER_ID 10 latest rows.
>>
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <
>> abhishektiwari.btech@gmail.com> wrote:
>>
>>> Raihan,
>>>
>>> Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
>>> results in descending order and then chop off any more than first 10
>>> records.
>>> However, Hive will still run the full scan on data since it has to find
>>> the sorted list of records for you.
>>>
>>>
>>> On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>>
>>>> This is my below requirement. I need- *Find `TOP 10` data for each
>>>> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>>>>
>>>> This is the below table
>>>>
>>>> CREATE TABLE IF NOT EXISTS TestingTable1
>>>> (
>>>> BUYER_ID BIGINT,
>>>> ITEM_ID BIGINT,
>>>> CREATED_TIME STRING
>>>> )
>>>>
>>>> And this is the below data in the above table-
>>>>
>>>> BUYER_ID | ITEM_ID | CREATED_TIME
>>>> ------------+------------------+-----------------------
>>>> 1015826235 220003038067 2012-07-09 19:40:21,
>>>> 1015826235 300003861266 2012-07-09 18:19:59,
>>>> 1015826235 140002997245 2012-07-09 09:23:17,
>>>> 1015826235 210002448035 2012-07-09 22:21:11,
>>>> 1015826235 260003553381 2012-07-09 07:09:56,
>>>> 1015826235 260003553382 2012-07-09 19:40:39,
>>>> 1015826235 260003553383 2012-07-09 06:58:47,
>>>> 1015826235 260003553384 2012-07-09 07:28:47,
>>>> 1015826235 260003553385 2012-07-09 08:48:47,
>>>> 1015826235 260003553386 2012-07-09 06:38:47,
>>>> 1015826235 260003553387 2012-07-09 05:38:47,
>>>> 1015826235 260003553388 2012-07-09 04:55:47,
>>>> 1015826235 260003553389 2012-07-09 06:54:37,
>>>> 34512201 597245693 2012-07-09 16:20:21,
>>>> 34512201 8071787728 2012-07-09 15:19:59,
>>>> 34512201 5868222883 2012-07-09 08:23:17,
>>>> 34512201 2412180494 2012-07-09 22:21:11,
>>>> 34512201 2422054205 2012-07-09 06:09:56,
>>>> 34512201 1875744030 2012-07-09 19:40:39,
>>>> 34512201 5639158173 2012-07-09 06:58:47,
>>>> 34512201 5656232360 2012-07-09 07:28:47,
>>>> 34512201 959188449 2012-07-09 08:48:47,
>>>> 34512201 4645350592 2012-07-09 06:38:47,
>>>> 34512201 5657320532 2012-07-09 05:38:47,
>>>> 34512201 290419656539 2012-07-09 04:55:47,
>>>>
>>>> If you see the above data in the table, there are only two UNIQUE
>>>> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
>>>> I need only 10 latest record basis on the time for that given day for each
>>>> `BUYER_ID`.
>>>>
>>>> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
>>>> `CREATED_TIME` for that given day only, it means for today's date I need 10
>>>> latest record for each `BUYER_ID`.
>>>>
>>>> And each `BUYER_ID` can have any day's data. But I am specifically
>>>> interested for day before today's data(means yesterday's date always) by
>>>> checking at the `CREATED_TIME`
>>>>
>>>> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>>>>
>>>> Sample Output.
>>>>
>>>> BUYER_ID | ITEM_ID | CREATED_TIME
>>>> ------------+------------------+-----------------------
>>>> 34512201 2412180494 2012-07-09 22:21:11
>>>> 34512201 1875744030 2012-07-09 19:40:39
>>>> 34512201 597245693 2012-07-09 16:20:21
>>>> 34512201 8071787728 2012-07-09 15:19:59
>>>> 34512201 959188449 2012-07-09 08:48:47
>>>> 34512201 5868222883 2012-07-09 08:23:17
>>>> 34512201 5656232360 2012-07-09 07:28:47
>>>> 34512201 5639158173 2012-07-09 06:58:47
>>>> 34512201 4645350592 2012-07-09 06:38:47
>>>> 34512201 2422054205 2012-07-09 06:09:56
>>>> 1015826235 210002448035 2012-07-09 22:21:11
>>>> 1015826235 260003553382 2012-07-09 19:40:39
>>>> 1015826235 220003038067 2012-07-09 19:40:21
>>>> 1015826235 300003861266 2012-07-09 18:19:59
>>>> 1015826235 140002997245 2012-07-09 09:23:17
>>>> 1015826235 260003553385 2012-07-09 08:48:47
>>>> 1015826235 260003553384 2012-07-09 07:28:47
>>>> 1015826235 260003553381 2012-07-09 07:09:56
>>>> 1015826235 260003553383 2012-07-09 06:58:47
>>>> 1015826235 260003553389 2012-07-09 06:54:37
>>>>
>>>>
>>>> *Raihan Jamal*
>>>>
>>>>
>>>>
>>>> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>>>>
>>>>> **
>>>>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>>>>
>>>>> 2012-07-10
>>>>> ------------------------------
>>>>> **
>>>>> Best Regards
>>>>> Andes
>>>>>
>>>>> **
>>>>> ------------------------------
>>>>> *发件人:*Raihan Jamal
>>>>> *发送时间:*2012-07-10 10:31
>>>>> *主题:*Find TOP 10 using HiveQL
>>>>> *收件人:*"user"<us...@hive.apache.org>
>>>>> *抄送:*
>>>>>
>>>>> When I run this query,
>>>>>
>>>>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>>>>> DESC;
>>>>>
>>>>>
>>>>> I always get error as-
>>>>>
>>>>> *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in
>>>>> select expression*
>>>>>
>>>>> Is there any way around to use TOP 10 or something similar that will
>>>>> work in HiveQL?
>>>>>
>>>>>
>>>>>
>>>>> *Raihan Jamal*
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Nitin Pawar
>
>
Re: Find TOP 10 using HiveQL
Posted by Nitin Pawar <ni...@gmail.com>.
i thought you managed to solve this with rank??
On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <ja...@gmail.com>wrote:
> Problem with that approach is, with LIMIT 10, If I am putting after desc,
> then it will get only 10 rows irrespective of BUYER_ID. But I need
> specifically for each BUYER_ID 10 latest rows.
>
>
>
>
> *Raihan Jamal*
>
>
>
> On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <
> abhishektiwari.btech@gmail.com> wrote:
>
>> Raihan,
>>
>> Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
>> results in descending order and then chop off any more than first 10
>> records.
>> However, Hive will still run the full scan on data since it has to find
>> the sorted list of records for you.
>>
>>
>> On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com>wrote:
>>
>>> This is my below requirement. I need- *Find `TOP 10` data for each
>>> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>>>
>>> This is the below table
>>>
>>> CREATE TABLE IF NOT EXISTS TestingTable1
>>> (
>>> BUYER_ID BIGINT,
>>> ITEM_ID BIGINT,
>>> CREATED_TIME STRING
>>> )
>>>
>>> And this is the below data in the above table-
>>>
>>> BUYER_ID | ITEM_ID | CREATED_TIME
>>> ------------+------------------+-----------------------
>>> 1015826235 220003038067 2012-07-09 19:40:21,
>>> 1015826235 300003861266 2012-07-09 18:19:59,
>>> 1015826235 140002997245 2012-07-09 09:23:17,
>>> 1015826235 210002448035 2012-07-09 22:21:11,
>>> 1015826235 260003553381 2012-07-09 07:09:56,
>>> 1015826235 260003553382 2012-07-09 19:40:39,
>>> 1015826235 260003553383 2012-07-09 06:58:47,
>>> 1015826235 260003553384 2012-07-09 07:28:47,
>>> 1015826235 260003553385 2012-07-09 08:48:47,
>>> 1015826235 260003553386 2012-07-09 06:38:47,
>>> 1015826235 260003553387 2012-07-09 05:38:47,
>>> 1015826235 260003553388 2012-07-09 04:55:47,
>>> 1015826235 260003553389 2012-07-09 06:54:37,
>>> 34512201 597245693 2012-07-09 16:20:21,
>>> 34512201 8071787728 2012-07-09 15:19:59,
>>> 34512201 5868222883 2012-07-09 08:23:17,
>>> 34512201 2412180494 2012-07-09 22:21:11,
>>> 34512201 2422054205 2012-07-09 06:09:56,
>>> 34512201 1875744030 2012-07-09 19:40:39,
>>> 34512201 5639158173 2012-07-09 06:58:47,
>>> 34512201 5656232360 2012-07-09 07:28:47,
>>> 34512201 959188449 2012-07-09 08:48:47,
>>> 34512201 4645350592 2012-07-09 06:38:47,
>>> 34512201 5657320532 2012-07-09 05:38:47,
>>> 34512201 290419656539 2012-07-09 04:55:47,
>>>
>>> If you see the above data in the table, there are only two UNIQUE
>>> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
>>> I need only 10 latest record basis on the time for that given day for each
>>> `BUYER_ID`.
>>>
>>> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
>>> `CREATED_TIME` for that given day only, it means for today's date I need 10
>>> latest record for each `BUYER_ID`.
>>>
>>> And each `BUYER_ID` can have any day's data. But I am specifically
>>> interested for day before today's data(means yesterday's date always) by
>>> checking at the `CREATED_TIME`
>>>
>>> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>>>
>>> Sample Output.
>>>
>>> BUYER_ID | ITEM_ID | CREATED_TIME
>>> ------------+------------------+-----------------------
>>> 34512201 2412180494 2012-07-09 22:21:11
>>> 34512201 1875744030 2012-07-09 19:40:39
>>> 34512201 597245693 2012-07-09 16:20:21
>>> 34512201 8071787728 2012-07-09 15:19:59
>>> 34512201 959188449 2012-07-09 08:48:47
>>> 34512201 5868222883 2012-07-09 08:23:17
>>> 34512201 5656232360 2012-07-09 07:28:47
>>> 34512201 5639158173 2012-07-09 06:58:47
>>> 34512201 4645350592 2012-07-09 06:38:47
>>> 34512201 2422054205 2012-07-09 06:09:56
>>> 1015826235 210002448035 2012-07-09 22:21:11
>>> 1015826235 260003553382 2012-07-09 19:40:39
>>> 1015826235 220003038067 2012-07-09 19:40:21
>>> 1015826235 300003861266 2012-07-09 18:19:59
>>> 1015826235 140002997245 2012-07-09 09:23:17
>>> 1015826235 260003553385 2012-07-09 08:48:47
>>> 1015826235 260003553384 2012-07-09 07:28:47
>>> 1015826235 260003553381 2012-07-09 07:09:56
>>> 1015826235 260003553383 2012-07-09 06:58:47
>>> 1015826235 260003553389 2012-07-09 06:54:37
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>>>
>>>> **
>>>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>>>
>>>> 2012-07-10
>>>> ------------------------------
>>>> **
>>>> Best Regards
>>>> Andes
>>>>
>>>> **
>>>> ------------------------------
>>>> *发件人:*Raihan Jamal
>>>> *发送时间:*2012-07-10 10:31
>>>> *主题:*Find TOP 10 using HiveQL
>>>> *收件人:*"user"<us...@hive.apache.org>
>>>> *抄送:*
>>>>
>>>> When I run this query,
>>>>
>>>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>>>> DESC;
>>>>
>>>>
>>>> I always get error as-
>>>>
>>>> *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
>>>> expression*
>>>>
>>>> Is there any way around to use TOP 10 or something similar that will
>>>> work in HiveQL?
>>>>
>>>>
>>>>
>>>> *Raihan Jamal*
>>>>
>>>>
>>>
>>
>>
>>
>>
>>
>
--
Nitin Pawar
Re: Find TOP 10 using HiveQL
Posted by Raihan Jamal <ja...@gmail.com>.
Problem with that approach is, with LIMIT 10, If I am putting after desc,
then it will get only 10 rows irrespective of BUYER_ID. But I need
specifically for each BUYER_ID 10 latest rows.
*Raihan Jamal*
On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <
abhishektiwari.btech@gmail.com> wrote:
> Raihan,
>
> Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
> results in descending order and then chop off any more than first 10
> records.
> However, Hive will still run the full scan on data since it has to find
> the sorted list of records for you.
>
>
> On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com>wrote:
>
>> This is my below requirement. I need- *Find `TOP 10` data for each
>> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>>
>> This is the below table
>>
>> CREATE TABLE IF NOT EXISTS TestingTable1
>> (
>> BUYER_ID BIGINT,
>> ITEM_ID BIGINT,
>> CREATED_TIME STRING
>> )
>>
>> And this is the below data in the above table-
>>
>> BUYER_ID | ITEM_ID | CREATED_TIME
>> ------------+------------------+-----------------------
>> 1015826235 220003038067 2012-07-09 19:40:21,
>> 1015826235 300003861266 2012-07-09 18:19:59,
>> 1015826235 140002997245 2012-07-09 09:23:17,
>> 1015826235 210002448035 2012-07-09 22:21:11,
>> 1015826235 260003553381 2012-07-09 07:09:56,
>> 1015826235 260003553382 2012-07-09 19:40:39,
>> 1015826235 260003553383 2012-07-09 06:58:47,
>> 1015826235 260003553384 2012-07-09 07:28:47,
>> 1015826235 260003553385 2012-07-09 08:48:47,
>> 1015826235 260003553386 2012-07-09 06:38:47,
>> 1015826235 260003553387 2012-07-09 05:38:47,
>> 1015826235 260003553388 2012-07-09 04:55:47,
>> 1015826235 260003553389 2012-07-09 06:54:37,
>> 34512201 597245693 2012-07-09 16:20:21,
>> 34512201 8071787728 2012-07-09 15:19:59,
>> 34512201 5868222883 2012-07-09 08:23:17,
>> 34512201 2412180494 2012-07-09 22:21:11,
>> 34512201 2422054205 2012-07-09 06:09:56,
>> 34512201 1875744030 2012-07-09 19:40:39,
>> 34512201 5639158173 2012-07-09 06:58:47,
>> 34512201 5656232360 2012-07-09 07:28:47,
>> 34512201 959188449 2012-07-09 08:48:47,
>> 34512201 4645350592 2012-07-09 06:38:47,
>> 34512201 5657320532 2012-07-09 05:38:47,
>> 34512201 290419656539 2012-07-09 04:55:47,
>>
>> If you see the above data in the table, there are only two UNIQUE
>> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
>> I need only 10 latest record basis on the time for that given day for each
>> `BUYER_ID`.
>>
>> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
>> `CREATED_TIME` for that given day only, it means for today's date I need 10
>> latest record for each `BUYER_ID`.
>>
>> And each `BUYER_ID` can have any day's data. But I am specifically
>> interested for day before today's data(means yesterday's date always) by
>> checking at the `CREATED_TIME`
>>
>> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>>
>> Sample Output.
>>
>> BUYER_ID | ITEM_ID | CREATED_TIME
>> ------------+------------------+-----------------------
>> 34512201 2412180494 2012-07-09 22:21:11
>> 34512201 1875744030 2012-07-09 19:40:39
>> 34512201 597245693 2012-07-09 16:20:21
>> 34512201 8071787728 2012-07-09 15:19:59
>> 34512201 959188449 2012-07-09 08:48:47
>> 34512201 5868222883 2012-07-09 08:23:17
>> 34512201 5656232360 2012-07-09 07:28:47
>> 34512201 5639158173 2012-07-09 06:58:47
>> 34512201 4645350592 2012-07-09 06:38:47
>> 34512201 2422054205 2012-07-09 06:09:56
>> 1015826235 210002448035 2012-07-09 22:21:11
>> 1015826235 260003553382 2012-07-09 19:40:39
>> 1015826235 220003038067 2012-07-09 19:40:21
>> 1015826235 300003861266 2012-07-09 18:19:59
>> 1015826235 140002997245 2012-07-09 09:23:17
>> 1015826235 260003553385 2012-07-09 08:48:47
>> 1015826235 260003553384 2012-07-09 07:28:47
>> 1015826235 260003553381 2012-07-09 07:09:56
>> 1015826235 260003553383 2012-07-09 06:58:47
>> 1015826235 260003553389 2012-07-09 06:54:37
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>>
>>> **
>>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>>
>>> 2012-07-10
>>> ------------------------------
>>> **
>>> Best Regards
>>> Andes
>>>
>>> **
>>> ------------------------------
>>> *发件人:*Raihan Jamal
>>> *发送时间:*2012-07-10 10:31
>>> *主题:*Find TOP 10 using HiveQL
>>> *收件人:*"user"<us...@hive.apache.org>
>>> *抄送:*
>>>
>>> When I run this query,
>>>
>>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>>> DESC;
>>>
>>>
>>> I always get error as-
>>>
>>> *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
>>> expression*
>>>
>>> Is there any way around to use TOP 10 or something similar that will
>>> work in HiveQL?
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>
>
>
>
>
>
Re: Find TOP 10 using HiveQL
Posted by Abhishek Tiwari <ab...@gmail.com>.
Raihan,
Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
results in descending order and then chop off any more than first 10
records.
However, Hive will still run the full scan on data since it has to find the
sorted list of records for you.
On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <ja...@gmail.com> wrote:
> This is my below requirement. I need- *Find `TOP 10` data for each
> `BUYER_ID, *So I cannot use LIMIT 10 here in this case.
>
> This is the below table
>
> CREATE TABLE IF NOT EXISTS TestingTable1
> (
> BUYER_ID BIGINT,
> ITEM_ID BIGINT,
> CREATED_TIME STRING
> )
>
> And this is the below data in the above table-
>
> BUYER_ID | ITEM_ID | CREATED_TIME
> ------------+------------------+-----------------------
> 1015826235 220003038067 2012-07-09 19:40:21,
> 1015826235 300003861266 2012-07-09 18:19:59,
> 1015826235 140002997245 2012-07-09 09:23:17,
> 1015826235 210002448035 2012-07-09 22:21:11,
> 1015826235 260003553381 2012-07-09 07:09:56,
> 1015826235 260003553382 2012-07-09 19:40:39,
> 1015826235 260003553383 2012-07-09 06:58:47,
> 1015826235 260003553384 2012-07-09 07:28:47,
> 1015826235 260003553385 2012-07-09 08:48:47,
> 1015826235 260003553386 2012-07-09 06:38:47,
> 1015826235 260003553387 2012-07-09 05:38:47,
> 1015826235 260003553388 2012-07-09 04:55:47,
> 1015826235 260003553389 2012-07-09 06:54:37,
> 34512201 597245693 2012-07-09 16:20:21,
> 34512201 8071787728 2012-07-09 15:19:59,
> 34512201 5868222883 2012-07-09 08:23:17,
> 34512201 2412180494 2012-07-09 22:21:11,
> 34512201 2422054205 2012-07-09 06:09:56,
> 34512201 1875744030 2012-07-09 19:40:39,
> 34512201 5639158173 2012-07-09 06:58:47,
> 34512201 5656232360 2012-07-09 07:28:47,
> 34512201 959188449 2012-07-09 08:48:47,
> 34512201 4645350592 2012-07-09 06:38:47,
> 34512201 5657320532 2012-07-09 05:38:47,
> 34512201 290419656539 2012-07-09 04:55:47,
>
> If you see the above data in the table, there are only two UNIQUE
> `BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
> I need only 10 latest record basis on the time for that given day for each
> `BUYER_ID`.
>
> So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
> `CREATED_TIME` for that given day only, it means for today's date I need 10
> latest record for each `BUYER_ID`.
>
> And each `BUYER_ID` can have any day's data. But I am specifically
> interested for day before today's data(means yesterday's date always) by
> checking at the `CREATED_TIME`
>
> **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
>
> Sample Output.
>
> BUYER_ID | ITEM_ID | CREATED_TIME
> ------------+------------------+-----------------------
> 34512201 2412180494 2012-07-09 22:21:11
> 34512201 1875744030 2012-07-09 19:40:39
> 34512201 597245693 2012-07-09 16:20:21
> 34512201 8071787728 2012-07-09 15:19:59
> 34512201 959188449 2012-07-09 08:48:47
> 34512201 5868222883 2012-07-09 08:23:17
> 34512201 5656232360 2012-07-09 07:28:47
> 34512201 5639158173 2012-07-09 06:58:47
> 34512201 4645350592 2012-07-09 06:38:47
> 34512201 2422054205 2012-07-09 06:09:56
> 1015826235 210002448035 2012-07-09 22:21:11
> 1015826235 260003553382 2012-07-09 19:40:39
> 1015826235 220003038067 2012-07-09 19:40:21
> 1015826235 300003861266 2012-07-09 18:19:59
> 1015826235 140002997245 2012-07-09 09:23:17
> 1015826235 260003553385 2012-07-09 08:48:47
> 1015826235 260003553384 2012-07-09 07:28:47
> 1015826235 260003553381 2012-07-09 07:09:56
> 1015826235 260003553383 2012-07-09 06:58:47
> 1015826235 260003553389 2012-07-09 06:54:37
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
>
>> **
>> hello, you can use "desc" and "limit 10" to filter the top 10.
>>
>> 2012-07-10
>> ------------------------------
>> **
>> Best Regards
>> Andes
>>
>> **
>> ------------------------------
>> *发件人:*Raihan Jamal
>> *发送时间:*2012-07-10 10:31
>> *主题:*Find TOP 10 using HiveQL
>> *收件人:*"user"<us...@hive.apache.org>
>> *抄送:*
>>
>> When I run this query,
>>
>> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
>> DESC;
>>
>>
>> I always get error as-
>>
>> *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
>> expression*
>>
>> Is there any way around to use TOP 10 or something similar that will work
>> in HiveQL?
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>
Re: Find TOP 10 using HiveQL
Posted by Raihan Jamal <ja...@gmail.com>.
This is my below requirement. I need- *Find `TOP 10` data for each
`BUYER_ID, *So I cannot use LIMIT 10 here in this case.
This is the below table
CREATE TABLE IF NOT EXISTS TestingTable1
(
BUYER_ID BIGINT,
ITEM_ID BIGINT,
CREATED_TIME STRING
)
And this is the below data in the above table-
BUYER_ID | ITEM_ID | CREATED_TIME
------------+------------------+-----------------------
1015826235 220003038067 2012-07-09 19:40:21,
1015826235 300003861266 2012-07-09 18:19:59,
1015826235 140002997245 2012-07-09 09:23:17,
1015826235 210002448035 2012-07-09 22:21:11,
1015826235 260003553381 2012-07-09 07:09:56,
1015826235 260003553382 2012-07-09 19:40:39,
1015826235 260003553383 2012-07-09 06:58:47,
1015826235 260003553384 2012-07-09 07:28:47,
1015826235 260003553385 2012-07-09 08:48:47,
1015826235 260003553386 2012-07-09 06:38:47,
1015826235 260003553387 2012-07-09 05:38:47,
1015826235 260003553388 2012-07-09 04:55:47,
1015826235 260003553389 2012-07-09 06:54:37,
34512201 597245693 2012-07-09 16:20:21,
34512201 8071787728 2012-07-09 15:19:59,
34512201 5868222883 2012-07-09 08:23:17,
34512201 2412180494 2012-07-09 22:21:11,
34512201 2422054205 2012-07-09 06:09:56,
34512201 1875744030 2012-07-09 19:40:39,
34512201 5639158173 2012-07-09 06:58:47,
34512201 5656232360 2012-07-09 07:28:47,
34512201 959188449 2012-07-09 08:48:47,
34512201 4645350592 2012-07-09 06:38:47,
34512201 5657320532 2012-07-09 05:38:47,
34512201 290419656539 2012-07-09 04:55:47,
If you see the above data in the table, there are only two UNIQUE
`BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`.
I need only 10 latest record basis on the time for that given day for each
`BUYER_ID`.
So for this `BUYER_ID` - `34512201` I need 10 latest record basis on
`CREATED_TIME` for that given day only, it means for today's date I need 10
latest record for each `BUYER_ID`.
And each `BUYER_ID` can have any day's data. But I am specifically
interested for day before today's data(means yesterday's date always) by
checking at the `CREATED_TIME`
**Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
Sample Output.
BUYER_ID | ITEM_ID | CREATED_TIME
------------+------------------+-----------------------
34512201 2412180494 2012-07-09 22:21:11
34512201 1875744030 2012-07-09 19:40:39
34512201 597245693 2012-07-09 16:20:21
34512201 8071787728 2012-07-09 15:19:59
34512201 959188449 2012-07-09 08:48:47
34512201 5868222883 2012-07-09 08:23:17
34512201 5656232360 2012-07-09 07:28:47
34512201 5639158173 2012-07-09 06:58:47
34512201 4645350592 2012-07-09 06:38:47
34512201 2422054205 2012-07-09 06:09:56
1015826235 210002448035 2012-07-09 22:21:11
1015826235 260003553382 2012-07-09 19:40:39
1015826235 220003038067 2012-07-09 19:40:21
1015826235 300003861266 2012-07-09 18:19:59
1015826235 140002997245 2012-07-09 09:23:17
1015826235 260003553385 2012-07-09 08:48:47
1015826235 260003553384 2012-07-09 07:28:47
1015826235 260003553381 2012-07-09 07:09:56
1015826235 260003553383 2012-07-09 06:58:47
1015826235 260003553389 2012-07-09 06:54:37
*Raihan Jamal*
On Mon, Jul 9, 2012 at 7:56 PM, Andes <yl...@163.com> wrote:
> **
> hello, you can use "desc" and "limit 10" to filter the top 10.
>
> 2012-07-10
> ------------------------------
> **
> Best Regards
> Andes
>
> **
> ------------------------------
> *发件人:*Raihan Jamal
> *发送时间:*2012-07-10 10:31
> *主题:*Find TOP 10 using HiveQL
> *收件人:*"user"<us...@hive.apache.org>
> *抄送:*
>
> When I run this query,
>
> SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
> DESC;
>
>
> I always get error as-
>
> *FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select
> expression*
>
> Is there any way around to use TOP 10 or something similar that will work
> in HiveQL?
>
>
>
> *Raihan Jamal*
>
>
回复: Find TOP 10 using HiveQL
Posted by Andes <yl...@163.com>.
hello, you can use "desc" and "limit 10" to filter the top 10.
2012-07-10
Best Regards
Andes
发件人:Raihan Jamal
发送时间:2012-07-10 10:31
主题:Find TOP 10 using HiveQL
收件人:"user"<us...@hive.apache.org>
抄送:
When I run this query,
SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;
I always get error as-
FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in select expression
Is there any way around to use TOP 10 or something similar that will work in HiveQL?
Raihan Jamal