You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Cyril Scetbon <cy...@free.fr> on 2009/07/24 09:32:53 UTC

Pig does not order data correctly

Hi,

I've imported data from a MySQL db thanks to sqoop. However when I try 
to order this data on 2 fields it does not return the same answer as 
MySQL does (which is the correct result)

Here is the code I use :

grunt> A = LOAD 'hdfs://hadoopM:54310/user/hadoop/rental' USING 
PigStorage(',') AS (rental_id:int, rental_date:chararray, 
inventory_id:int, customer_id:int);
grunt> B = ORDER A BY inventory_id DESC, customer_id ASC;
grunt> C = LIMIT B 20;
grunt> DUMP C;

Here is the result with Pig :

(132,2005-05-25 21:46:54.0,3367,479)
(263,2005-05-26 15:47:40.0,1160,449)
(324,2005-05-27 01:00:04.0,3364,292)
(359,2005-05-27 06:48:33.0,1156,152)
(582,2005-05-28 11:33:46.0,4579,198)
(711,2005-05-29 03:49:03.0,4581,215)
(809,2005-05-29 19:10:20.0,2114,222)
(927,2005-05-30 12:16:40.0,1158,167)
(1084,2005-05-31 11:10:17.0,4577,12)
(1341,2005-06-15 12:26:18.0,3363,344)
(1493,2005-06-15 21:50:32.0,4581,235)
(1537,2005-06-16 00:52:51.0,4577,594)
(1625,2005-06-16 07:49:08.0,3367,39)
(1729,2005-06-16 15:29:47.0,3364,523)
(1945,2005-06-17 07:51:26.0,3366,207)
(2137,2005-06-17 21:18:28.0,1158,581)
(2149,2005-06-17 22:50:00.0,3365,333)
(2321,2005-06-18 09:42:42.0,1160,565)
(2799,2005-06-19 19:15:21.0,4579,576)
(2806,2005-06-19 19:30:48.0,2114,510)

Here is the result with MySQL :

mysql> select rental_id, rental_date, inventory_id, customer_id from 
rental order by inventory_id desc , customer_id asc limit 20

+-----------+---------------------+--------------+-------------+
| rental_id | rental_date | inventory_id | customer_id |
+-----------+---------------------+--------------+-------------+
| 711 | 2005-05-29 03:49:03 | 4581 | 215 |
| 6712 | 2005-07-12 13:24:47 | 4581 | 226 |
| 1493 | 2005-06-15 21:50:32 | 4581 | 235 |
| 9701 | 2005-07-31 07:32:21 | 4581 | 401 |
| 12894 | 2005-08-19 03:49:28 | 4581 | 541 |
| 10479 | 2005-08-01 10:11:25 | 4580 | 275 |
| 15916 | 2005-08-23 17:56:01 | 4580 | 327 |
| 5274 | 2005-07-09 14:34:09 | 4579 | 108 |
| 582 | 2005-05-28 11:33:46 | 4579 | 198 |
| 12458 | 2005-08-18 11:22:53 | 4579 | 277 |
| 8289 | 2005-07-29 02:23:24 | 4579 | 459 |
| 2799 | 2005-06-19 19:15:21 | 4579 | 576 |
| 11453 | 2005-08-02 21:00:05 | 4578 | 84 |
| 12456 | 2005-08-18 11:21:51 | 4578 | 85 |
| 6664 | 2005-07-12 11:28:22 | 4578 | 351 |
| 1084 | 2005-05-31 11:10:17 | 4577 | 12 |
| 5972 | 2005-07-11 00:08:54 | 4577 | 30 |
| 12854 | 2005-08-19 02:18:51 | 4577 | 362 |
| 9644 | 2005-07-31 05:40:35 | 4577 | 441 |
| 1537 | 2005-06-16 00:52:51 | 4577 | 594 |
+-----------+---------------------+--------------+-------------+
20 rows in set (3.10 sec)

thanks

-- 
Cyril SCETBON


Re: Pig does not order data correctly

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

I seen this happen in past versions of pig. What version are you using?
There were a few issues with order by followed by limit that were fixed in
recent weeks:
https://issues.apache.org/jira/browse/PIG-797
https://issues.apache.org/jira/browse/PIG-876

With latest SVN it should work as expected (it does for me). You can try it
...

Tamir


On Fri, Jul 24, 2009 at 10:32 AM, Cyril Scetbon <cy...@free.fr>wrote:

> Hi,
>
> I've imported data from a MySQL db thanks to sqoop. However when I try to
> order this data on 2 fields it does not return the same answer as MySQL does
> (which is the correct result)
>
> Here is the code I use :
>
> grunt> A = LOAD 'hdfs://hadoopM:54310/user/hadoop/rental' USING
> PigStorage(',') AS (rental_id:int, rental_date:chararray, inventory_id:int,
> customer_id:int);
> grunt> B = ORDER A BY inventory_id DESC, customer_id ASC;
> grunt> C = LIMIT B 20;
> grunt> DUMP C;
>
> Here is the result with Pig :
>
> (132,2005-05-25 21:46:54.0,3367,479)
> (263,2005-05-26 15:47:40.0,1160,449)
> (324,2005-05-27 01:00:04.0,3364,292)
> (359,2005-05-27 06:48:33.0,1156,152)
> (582,2005-05-28 11:33:46.0,4579,198)
> (711,2005-05-29 03:49:03.0,4581,215)
> (809,2005-05-29 19:10:20.0,2114,222)
> (927,2005-05-30 12:16:40.0,1158,167)
> (1084,2005-05-31 11:10:17.0,4577,12)
> (1341,2005-06-15 12:26:18.0,3363,344)
> (1493,2005-06-15 21:50:32.0,4581,235)
> (1537,2005-06-16 00:52:51.0,4577,594)
> (1625,2005-06-16 07:49:08.0,3367,39)
> (1729,2005-06-16 15:29:47.0,3364,523)
> (1945,2005-06-17 07:51:26.0,3366,207)
> (2137,2005-06-17 21:18:28.0,1158,581)
> (2149,2005-06-17 22:50:00.0,3365,333)
> (2321,2005-06-18 09:42:42.0,1160,565)
> (2799,2005-06-19 19:15:21.0,4579,576)
> (2806,2005-06-19 19:30:48.0,2114,510)
>
> Here is the result with MySQL :
>
> mysql> select rental_id, rental_date, inventory_id, customer_id from rental
> order by inventory_id desc , customer_id asc limit 20
>
> +-----------+---------------------+--------------+-------------+
> | rental_id | rental_date | inventory_id | customer_id |
> +-----------+---------------------+--------------+-------------+
> | 711 | 2005-05-29 03:49:03 | 4581 | 215 |
> | 6712 | 2005-07-12 13:24:47 | 4581 | 226 |
> | 1493 | 2005-06-15 21:50:32 | 4581 | 235 |
> | 9701 | 2005-07-31 07:32:21 | 4581 | 401 |
> | 12894 | 2005-08-19 03:49:28 | 4581 | 541 |
> | 10479 | 2005-08-01 10:11:25 | 4580 | 275 |
> | 15916 | 2005-08-23 17:56:01 | 4580 | 327 |
> | 5274 | 2005-07-09 14:34:09 | 4579 | 108 |
> | 582 | 2005-05-28 11:33:46 | 4579 | 198 |
> | 12458 | 2005-08-18 11:22:53 | 4579 | 277 |
> | 8289 | 2005-07-29 02:23:24 | 4579 | 459 |
> | 2799 | 2005-06-19 19:15:21 | 4579 | 576 |
> | 11453 | 2005-08-02 21:00:05 | 4578 | 84 |
> | 12456 | 2005-08-18 11:21:51 | 4578 | 85 |
> | 6664 | 2005-07-12 11:28:22 | 4578 | 351 |
> | 1084 | 2005-05-31 11:10:17 | 4577 | 12 |
> | 5972 | 2005-07-11 00:08:54 | 4577 | 30 |
> | 12854 | 2005-08-19 02:18:51 | 4577 | 362 |
> | 9644 | 2005-07-31 05:40:35 | 4577 | 441 |
> | 1537 | 2005-06-16 00:52:51 | 4577 | 594 |
> +-----------+---------------------+--------------+-------------+
> 20 rows in set (3.10 sec)
>
> thanks
>
> --
> Cyril SCETBON
>
>