Posted to user@pig.apache.org by "Wasti, Syed" <md...@hotmail.com> on 2010/08/24 22:12:06 UTC

Group By data

Hi,
I have a very simple script and am seeing very strange behavior: I get
wrong results when running this script from a file, but when I run the same
statements in the Pig grunt shell I get accurate results.

table = LOAD ' sample' USING PigStorage('\t')
        AS (a: long, b: int, c, id: long, e, f, g, h, i: int, j, k, l, m,
            n, o, p, q, s, t, u, v, w, x, date: chararray, z);

gen_table = FOREACH table GENERATE a, id, date;

grp_table = GROUP gen_table BY (a, id);

gen_grp_table = FOREACH grp_table {
    min_creation_date = MIN(gen_table.date);
    max_creation_date = MAX(gen_table.date);
    GENERATE group.id,
        (chararray)(group.a == 1 ? min_creation_date : null) AS first_p_date,
        (chararray)(group.a == 1 ? max_creation_date : null) AS last_p_date,
        (chararray)(group.a == 2 ? min_creation_date : null) AS first_n_date,
        (chararray)(group.a == 2 ? max_creation_date : null) AS last_n_date,
        (chararray)(group.a == 3 ? min_creation_date : null) AS first_t_date,
        (chararray)(group.a == 3 ? max_creation_date : null) AS last_t_date;
};

DUMP gen_grp_table;

Wrong results when running from the script; these dates belong to some other
ids:
(3860,,,2010-03-24 22:49:38,1970-01-01 00:00:00,,)
(3509,,,2010-08-12 04:57:17,2003-05-20 17:02:54,,)
(5096,,,,,2010-08-20 00:43:08,1970-01-01 00:00:00)
(1673,,,,,2010-08-20 02:19:44,1970-01-01 00:00:00)

Expected results, which I get only when running from the grunt shell:
(3860,,,1970-01-01 00:00:00,1970-01-01 00:00:00,,)
(3509,,,2003-05-20 17:02:54,2003-05-20 17:02:54,,)
(5096,,,,,1970-01-01 00:00:00,1970-01-01 00:00:00)
(1673,,,,,1970-01-01 00:00:00,1970-01-01 00:00:00)

Has anyone come across a similar issue? I am using the trunk version of
Pig and am not sure why this happens. Suggestions, please.

Regards
Syed

Re: Group By data

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
One possibility might be a bug in the use of the combiner.
You could try disabling it and seeing if that works.

Regards,
Mridul
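
For anyone trying the suggestion above, here is a minimal sketch of switching the combiner off for a single run. It assumes the Pig build in use honors the pig.exec.nocombiner property and that SET accepts arbitrary properties; neither is confirmed in this thread, so treat both as assumptions to verify against your version.

```pig
-- Assumption: this build reads pig.exec.nocombiner and lets SET pass
-- arbitrary properties through. If SET rejects the key, passing
-- -Dpig.exec.nocombiner=true on the pig command line may work instead.
SET pig.exec.nocombiner true;

-- The rest of the script stays exactly as posted (LOAD, GROUP, FOREACH, DUMP).
```

If the script then produces the same output as the grunt shell, that points at the combiner path rather than at the script itself.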

In reply to the message from "Wasti, Syed" of Wednesday 25 August 2010, 01:42 AM.


Re: Group By data

Posted by "Wasti, Syed" <md...@hotmail.com>.
Thanks Xiaomeng, but no luck.


In reply to the message from "Xiaomeng Wan" of 8/25/10 2:00 PM.



Re: Group By data

Posted by Xiaomeng Wan <sh...@gmail.com>.
I came across a similar problem before. Try breaking

gen_grp_table = FOREACH grp_table {
    min_creation_date = MIN(gen_table.date);
    max_creation_date = MAX(gen_table.date);
    GENERATE group.id,
        (chararray)(group.a == 1 ? min_creation_date : null) AS first_p_date,
        (chararray)(group.a == 1 ? max_creation_date : null) AS last_p_date,
        (chararray)(group.a == 2 ? min_creation_date : null) AS first_n_date,
        (chararray)(group.a == 2 ? max_creation_date : null) AS last_n_date,
        (chararray)(group.a == 3 ? min_creation_date : null) AS first_t_date,
        (chararray)(group.a == 3 ? max_creation_date : null) AS last_t_date;
};

into two steps. First compute the aggregates and flatten the group key:

gen_grp_table = FOREACH grp_table GENERATE
    FLATTEN(group) AS (a, id),
    MIN(gen_table.date) AS min_creation_date,
    MAX(gen_table.date) AS max_creation_date;

and then apply the conditions in a second FOREACH (the alias final_table is
just a placeholder name):

final_table = FOREACH gen_grp_table GENERATE id,
    (chararray)(a == 1 ? min_creation_date : null) AS first_p_date,
    (chararray)(a == 1 ? max_creation_date : null) AS last_p_date,
    (chararray)(a == 2 ? min_creation_date : null) AS first_n_date,
    (chararray)(a == 2 ? max_creation_date : null) AS last_n_date,
    (chararray)(a == 3 ? min_creation_date : null) AS first_t_date,
    (chararray)(a == 3 ? max_creation_date : null) AS last_t_date;

See whether that works.


In reply to the message from "Wasti, Syed" of Wed, Aug 25, 2010 at 2:42 PM.

Re: Group By data

Posted by "Wasti, Syed" <md...@hotmail.com>.
It is older than Aug 9th. I updated my trunk version of Pig, but still had no
luck.


In reply to the message from "Thejas M Nair" of 8/25/10 1:08 PM.



Re: Group By data

Posted by Thejas M Nair <te...@yahoo-inc.com>.
I think this issue may be caused by https://issues.apache.org/jira/browse/PIG-1525. Can you check whether your trunk version of Pig is newer than Aug 9th?
(I haven't tried running the query against the sample yet.)
-Thejas



In reply to the message from "Wasti, Syed" of 8/25/10 11:21 AM.




Re: Group By data

Posted by "Wasti, Syed" <md...@hotmail.com>.
Hi Dmitriy,
Thanks for offering help; attached is the sample data file. From my
observation, it looks like it has something to do with the MIN function on
the grouped data. For the ids where it picks up the wrong date, the date
comes from the previous id in sequence. You should get a better idea when you
see the output data.
Please let me know of your findings.
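
One way to check the observation above is to dump the raw aggregates per group, without the conditional casts, and compare the script run against the grunt run. The sketch below reuses the LOAD and projection from the original script; the alias min_max_check and the columns min_date, max_date and n are hypothetical names added for illustration only.

```pig
-- Same input and projection as the original script.
table = LOAD ' sample' USING PigStorage('\t')
        AS (a: long, b: int, c, id: long, e, f, g, h, i: int, j, k, l, m,
            n, o, p, q, s, t, u, v, w, x, date: chararray, z);
gen_table = FOREACH table GENERATE a, id, date;
grp_table = GROUP gen_table BY (a, id);

-- Hypothetical diagnostic relation: one row per (a, id) group with the
-- plain MIN/MAX of date and the number of rows in the group.
min_max_check = FOREACH grp_table GENERATE
    FLATTEN(group) AS (a, id),
    MIN(gen_table.date) AS min_date,
    MAX(gen_table.date) AS max_date,
    COUNT(gen_table) AS n;

DUMP min_max_check;
```

If min_date already differs between the two runs here, the aggregation step is the likely culprit; if it matches, the conditional projection in the nested FOREACH is the next thing to look at.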



In reply to the message from "Dmitriy Ryaboy" of 8/24/10 11:33 PM.


Re: Group By data

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Could you send sample data that would allow us to reproduce this error?

-Dmitriy

In reply to the message from "Wasti, Syed" of Tue, Aug 24, 2010 at 1:12 PM.

Re: Group By data

Posted by "Wasti, Syed" <md...@hotmail.com>.
When I use the ILLUSTRATE operator, I get the error below.

ILLUSTRATE gen_grp_table;

java.lang.NullPointerException
    at org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205)
    at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:86)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:80)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:80)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:80)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:80)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:80)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:69)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:143)
    at org.apache.pig.PigServer.getExamples(PigServer.java:1063)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:610)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:296)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
    at org.apache.pig.Main.run(Main.java:411)
    at org.apache.pig.Main.main(Main.java:103)



In reply to the message from "Wasti, Syed" of 8/24/10 1:12 PM.