Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2010/08/05 03:01:16 UTC

[jira] Created: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

Column pruner causes wrong results when using both Custom Store UDF and PigStorage
----------------------------------------------------------------------------------

                 Key: PIG-1537
                 URL: https://issues.apache.org/jira/browse/PIG-1537
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Viraj Bhat


I have a script of this pattern, and it uses 2 StoreFuncs:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
                        a#'id' matches '1.*' OR
                        a#'id' matches '2.*' OR
                        a#'id' matches '3.*' OR
                        a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
                        a#'id' matches '65.*' OR
                        a#'id' matches '466.*' OR
                        a#'id' matches '043.*' OR
                        a#'id' matches '044.*' OR
                        a#'id' matches '0650.*' OR
                        a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
                        a#'query' as query,
                        a#'testid' as testid,
                        a#'timestamp' as timestamp,
                        a,
                        b,
                        c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records, but the output of "ss_sc_all_map" has different sizes when run with the above 2 options.

Is this due to the fact that 2 StoreFuncs are used?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-1537:
----------------------------

    Description: 
I have a script of this pattern, and it uses 2 StoreFuncs:

{code}
register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
                        a#'id' matches '1.*' OR
                        a#'id' matches '2.*' OR
                        a#'id' matches '3.*' OR
                        a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
                        a#'id' matches '65.*' OR
                        a#'id' matches '466.*' OR
                        a#'id' matches '043.*' OR
                        a#'id' matches '044.*' OR
                        a#'id' matches '0650.*' OR
                        a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
                        a#'query' as query,
                        a#'testid' as testid,
                        a#'timestamp' as timestamp,
                        a,
                        b,
                        c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records, but the output of "ss_sc_all_map" has different sizes when run with the above 2 options.

Is this due to the fact that 2 StoreFuncs are used?

Viraj


[jira] Resolved: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-1537.
---------------------------------

    Resolution: Fixed


[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1537:
--------------------------------

         Assignee: Daniel Dai
    Fix Version/s: 0.8.0

Daniel, can we test whether this is a problem with 0.8?

Viraj, is this data-specific, and if so, can you provide data to reproduce it? Also, do you know which run produces correct results?


[jira] Commented: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895858#action_12895858 ] 

Viraj Bhat commented on PIG-1537:
---------------------------------

Hi Olga, I have given the specific script with UDFs to Daniel to test. Thanks, Daniel, for your help.
Running the script without the Column Pruner optimization, that is, disabling it using -t, gives correct results.
Viraj
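
One way to quantify the size difference reported for "ss_sc_all_map" is to total the bytes of the part files each run wrote. The following is a minimal sketch in Python, not part of the original report. It assumes both output directories have already been copied to the local filesystem (for example with hadoop fs -get), and the function names total_part_bytes and compare_runs are hypothetical helpers introduced here for illustration. It only compares byte sizes, since the on-disk format of the custom Storage() UDF is not described in this thread.

```python
import os


def total_part_bytes(output_dir):
    """Sum the sizes of the part-* files directly under output_dir."""
    total = 0
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        # Job results are written as part-* files; skip anything else
        # (for example log directories) that sits next to them.
        if name.startswith("part-") and os.path.isfile(path):
            total += os.path.getsize(path)
    return total


def compare_runs(dir_with_pruning, dir_without_pruning):
    """Return (bytes_with, bytes_without, difference) for two local copies.

    Both arguments are assumed to be local copies of the STORE output
    directory, one from each of the two runs being compared.
    """
    with_pruning = total_part_bytes(dir_with_pruning)
    without_pruning = total_part_bytes(dir_without_pruning)
    return with_pruning, without_pruning, with_pruning - without_pruning
```

For example, compare_runs("run_a/data/20100707", "run_b/data/20100707") would return the two totals and their difference, where run_a and run_b are hypothetical local directories holding the output of option a) and option b) respectively. A nonzero difference alongside an identical record count would match the symptom described above.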
