Posted to dev@pig.apache.org by "Vincent BARAT (JIRA)" <ji...@apache.org> on 2011/09/09 11:37:08 UTC

[jira] [Created] (PIG-2271) PIG regression (in BinStorage?) between 0.8.1 and 0.9.x

PIG regression (in BinStorage?) between 0.8.1 and 0.9.x
-------------------------------------------------------

                 Key: PIG-2271
                 URL: https://issues.apache.org/jira/browse/PIG-2271
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Vincent BARAT


I'm using the 0.9.x branch (tested on 2011-09-07).

I have a UDF that takes a bag as input:

{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}

My input data are read from a text file 'activity' (same issue when they are read from HBase):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities;
{code}

N.B. the name of the first activity is correctly set to null in my UDF function.

The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).

Second script, which triggers an error (this script works correctly with PIG 0.8.1):

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities;
{code}

In this script, when MyUDF is called, activityBag is null, and a warning is issued:

{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
{code}

I guess that the regression is located in BinStorage...
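
The warning above shows the cast failing on a tuple whose last field is empty, `(1239698069000,)`. The sketch below is not Pig's converter code and makes no claim about the actual cause of this regression; it only illustrates, in plain Java, one general way a text parser can lose a trailing empty field. The names `TupleTextSketch`, `parseFields`, and `naiveFieldCount` are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Pig: splits tuple text such as "(1239698069000,)" into fields. */
final class TupleTextSketch {

    /** Strips the surrounding parentheses and splits on commas, keeping trailing empty fields. */
    static List<String> parseFields(String tupleText) {
        String inner = tupleText.substring(1, tupleText.length() - 1);
        // A negative limit keeps trailing empty strings; plain split(",") would drop them.
        String[] raw = inner.split(",", -1);
        List<String> fields = new ArrayList<>();
        for (String field : raw) {
            // In this sketch, empty text stands for a null field.
            fields.add(field.isEmpty() ? null : field);
        }
        return fields;
    }

    /** Same input, but with the default split, which silently discards the empty trailing field. */
    static int naiveFieldCount(String tupleText) {
        String inner = tupleText.substring(1, tupleText.length() - 1);
        return inner.split(",").length;
    }

    public static void main(String[] args) {
        System.out.println(parseFields("(1239698069000,)"));     // [1239698069000, null]
        System.out.println(naiveFieldCount("(1239698069000,)")); // 1
        System.out.println(parseFields("(1239698505000,b)"));    // [1239698505000, b]
    }
}
```

With the limit-aware split, the first tuple keeps two fields (the second one null), which matches the two-field schema `(timestamp:long, name:chararray)`; the naive split sees only one field.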




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2271:
-------------------------------

    Affects Version/s: 0.10
    
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.10
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity
>
>
> I'm using the 0.9.1 official release.
> My input data are read from a text file 'activity' (provided as attachment):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> My script is working correctly:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING PigStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> After running this script, the 'output' file contains a correct result:
> {code}
> 00	{(1239698069000,)}
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> But the issue occurs when I use BinStorage() instead of PigStorage() to store / reload my temporary files. The 'output' file in that case is not complete:
> {code}
> 00	
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> The script that does not work is the following:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING BinStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> So the issue seems to be located in the way BinStorage() stores or loads bags.


        

[jira] [Updated] (PIG-2271) PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x

Posted by "Vincent BARAT (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Attachment: activity
    
> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity
>
>
> I'm using the 0.9.1 official release.
> My input data are read from a text file 'activity' (provided as attachment):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> My first script is working correctly:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp1' USING PigStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp1' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in another temporary file
> STORE activities INTO 'tmp2' USING PigStorage();
> {code}
> The issue occurs when I use BinStorage() or PigStorage(',') instead of PigStorage() to store / reload my temporary files.


        

[jira] [Assigned] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair reassigned PIG-2271:
----------------------------------

    Assignee: Thejas M Nair
    
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.10
>            Reporter: Vincent BARAT
>            Assignee: Thejas M Nair
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity


        

[jira] [Updated] (PIG-2271) PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x

Posted by "Vincent BARAT (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Description: 
I'm using the 0.9.1 official release.

My input data are read from a text file 'activity' (provided as attachment):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp1' USING PigStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp1' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });

-- store grouped activities again in another temporary file
STORE activities INTO 'tmp2' USING PigStorage();
{code}

The issue occurs when I use BinStorage() or PigStorage(',') instead of PigStorage() to store / reload my temporary files.


  was:
I'm using the 0.9.1 official release.

I have a UDF that takes a bag as input:

{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}

My input data are read from a text file 'activity' (same issue when they are read from HBase):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities into 'output';
{code}

N.B. the name of the first activity is correctly set to null in my UDF function.

The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).

Second script, which triggers an error (this script works correctly with PIG 0.8.1):

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities into 'output';
{code}

In this script, when MyUDF is called, activityBag is null, and a warning is issued:

{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
{code}

I guess that the regression is located in BinStorage...


       Priority: Critical  (was: Major)
        Summary: PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x  (was: PIG regression (in BinStorage?) between 0.8.1 and 0.9.x)
    
> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity


        

[jira] [Updated] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Vincent BARAT (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Summary: PIG regression in BinStorage/PigStorage in 0.9.1  (was: PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x)
    
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity


        

[jira] [Updated] (PIG-2271) PIG regression (in BinStorage?) between 0.8.1 and 0.9.x

Posted by "Vincent BARAT (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

          Description: 
I'm using the 0.9.1 official release.

I have a UDF that takes a bag as input:

{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}

My input data are read from a text file 'activity' (same issue when they are read from HBase):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities into 'output';
{code}

N.B. the name of the first activity is correctly set to null in my UDF function.

The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).

Second script, which triggers an error (this script works correctly with PIG 0.8.1):

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities into 'output';
{code}

In this script, when MyUDF is called, activityBag is null, and a warning is issued:

{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
{code}

I guess that the regression is located in BinStorage...


  was:
I'm using the 0.9.x branch (tested on 2011-09-07).

I have a UDF that takes a bag as input:

{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}

My input data are read from a text file 'activity' (same issue when they are read from HBase):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities into 'output';
{code}

N.B. the name of the first activity is correctly set to null in my UDF function.

The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).

Second script, which triggers an error (this script works correctly with PIG 0.8.1):

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities into 'output';
{code}

In this script, when MyUDF is called, activityBag is null, and a warning is issued:

{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
{code}

I guess that the regression is located in BinStorage...


    Affects Version/s:     (was: 0.9.0)
                       0.9.1
    
> PIG regression (in BinStorage?) between 0.8.1 and 0.9.x
> -------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT


        

[jira] [Resolved] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair resolved PIG-2271.
--------------------------------

      Resolution: Fixed
    Release Note: patch committed to 0.9 branch and trunk
    
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.10
>            Reporter: Vincent BARAT
>            Assignee: Thejas M Nair
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity
>
>
> I'm using the 0.9.1 official release.
> My input data are read form a text file 'activity' (provided as attachment):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> My script is working correctly:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING PigStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> After running this script, the 'output' file contains a correct result:
> {code}
> 00	{(1239698069000,)}
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> But the issue occurs when I use BinStorage() instead of PigStorage() to store / reload my temporary files. The 'output' file in that case is not complete:
> {code}
> 00	
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> The failing script is the following:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING BinStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> So the issue seems to be located in the way BinStorage() stores or loads bags.


        

[jira] [Commented] (PIG-2271) PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x

Posted by "Vincent BARAT (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122029#comment-13122029 ] 

Vincent BARAT commented on PIG-2271:
------------------------------------

Hi Daniel,

I investigated further and fully reformulated the issue. No UDF is involved any more, and I can reproduce it with the official 0.9.1 release.

The issue is related to BinStorage (but I can also reproduce it using PigStorage(',')).

This is a blocking issue for me, as I need to use BinStorage() to load some binary data. It prevents me from using Pig 0.9.1.

Thanks a lot for your time.
                
> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity
>
>


        

[jira] [Updated] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2271:
-------------------------------

    Attachment: PIG-2271.0.patch

PIG-2271.0.patch - initial patch. Test cases need to be added. 

When a user-specified schema was present, the type conversion did not handle nulls correctly, which resulted in a cast failure. As a result, the type conversion for the tuple that contained a null was not successful.
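
To illustrate the class of failure described above, here is a minimal, self-contained sketch. It is not Pig's actual conversion code: the class NullAwareCastSketch, the FieldType enum, and both cast methods are hypothetical names invented for this example, and a tuple is modelled as a plain List<Object>. It only shows how a field-by-field conversion against an inner schema such as (timestamp:long, name:chararray) can fail on a null field unless nulls are passed through explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullAwareCastSketch {

    /** Hypothetical stand-in for the field types of an inner schema. */
    enum FieldType { LONG, CHARARRAY }

    /** Null-unaware conversion: dereferences every field, so a null field throws. */
    static List<Object> castStrict(List<Object> tuple, List<FieldType> schema) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < schema.size(); i++) {
            // tuple.get(i) may be null; toString() then throws NullPointerException
            out.add(convert(tuple.get(i).toString(), schema.get(i)));
        }
        return out;
    }

    /** Null-aware conversion: a null field is kept as null instead of being converted. */
    static List<Object> castNullAware(List<Object> tuple, List<FieldType> schema) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < schema.size(); i++) {
            Object field = tuple.get(i);
            out.add(field == null ? null : convert(field.toString(), schema.get(i)));
        }
        return out;
    }

    /** Converts the textual form of a single non-null field to the requested type. */
    private static Object convert(String raw, FieldType type) {
        switch (type) {
            case LONG:
                return Long.parseLong(raw);
            default:
                return raw;
        }
    }

    public static void main(String[] args) {
        // Models the tuple (1239698069000,) from the 'activity' file, whose name is null.
        List<FieldType> schema = Arrays.asList(FieldType.LONG, FieldType.CHARARRAY);
        List<Object> tuple = Arrays.<Object>asList(1239698069000L, null);

        System.out.println(castNullAware(tuple, schema));
        try {
            castStrict(tuple, schema);
        } catch (NullPointerException e) {
            System.out.println("strict cast failed on the null field");
        }
    }
}
```

In this sketch the strict variant loses the whole tuple because of one null field, which mirrors the symptom in the report (the bag for sid 00 disappears), while the null-aware variant keeps the tuple and leaves the name as null.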
                
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, activity
>
>


        

[jira] [Commented] (PIG-2271) PIG regression (in BinStorage?) between 0.8.1 and 0.9.x

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101714#comment-13101714 ] 

Daniel Dai commented on PIG-2271:
---------------------------------

Can you try the following:
1. Get the output schema for MyUDF. (describe activities)
2. Use a different constructor for BinStorage: BinStorage('org.apache.pig.builtin.Utf8StorageConverter')

> PIG regression (in BinStorage?) between 0.8.1 and 0.9.x
> -------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Vincent BARAT
>


        

[jira] [Updated] (PIG-2271) PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x

Posted by "Vincent BARAT (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Description: 
I'm using the 0.9.1 official release.

My input data are read from a text file 'activity' (provided as an attachment):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My script is working correctly:

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp' USING PigStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });

-- store grouped activities again in an output file
STORE activities INTO 'output' USING PigStorage();
{code}

After running this script, the 'output' file contains a correct result:

{code}
00	{(1239698069000,)}
01	{(1239698505000,b),(1239698369000,a)}
02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
04	{(1239698417000,c)}
{code}

But the issue occurs when I use BinStorage() instead of PigStorage() to store / reload my temporary files. The 'output' file in that case is not complete:

{code}
00	
01	{(1239698505000,b),(1239698369000,a)}
02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
04	{(1239698417000,c)}
{code}

The failing script is the following:

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp' USING BinStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });

-- store grouped activities again in an output file
STORE activities INTO 'output' USING PigStorage();
{code}

So the issue seems to be located in the way BinStorage() stores or loads bags.


  was:
I'm using the 0.9.1 official release.

My input data are read from a text file 'activity' (provided as an attachment):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp1' USING PigStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp1' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });

-- store grouped activities again in another temporary file
STORE activities INTO 'tmp2' USING PigStorage();
{code}

The issue occurs when I use BinStorage() or PigStorage(',') instead of PigStorage() to store / reload my temporary files.


    
> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity
>
>


        

[jira] [Commented] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122305#comment-13122305 ] 

Thejas M Nair commented on PIG-2271:
------------------------------------

I would like to clarify that the type conversion fails only when the user casts a complex type (tuple/bag/map) using a schema that has an inner schema, and one of the values inside is null.

                
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.10
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity
>
>


        

[jira] [Commented] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124549#comment-13124549 ] 

Daniel Dai commented on PIG-2271:
---------------------------------

+1
                
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.10
>            Reporter: Vincent BARAT
>            Assignee: Thejas M Nair
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity
>
>


        

[jira] [Updated] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2271:
-------------------------------

    Attachment: PIG-2271.1.patch

PIG-2271.1.patch - patch with test cases.

                
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity
>
>
> I'm using the 0.9.1 official release.
> My input data are read form a text file 'activity' (provided as attachment):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> My script is working correctly:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING PigStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> After running this script, the 'output' file contains a correct result:
> {code}
> 00	{(1239698069000,)}
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> But the issue occurs when I use BinStorage() instead of PigStorage() to store / reload my temporary files. The 'output' file in that case is not complete:
> {code}
> 00	
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> The not working script is the following:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING BinStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> So the issue seems to lie in the way BinStorage() stores or loads bags.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2271) PIG regression (in BinStorage?) between 0.8.1 and 0.9.x

Posted by "Vincent BARAT (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Description: 
I'm using the 0.9.x branch (tested at 2011-09-07).

I have a UDF that takes a bag as input:

{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}

My input data are read from a text file 'activity' (the same issue occurs when they are read from HBase):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script works correctly:

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities into 'output';
{code}

N.B. The name of the first activity is correctly set to null in my UDF.

The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).

The second script triggers an error (this script works correctly with PIG 0.8.1):

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities into 'output';
{code}

In this script, when MyUDF is called, activityBag is null, and a warning is issued:

{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
{code}
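
For illustration, the value quoted in the warning, {(1239698069000,)}, is a bag holding a single tuple whose name field is empty. The first script shows that this empty field is expected to reach the UDF as null. The standalone Java sketch below shows that expected interpretation. It is not Pig code and does not reproduce Pig's own conversion logic: BagTextSketch, parseBag and parseTuple are hypothetical names, and the sketch assumes flat (timestamp, name) tuples with no commas or parentheses inside field values.

```java
import java.util.ArrayList;
import java.util.List;

class BagTextSketch {

    // Parse one tuple such as "(1239698069000,)" into {timestamp, name}.
    // An empty field is mapped to null.
    static Object[] parseTuple(String text) {
        String body = text.substring(1, text.length() - 1); // strip "(" and ")"
        String[] fields = body.split(",", -1);               // -1 keeps a trailing empty field
        Long timestamp = fields[0].isEmpty() ? null : Long.valueOf(fields[0]);
        String name = (fields.length < 2 || fields[1].isEmpty()) ? null : fields[1];
        return new Object[] { timestamp, name };
    }

    // Parse a bag such as "{(1,a),(2,b)}" into a list of tuples.
    static List<Object[]> parseBag(String text) {
        List<Object[]> tuples = new ArrayList<>();
        String body = text.substring(1, text.length() - 1); // strip "{" and "}"
        if (body.isEmpty()) {
            return tuples;
        }
        // Split only on the comma that sits between ")" and "(".
        for (String tuple : body.split("(?<=\\)),(?=\\()")) {
            tuples.add(parseTuple(tuple));
        }
        return tuples;
    }

    public static void main(String[] args) {
        // The value from the warning: one tuple, empty name.
        for (Object[] t : parseBag("{(1239698069000,)}")) {
            System.out.println(t[0] + " / " + t[1]); // prints "1239698069000 / null"
        }
    }
}
```

Under these assumptions the bag for sid 00 contains one tuple with a null name, rather than being discarded as a whole.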

I guess that the regression is located in BinStorage.


  was:
I'm using the 0.9.x branch (tested at 2011-09-07).

I have a UDF that takes a bag as input:

{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}

My input data are read from a text file 'activity' (the same issue occurs when they are read from HBase):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script works correctly:

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities;
{code}

N.B. The name of the first activity is correctly set to null in my UDF.

The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).

The second script triggers an error (this script works correctly with PIG 0.8.1):

{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities;
{code}

In this script, when MyUDF is called, activityBag is null, and a warning is issued:

{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
{code}

I guess that the regression is located in BinStorage.




    
> PIG regression (in BinStorage?) between 0.8.1 and 0.9.x
> -------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Vincent BARAT
>
> I'm using the 0.9.x branch (tested at 2011-09-07).
> I have a UDF that takes a bag as input:
> {code}
> public DataBag exec(Tuple input) throws IOException
> {
> /* Get the activity bag */
> DataBag activityBag = (DataBag) input.get(0);
> ...
> {code}
> My input data are read from a text file 'activity' (the same issue occurs when they are read from HBase):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> My first script works correctly:
> {code}
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
> store activities into 'output';
> {code}
> N.B. The name of the first activity is correctly set to null in my UDF.
> The issue occurs when I store my data into a binary file and reload them before processing (I do this to improve the computation time, since HDFS is much faster than HBase).
> The second script triggers an error (this script works correctly with PIG 0.8.1):
> {code}
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> STORE activities INTO 'activities' USING BinStorage;
> activities = LOAD 'activities' USING BinStorage AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
> activities = FOREACH activities GENERATE sid, MyUDF(activities);
> store activities into 'output';
> {code}
> In this script, when MyUDF is called, activityBag is null, and a warning is issued:
> {code}
> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: Unable to interpret value {(1239698069000,)} in field being converted to type bag, caught ParseException <Cannot convert (1239698069000,) to null:(timestamp:long,name:chararray)> field discarded
> {code}
> I guess that the regression is located in BinStorage.


        

[jira] [Updated] (PIG-2271) PIG regression in BinStorage/PigStorage in 0.9.1

Posted by "Thejas M Nair (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2271:
-------------------------------

    Fix Version/s: 0.9.2
                   0.10
    
> PIG regression in BinStorage/PigStorage in 0.9.1
> ------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1, 0.10
>            Reporter: Vincent BARAT
>            Assignee: Thejas M Nair
>            Priority: Critical
>             Fix For: 0.10, 0.9.2
>
>         Attachments: PIG-2271.0.patch, PIG-2271.1.patch, activity
>
>
> I'm using the 0.9.1 official release.
> My input data are read from a text file 'activity' (provided as an attachment):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> The following script works correctly:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING PigStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> After running this script, the 'output' file contains the correct result:
> {code}
> 00	{(1239698069000,)}
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> The issue occurs when I use BinStorage() instead of PigStorage() to store and reload the temporary file. In that case the 'output' file is incomplete (the bag for sid 00 is missing):
> {code}
> 00	
> 01	{(1239698505000,b),(1239698369000,a)}
> 02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
> 03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
> 04	{(1239698417000,c)}
> {code}
> The failing script is the following:
> {code}
> -- load input data
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
> -- group input data
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> -- store grouped activities in a temporary file
> STORE activities INTO 'tmp' USING BinStorage();
> -- reload grouped activities from the temporary file
> activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
> -- store grouped activities again in an output file
> STORE activities INTO 'output' USING PigStorage();
> {code}
> So the issue seems to lie in the way BinStorage() stores or loads bags.
