Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2011/08/29 17:05:41 UTC

PIG behavior changed between 0.8.1 and 0.9.x?

I'm currently testing PIG 0.9.x branch.
Several of my jobs that used to work correctly with PIG 0.8.1 now
fail due to a cast error returning a null pointer in one of my UDF
functions.

Apparently, PIG seems to be unable to convert some data to a bag 
when some of the tuple fields are null:

2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger | 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: 
Unable to interpret value {(1239698069000,)} in field being 
converted to type bag, caught ParseException <Cannot convert 
(1239698069000,) to null:(timestamp:long,name:chararray)> field 
discarded
2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner | 
job_local_0019

My UDF function is:

   /**
    *...
    * @param start start of the session (in milliseconds since epoch)
    * @param end end of the session (in milliseconds since epoch)
    * @param activities a bag containing a set of activities in the 
form of a set of (timestamp:long,
    *          name:chararray) tuples
    * ...
    */
   public DataBag exec(Tuple input) throws IOException
   {
     /* Get session's start/end timestamps */
     long startSession = (Long) input.get(0);
     long endSession = (Long) input.get(1);

     /* Get the activity bag */
     DataBag activityBag = (DataBag) input.get(2);

                                      ^  here


Is that a regression? Any idea how to fix this?

Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
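[Editor's note] For readers hitting the same symptom: when an upstream cast is discarded, as in the POCast warning above, the field reaches the UDF as null, and the blind `(DataBag) input.get(2)` then propagates that null. A defensive check makes the failure explicit. The sketch below is illustrative only; it uses a plain `List<Object>` to stand in for Pig's `Tuple`, and the class and method names are made up for the example:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only (not the poster's UDF): a plain List<Object>
// stands in for Pig's Tuple. Failing fast with a clear message beats a
// NullPointerException deeper in the job.
public class NullSafeFieldAccess {

    static Object requireField(List<Object> tuple, int index, String what) {
        if (index >= tuple.size() || tuple.get(index) == null) {
            throw new IllegalArgumentException(
                "field " + index + " (" + what + ") is null or missing; "
                + "an upstream cast to bag may have been discarded");
        }
        return tuple.get(index);
    }

    public static void main(String[] args) {
        // Mimics the failing input: (start, end, activities) with a null bag.
        List<Object> input =
            Arrays.<Object>asList(1239698069000L, 1239698505000L, null);
        try {
            requireField(input, 2, "activity bag");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```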

Re: PIG regression between 0.8.1 and 0.9.x

Posted by Vincent Barat <vi...@gmail.com>.
That line should read

DataBag activityBag = (DataBag) input.get(0);

of course

On 07/09/11 16:17, Vincent Barat wrote:
> DataBag activityBag = (DataBag) input.get(2); 

Re: PIG regression in 0.9.1's BinStorage()

Posted by Thejas Nair <th...@hortonworks.com>.
This was a bug in the cast operation, triggered while applying the schema
specified in the second LOAD statement.

A patch is available in the JIRA (PIG-2271).

-Thejas




On 10/6/11 9:09 AM, Vincent Barat wrote:
> Hi,
>
> I did some more investigation and updated the issue to provide a very easy
> way to reproduce it.
> This seems to be an important regression in BinStorage().
>
> https://issues.apache.org/jira/browse/PIG-2271
>
> On 09/09/11 11:36, Vincent Barat wrote:
>> Issue reported:
>>
>> https://issues.apache.org/jira/browse/PIG-2271
>>
>> On 07/09/11 20:52, Kevin Burton wrote:
>>> I believe that everything is byte array at first but I may be wrong… at
>>> least this has been the situation in my experiments.
>>>
>>> It is best to always specify schema though. Unless you're using Zebra
>>> which
>>> stores the schema directly (which is very handy btw).
>>>
>>> You could also try InterStorage (which you can use directly via the full
>>> classname) as it is more efficient if I recall correctly.
>>>
>>> While it probably would be nice for you to submit a bug and of course
>>> you can wait until it is fixed, it's probably faster for you to just
>>> work around it…
>>>
>>> Kevin
>>>
>>> On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes<co...@tynt.com> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I think we might be seeing something related to this problem and can
>>>> confirm
>>>> it's in BinStorage for us.
>>>>
>>>> We stored referrer_stats_by_site using BinStorage. Here is a
>>>> describe of
>>>> the alias:
>>>>> referrer_stats_by_site: {site: chararray,{(referrerdomain:
>>>> chararray,lcnt:
>>>> long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}
>>>>
>>>> Now we try to load that data:
>>>> referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
>>>> referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
>>>> tcnt:long,
>>>> referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
>>>> tcnt:long)})});
>>>>
>>>> but when we do we cannot find a certain 'site'.
>>>>
>>>> When we don't provide the schema:
>>>> referrers = LOAD 'mydata' USING BinStorage();
>>>>
>>>> It will load, but referrerdomain is a bytearray instead of chararray.
>>>> Is Pig supposed to automatically cast this to a chararray for me? Is
>>>> there any reason why this data won't load unless we change the type to
>>>> bytearray?
>>>>
>>>>
>>>> On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <hashutosh@apache.org> wrote:
>>>>> Vincent,
>>>>>
>>>>> Thanks for your hard work in isolating the bug. It's a perfect bug
>>>>> report. It seems like a regression. Can you please open a JIRA with
>>>>> test data and a script (one that works in 0.8.1 and fails in 0.9)?
>>>>>
>>>>> Ashutosh
>>>>>
>>>>> On Wed, Sep 7, 2011 at 07:17, Vincent Barat<vi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I really need your help on this one! I've worked hard to isolate the
>>>>>> regression.
>>>>>> I'm using the 0.9.x branch (tested at 2011-09-07).
>>>>>>
>>>>>> I have a UDF function that takes a bag as input:
>>>>>>
>>>>>> public DataBag exec(Tuple input) throws IOException
>>>>>> {
>>>>>> /* Get the activity bag */
>>>>>> DataBag activityBag = (DataBag) input.get(2);
>>>>>> …
>>>>>>
>>>>>> My input data are read from a text file 'activity' (same issue when
>>>>>> they are read from HBase):
>>>>>> 00,1239698069000, <- this is the line that is not handled correctly
>>>>>> 01,1239698505000,b
>>>>>> 01,1239698369000,a
>>>>>> 02,1239698413000,b
>>>>>> 02,1239698553000,c
>>>>>> 02,1239698313000,a
>>>>>> 03,1239698316000,a
>>>>>> 03,1239698516000,c
>>>>>> 03,1239698416000,b
>>>>>> 03,1239698621000,d
>>>>>> 04,1239698417000,c
>>>>>>
>>>>>> My first script is working correctly:
>>>>>>
>>>>>> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
>>>>>> timestamp:long, name:chararray);
>>>>>> activities = GROUP activities BY sid;
>>>>>> activities = FOREACH activities GENERATE group,
>>>>>> MyUDF(activities.(timestamp, name));
>>>>>> store activities;
>>>>>>
>>>>>> N.B. the name of the first activity is correctly set to null in my
>>>>>> UDF
>>>>>> function.
>>>>>>
>>>>>> The issue occurs when I store my data into a binary file and reload
>>>>>> it before processing (I do this to improve the computation time,
>>>>>> since HDFS is much faster than HBase).
>>>>>>
>>>>>> Second script, which triggers an error (this script works correctly
>>>>>> with PIG 0.8.1):
>>>>>>
>>>>>> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
>>>>>> timestamp:long, name:chararray);
>>>>>> activities = GROUP activities BY sid;
>>>>>> activities = FOREACH activities GENERATE group,
>>>>>> activities.(timestamp,
>>>>>> name);
>>>>>> STORE activities INTO 'activities' USING BinStorage;
>>>>>> activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
>>>>>> activities:bag { activity: (timestamp:long, name:chararray) });
>>>>>> activities = FOREACH activities GENERATE sid, MyUDF(activities);
>>>>>> store activities;
>>>>>>
>>>>>> In this script, when MyUDF is called, activityBag is null, and a
>>>>>> warning is issued:
>>>>>>
>>>>>> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
>>>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>>>>>> Unable to interpret value {(1239698069000,)} in field being converted
>>>>>> to type bag, caught ParseException <Cannot convert (1239698069000,) to
>>>>>> null:(timestamp:long,name:chararray)> field discarded
>>>>>>
>>>>>> I guess that the regression is located in BinStorage.
>>>>>>
>>>>>> On 30/08/11 19:13, Daniel Dai wrote:
>>>>>>
>>>>>>> Interesting, the log message seems clear: "Cannot convert
>>>>>>> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
>>>>>>> cannot find an explanation for it. I verified that such a conversion
>>>>>>> should be valid on 0.9. Can you show me the script?
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat <vincent.barat@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have experienced the same issue by loading the data from raw text
>>>>> files
>>>>>>>> (using PIG server in local mode and the regular PIG loader) and
>>>>>>>> from
>>>>>>>> HBaseStorage.
>>>>>>>> The issue is exactly the same in both cases: each time a NULL
>>>>>>>> string
>>>> is
>>>>>>>> encountered, the cast to a data bag cannot be done.
>>>>>>>>
>>>>>>>> On 29/08/11 19:12, Dmitriy Ryaboy wrote:
>>>>>>>>
>>>>>>>>> How are you loading this data?
>>>>>>>>>
>>>>>>>>> D


PIG regression in 0.9.1's BinStorage()

Posted by Vincent Barat <vi...@gmail.com>.
Hi,

I did some more investigation and updated the issue to provide a very
easy way to reproduce it.
This seems to be an important regression in BinStorage().

https://issues.apache.org/jira/browse/PIG-2271


Re: PIG regression between 0.8.1 and 0.9.x

Posted by Vincent Barat <vi...@gmail.com>.
Issue reported:

https://issues.apache.org/jira/browse/PIG-2271
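[Editor's note] While PIG-2271 is open, one pragmatic workaround suggested by the thread is to avoid BinStorage for the intermediate store/load. The sketch below is untested: the aliases follow the scripts quoted earlier in the thread, and whether PigStorage round-trips the null name field cleanly here is an assumption to verify:

```
-- Workaround sketch (untested): round-trip the intermediate result
-- through PigStorage instead of BinStorage, sidestepping the
-- BinStorage cast path that discards the bag when a field is null.
STORE activities INTO 'activities_txt' USING PigStorage('\t');
activities = LOAD 'activities_txt' USING PigStorage('\t')
    AS (sid:chararray,
        activities:bag { activity: (timestamp:long, name:chararray) });
```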


Re: PIG regression between 0.8.1 and 0.9.x

Posted by Kevin Burton <bu...@spinn3r.com>.
I believe that everything is byte array at first but I may be wrong… at
least this has been the situation in my experiments.

It is best to always specify a schema, though. Unless you're using Zebra,
which stores the schema directly (which is very handy, btw).

You could also try InterStorage (which you can use directly via the full
classname) as it is more efficient if I recall correctly.

While it probably would be nice for you to submit a bug and of course you
can wait until it is fixed, it's probably faster for you to just work around
it…

Kevin
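[Editor's note] Kevin's InterStorage suggestion, spelled out as a sketch. The full class name is assumed here (org.apache.pig.impl.io.InterStorage in the versions I have seen) and should be verified against your Pig build, since it lives in an internal package:

```
-- Sketch only: InterStorage workaround, class name assumed.
-- Loading without an AS clause avoids the cast path entirely.
STORE activities INTO 'activities_bin'
    USING org.apache.pig.impl.io.InterStorage();
activities = LOAD 'activities_bin'
    USING org.apache.pig.impl.io.InterStorage();
```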

On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes <co...@tynt.com> wrote:

> Hi there,
>
> I think we might be seeing something related to this problem and can
> confirm
> it's in BinStorage for us.
>
> We stored referrer_stats_by_site using BinStorage.  Here is a describe of
> the alias:
> > referrer_stats_by_site: {site: chararray,{(referrerdomain:
> chararray,lcnt:
> long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}
>
> Now we try to load that data:
> referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
> referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
> tcnt:long,
> referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
> tcnt:long)})});
>
> but when we do we cannot find a certain 'site'.
>
> When we don't provide the schema:
> referrers = LOAD 'mydata' USING BinStorage();
>
> It will load but referrerdomain is a bytearray instead of chararray.  Is
> pig
> supposed to automatically cast this to a chararray for me?  Is there any
> reason why this data won't load unless we change the type to bytearray?
>
>
> On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <hashutosh@apache.org
> >wrote:
>
> > Vincent,
> >
> > Thanks for your hard work in isolating the bug. Its a perfect bug report.
> > Seems like its a regression. Can you please open a jira with test data
> and
> > script (which works in 0.8.1 and fails in 0.9)
> >
> > Ashutosh
> >
> > On Wed, Sep 7, 2011 at 07:17, Vincent Barat <vi...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I really need your help on this one! I've worked hard to isolate the
> > > regression.
> > > I'm using the 0.9.x branch (tested at 2011-09-07).
> > >
> > > I've an UDF function that takes a bag as input:
> > >
> > > public DataBag exec(Tuple input) throws IOException
> > > {
> > > /* Get the activity bag */
> > > DataBag activityBag = (DataBag) input.get(2);
> > > …
> > >
> > > My input data are read form a text file 'activity' (same issue when
> they
> > > are read from HBase):
> > > 00,1239698069000, <- this is the line that is not correctly handled
> > > 01,1239698505000,b
> > > 01,1239698369000,a
> > > 02,1239698413000,b
> > > 02,1239698553000,c
> > > 02,1239698313000,a
> > > 03,1239698316000,a
> > > 03,1239698516000,c
> > > 03,1239698416000,b
> > > 03,1239698621000,d
> > > 04,1239698417000,c
> > >
> > > My first script is working correctly:
> > >
> > > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > > timestamp:long, name:chararray);
> > > activities = GROUP activities BY sid;
> > > activities = FOREACH activities GENERATE group,
> > > MyUDF(activities.(timestamp, name));
> > > store activities;
> > >
> > > N.B. the name of the first activity is correctly set to null in my UDF
> > > function.
> > >
> > > The issue occurs when I store my data into a binary file are relaod
> them
> > > before processing (I do this to improve the computation time, since
> HDFS
> > is
> > > much faster than HBase).
> > >
> > > Second script that triggers an error (this script work correctly with
> PIG
> > > 0.8.1):
> > >
> > > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > > timestamp:long, name:chararray);
> > > activities = GROUP activities BY sid;
> > > activities = FOREACH activities GENERATE group, activities.(timestamp,
> > > name);
> > > STORE activities INTO 'activities' USING BinStorage;
> > > activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> > > activities:bag { activity: (timestamp:long, name:chararray) });
> > > activities = FOREACH activities GENERATE sid, MyUDF(activities);
> > > store activities;
> > >
> > > In this script, when MyUDF is calles, activityBag is null, and a
> warning
> > is
> > > issued:
> > >
> > > 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> > > Unable to interpret value {(1239698069000,)} in field being
> converted
> > > to
> > > type bag, caught ParseException <Cannot convert (1239698069000,) to
> > > null:(timestamp:long,name:chararray)> field discarded
> > >
> > > I guess that the regression is located in BinStorage
> > >
> > > Le 30/08/11 19:13, Daniel Dai a écrit :
> > >
> > >> Interesting, the log message seems to be clear, "Cannot convert
> > >> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
> > >> cannot find an explanation to that. I verified such conversion should
> > >> be valid on 0.9. Can you show me the script?
> > >>
> > >> Daniel
> > >>
> > >> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat<
> vincent.barat@gmail.com>
> > >>  wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I have experienced the same issue by loading the data from raw text
> > files
> > >>> (using PIG server in local mode and the regular PIG loader) and from
> > >>> HBaseStorage.
> > >>> The issue is exactly the same in both cases: each time a NULL string
> is
> > >>> encountered, the cast to a data bag cannot be done.
> > >>>
> > >>> Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
> > >>>
> > >>>> How are you loading this data?
> > >>>>
> > >>>> D
> > >>>>
> > >>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
> > >>>> Barat<vi...@gmail.com> wrote:
> > >>>>
> > >>>>  I'm currently testing PIG 0.9.x branch.
> > >>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now
> fail
> > >>>>> due
> > >>>>> to a cast error returning a null pointer in one of my UDF function.
> > >>>>>
> > >>>>> Apparently, PIG seems to be unable to convert some data to a bag
> when
> > >>>>> some
> > >>>>> of the tuple fields are null:
> > >>>>>
> > >>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
> > >>>>>
> > >>>>>
> > >>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> > >>>>> Unable to interpret value {(1239698069000,)} in field being
> converted
> > >>>>> to
> > >>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
> > >>>>> null:(timestamp:long,name:chararray)>   field discarded
> > >>>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
> > >>>>> job_local_0019
> > >>>>>
> > >>>>> My UDF function is:
> > >>>>>
> > >>>>>  /**
> > >>>>>   *...
> > >>>>>   * @param start start of the session (in milliseconds since epoch)
> > >>>>>   * @param end end of the session (in milliseconds since epoch)
> > >>>>>   * @param activities a bag containing a set of activities in the
> > form
> > >>>>> of
> > >>>>> a
> > >>>>> set of (timestamp:long,
> > >>>>>   *          name:chararray) tuples
> > >>>>>   * ...
> > >>>>>   */
> > >>>>>  public DataBag exec(Tuple input) throws IOException
> > >>>>>  {
> > >>>>>    /* Get session's start/end timestamps */
> > >>>>>    long startSession = (Long) input.get(0);
> > >>>>>    long endSession = (Long) input.get(1);
> > >>>>>
> > >>>>>    /* Get the activity bag */
> > >>>>>    DataBag activityBag = (DataBag) input.get(2);
> > >>>>>
> > >>>>>                                     ^  here
> > >>>>>
> > >>>>>
> > >>>>> Is that a regression ? Any idea to fix this ?
> > >>>>>
> > >>>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
> > >>>>>
> > >>>>>
> >
>



-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: PIG regression between 0.8.1 and 0.9.x

Posted by Corbin Hoenes <co...@tynt.com>.
Hi there,

I think we might be seeing something related to this problem and can confirm
it's in BinStorage for us.

We stored referrer_stats_by_site using BinStorage.  Here is a describe of
the alias:
> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,lcnt:
long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}

Now we try to load that data:
referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
tcnt:long,
referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
tcnt:long)})});

But when we do, we cannot find a certain 'site'.

When we don't provide the schema:
referrers = LOAD 'mydata' USING BinStorage();

It will load, but referrerdomain is a bytearray instead of a chararray.  Is Pig
supposed to automatically cast this to a chararray for me?  Is there any
reason why this data won't load unless we change the type to bytearray?
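For what it's worth, here is a small Python sketch (purely illustrative; this is not Pig's code, and `example.com` is a made-up site value) of why a field left as a bytearray can make lookups miss: raw bytes never compare equal to a string, so a filter on site against a chararray literal would match nothing until an explicit cast is applied.

```python
# Illustration only -- not Pig internals. Models why a field that stays
# a bytearray (raw bytes) fails comparisons against chararray (string)
# literals until it is explicitly cast.
raw_site = b"example.com"   # hypothetical value, as a loader might yield it
wanted = "example.com"      # the chararray literal used in the script

print(raw_site == wanted)                   # False: bytes never equal str
print(raw_site.decode("utf-8") == wanted)   # True once decoded ("cast")
```

In Pig terms the analogous workaround would be an explicit `(chararray)` cast in a FOREACH before filtering; whether 0.9.x should insert that cast automatically is exactly the question above.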


On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <ha...@apache.org>wrote:

> Vincent,
>
> Thanks for your hard work in isolating the bug. It's a perfect bug report.
> It seems like a regression. Can you please open a JIRA with the test data and
> script (which works in 0.8.1 and fails in 0.9)?
>
> Ashutosh
>
> On Wed, Sep 7, 2011 at 07:17, Vincent Barat <vi...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I really need your help on this one! I've worked hard to isolate the
> > regression.
> > I'm using the 0.9.x branch (tested at 2011-09-07).
> >
> > I have a UDF function that takes a bag as input:
> >
> > public DataBag exec(Tuple input) throws IOException
> > {
> > /* Get the activity bag */
> > DataBag activityBag = (DataBag) input.get(2);
> > …
> >
> > My input data are read from a text file 'activity' (same issue when they
> > are read from HBase):
> > 00,1239698069000, <- this is the line that is not correctly handled
> > 01,1239698505000,b
> > 01,1239698369000,a
> > 02,1239698413000,b
> > 02,1239698553000,c
> > 02,1239698313000,a
> > 03,1239698316000,a
> > 03,1239698516000,c
> > 03,1239698416000,b
> > 03,1239698621000,d
> > 04,1239698417000,c
> >
> > My first script is working correctly:
> >
> > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > timestamp:long, name:chararray);
> > activities = GROUP activities BY sid;
> > activities = FOREACH activities GENERATE group,
> > MyUDF(activities.(timestamp, name));
> > store activities;
> >
> > N.B. the name of the first activity is correctly set to null in my UDF
> > function.
> >
> > The issue occurs when I store my data into a binary file and reload them
> > before processing (I do this to improve the computation time, since HDFS
> is
> > much faster than HBase).
> >
> > Second script that triggers an error (this script works correctly with PIG
> > 0.8.1):
> >
> > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > timestamp:long, name:chararray);
> > activities = GROUP activities BY sid;
> > activities = FOREACH activities GENERATE group, activities.(timestamp,
> > name);
> > STORE activities INTO 'activities' USING BinStorage;
> > activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> > activities:bag { activity: (timestamp:long, name:chararray) });
> > activities = FOREACH activities GENERATE sid, MyUDF(activities);
> > store activities;
> >
> > In this script, when MyUDF is called, activityBag is null, and a warning
> is
> > issued:
> >
> > 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> > Unable to interpret value {(1239698069000,)} in field being converted to
> > type bag, caught ParseException <Cannot convert (1239698069000,) to
> > null:(timestamp:long,name:chararray)> field discarded
> >
> > I guess that the regression is located in BinStorage
> >
> > Le 30/08/11 19:13, Daniel Dai a écrit :
> >
> >> Interesting, the log message seems to be clear, "Cannot convert
> >> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
> >> cannot find an explanation to that. I verified such conversion should
> >> be valid on 0.9. Can you show me the script?
> >>
> >> Daniel
> >>
> >> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat<vi...@gmail.com>
> >>  wrote:
> >>
> >>> Hi,
> >>>
> >>> I have experienced the same issue by loading the data from raw text
> files
> >>> (using PIG server in local mode and the regular PIG loader) and from
> >>> HBaseStorage.
> >>> The issue is exactly the same in both cases: each time a NULL string is
> >>> encountered, the cast to a data bag cannot be done.
> >>>
> >>> Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
> >>>
> >>>> How are you loading this data?
> >>>>
> >>>> D
> >>>>
> >>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
> >>>> Barat<vi...@gmail.com> wrote:
> >>>>
> >>>>  I'm currently testing PIG 0.9.x branch.
> >>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail
> >>>>> due
> >>>>> to a cast error returning a null pointer in one of my UDF function.
> >>>>>
> >>>>> Apparently, PIG seems to be unable to convert some data to a bag when
> >>>>> some
> >>>>> of the tuple fields are null:
> >>>>>
> >>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
> >>>>>
> >>>>>
> >>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> >>>>> Unable to interpret value {(1239698069000,)} in field being converted
> >>>>> to
> >>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
> >>>>> null:(timestamp:long,name:chararray)>   field discarded
> >>>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
> >>>>> job_local_0019
> >>>>>
> >>>>> My UDF functions is:
> >>>>>
> >>>>>  /**
> >>>>>   *...
> >>>>>   * @param start start of the session (in milliseconds since epoch)
> >>>>>   * @param end end of the session (in milliseconds since epoch)
> >>>>>   * @param activities a bag containing a set of activities in the
> form
> >>>>> of
> >>>>> a
> >>>>> set of (timestamp:long,
> >>>>>   *          name:chararray) tuples
> >>>>>   * ...
> >>>>>   */
> >>>>>  public DataBag exec(Tuple input) throws IOException
> >>>>>  {
> >>>>>    /* Get session's start/end timestamps */
> >>>>>    long startSession = (Long) input.get(0);
> >>>>>    long endSession = (Long) input.get(1);
> >>>>>
> >>>>>    /* Get the activity bag */
> >>>>>    DataBag activityBag = (DataBag) input.get(2);
> >>>>>
> >>>>>                                     ^  here
> >>>>>
> >>>>>
> >>>>> Is that a regression ? Any idea to fix this ?
> >>>>>
> >>>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
> >>>>>
> >>>>>
>

Re: PIG regression between 0.8.1 and 0.9.x

Posted by Ashutosh Chauhan <ha...@apache.org>.
Vincent,

Thanks for your hard work in isolating the bug. It's a perfect bug report.
It seems like a regression. Can you please open a JIRA with the test data and
script (which works in 0.8.1 and fails in 0.9)?

Ashutosh

On Wed, Sep 7, 2011 at 07:17, Vincent Barat <vi...@gmail.com> wrote:

> Hi,
>
> I really need your help on this one! I've worked hard to isolate the
> regression.
> I'm using the 0.9.x branch (tested at 2011-09-07).
>
> I have a UDF function that takes a bag as input:
>
> public DataBag exec(Tuple input) throws IOException
> {
> /* Get the activity bag */
> DataBag activityBag = (DataBag) input.get(2);
> …
>
> My input data are read from a text file 'activity' (same issue when they
> are read from HBase):
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
>
> My first script is working correctly:
>
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group,
> MyUDF(activities.(timestamp, name));
> store activities;
>
> N.B. the name of the first activity is correctly set to null in my UDF
> function.
>
> The issue occurs when I store my data into a binary file and reload them
> before processing (I do this to improve the computation time, since HDFS is
> much faster than HBase).
>
> Second script that triggers an error (this script works correctly with PIG
> 0.8.1):
>
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp,
> name);
> STORE activities INTO 'activities' USING BinStorage;
> activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> activities:bag { activity: (timestamp:long, name:chararray) });
> activities = FOREACH activities GENERATE sid, MyUDF(activities);
> store activities;
>
> In this script, when MyUDF is called, activityBag is null, and a warning is
> issued:
>
> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> Unable to interpret value {(1239698069000,)} in field being converted to
> type bag, caught ParseException <Cannot convert (1239698069000,) to
> null:(timestamp:long,name:chararray)> field discarded
>
> I guess that the regression is located in BinStorage
>
> Le 30/08/11 19:13, Daniel Dai a écrit :
>
>> Interesting, the log message seems to be clear, "Cannot convert
>> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
>> cannot find an explanation to that. I verified such conversion should
>> be valid on 0.9. Can you show me the script?
>>
>> Daniel
>>
>> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat<vi...@gmail.com>
>>  wrote:
>>
>>> Hi,
>>>
>>> I have experienced the same issue by loading the data from raw text files
>>> (using PIG server in local mode and the regular PIG loader) and from
>>> HBaseStorage.
>>> The issue is exactly the same in both cases: each time a NULL string is
>>> encountered, the cast to a data bag cannot be done.
>>>
>>> Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
>>>
>>>> How are you loading this data?
>>>>
>>>> D
>>>>
>>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
>>>> Barat<vi...@gmail.com> wrote:
>>>>
>>>>  I'm currently testing PIG 0.9.x branch.
>>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail
>>>>> due
>>>>> to a cast error returning a null pointer in one of my UDF function.
>>>>>
>>>>> Apparently, PIG seems to be unable to convert some data to a bag when
>>>>> some
>>>>> of the tuple fields are null:
>>>>>
>>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
>>>>>
>>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>>>>> Unable to interpret value {(1239698069000,)} in field being converted
>>>>> to
>>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
>>>>> null:(timestamp:long,name:chararray)>   field discarded
>>>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
>>>>> job_local_0019
>>>>>
>>>>> My UDF functions is:
>>>>>
>>>>>  /**
>>>>>   *...
>>>>>   * @param start start of the session (in milliseconds since epoch)
>>>>>   * @param end end of the session (in milliseconds since epoch)
>>>>>   * @param activities a bag containing a set of activities in the form
>>>>> of
>>>>> a
>>>>> set of (timestamp:long,
>>>>>   *          name:chararray) tuples
>>>>>   * ...
>>>>>   */
>>>>>  public DataBag exec(Tuple input) throws IOException
>>>>>  {
>>>>>    /* Get session's start/end timestamps */
>>>>>    long startSession = (Long) input.get(0);
>>>>>    long endSession = (Long) input.get(1);
>>>>>
>>>>>    /* Get the activity bag */
>>>>>    DataBag activityBag = (DataBag) input.get(2);
>>>>>
>>>>>                                     ^  here
>>>>>
>>>>>
>>>>> Is that a regression ? Any idea to fix this ?
>>>>>
>>>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
>>>>>
>>>>>

PIG regression between 0.8.1 and 0.9.x

Posted by Vincent Barat <vi...@gmail.com>.
Hi,

I really need your help on this one! I've worked hard to isolate the 
regression.
I'm using the 0.9.x branch (tested at 2011-09-07).

I have a UDF function that takes a bag as input:

public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(2);
…

My input data are read from a text file 'activity' (the same issue occurs 
when they are read from HBase):
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c

My first script is working correctly:

activities = LOAD 'activity' USING PigStorage(',') AS 
(sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, 
MyUDF(activities.(timestamp, name));
store activities;

N.B. the name of the first activity is correctly set to null in my 
UDF function.

The issue occurs when I store my data into a binary file and reload 
it before processing (I do this to improve the computation time, 
since HDFS is much faster than HBase).

Second script, which triggers an error (this script works correctly 
with PIG 0.8.1):

activities = LOAD 'activity' USING PigStorage(',') AS 
(sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, 
activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray, 
activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities;

In this script, when MyUDF is called, activityBag is null, and a 
warning is issued:

2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger | 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast: 
Unable to interpret value {(1239698069000,)} in field being 
converted to type bag, caught ParseException <Cannot convert 
(1239698069000,) to null:(timestamp:long,name:chararray)> field 
discarded

I guess that the regression is located in BinStorage.
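To make the failure mode concrete, here is a small Python sketch (an illustration of the edge case, not Pig's actual POCast/BinStorage code; `parse_tuple` is a made-up helper) of parsing a serialized tuple such as (1239698069000,) against the schema (timestamp:long, name:chararray). A parser that maps the empty trailing field to null succeeds; one that rejects empty fields reproduces the "Cannot convert" warning above:

```python
# Illustration only -- not Pig source code. Models how "(1239698069000,)"
# should parse against the schema (timestamp:long, name:chararray):
# the empty trailing field must become null, not a parse failure.

def parse_tuple(text, schema):
    """Parse a serialized tuple "(v1,v2,...)", mapping empty fields to None."""
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("not a tuple: " + text)
    fields = text[1:-1].split(",")
    if len(fields) != len(schema):
        raise ValueError("arity mismatch")
    out = []
    for raw, cast in zip(fields, schema):
        # A null field serializes as an empty string; a tolerant parser
        # keeps the null instead of discarding the whole field.
        out.append(None if raw == "" else cast(raw))
    return tuple(out)

schema = (int, str)  # stands in for (timestamp:long, name:chararray)
print(parse_tuple("(1239698069000,)", schema))   # (1239698069000, None)
print(parse_tuple("(1239698505000,b)", schema))  # (1239698505000, 'b')
```

A real parser also has to handle nested tuples/bags and escaping; the point is only that a trailing null field is valid data and should round-trip through STORE/LOAD.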

Le 30/08/11 19:13, Daniel Dai a écrit :
> Interesting, the log message seems to be clear, "Cannot convert
> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
> cannot find an explanation to that. I verified such conversion should
> be valid on 0.9. Can you show me the script?
>
> Daniel
>
> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat<vi...@gmail.com>  wrote:
>> Hi,
>>
>> I have experienced the same issue by loading the data from raw text files
>> (using PIG server in local mode and the regular PIG loader) and from
>> HBaseStorage.
>> The issue is exactly the same in both cases: each time a NULL string is
>> encountered, the cast to a data bag cannot be done.
>>
>> Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
>>> How are you loading this data?
>>>
>>> D
>>>
>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
>>> Barat<vi...@gmail.com> wrote:
>>>
>>>> I'm currently testing PIG 0.9.x branch.
>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail due
>>>> to a cast error returning a null pointer in one of my UDF function.
>>>>
>>>> Apparently, PIG seems to be unable to convert some data to a bag when
>>>> some
>>>> of the tuple fields are null:
>>>>
>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
>>>>
>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>>>> Unable to interpret value {(1239698069000,)} in field being converted to
>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
>>>> null:(timestamp:long,name:chararray)>   field discarded
>>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
>>>> job_local_0019
>>>>
>>>> My UDF function is:
>>>>
>>>>   /**
>>>>    *...
>>>>    * @param start start of the session (in milliseconds since epoch)
>>>>    * @param end end of the session (in milliseconds since epoch)
>>>>    * @param activities a bag containing a set of activities in the form of
>>>> a
>>>> set of (timestamp:long,
>>>>    *          name:chararray) tuples
>>>>    * ...
>>>>    */
>>>>   public DataBag exec(Tuple input) throws IOException
>>>>   {
>>>>     /* Get session's start/end timestamps */
>>>>     long startSession = (Long) input.get(0);
>>>>     long endSession = (Long) input.get(1);
>>>>
>>>>     /* Get the activity bag */
>>>>     DataBag activityBag = (DataBag) input.get(2);
>>>>
>>>>                                      ^  here
>>>>
>>>>
>>>> Is that a regression ? Any idea to fix this ?
>>>>
>>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
>>>>

Re: PIG behavior changed between 0.8.1 and 0.9.x ?

Posted by Daniel Dai <da...@hortonworks.com>.
Interesting, the log message seems clear: "Cannot convert
(1239698069000,) to null:(timestamp:long,name:chararray)", but I
cannot find an explanation for it. I verified that such a conversion should
be valid on 0.9. Can you show me the script?

Daniel

On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat <vi...@gmail.com> wrote:
> Hi,
>
> I have experienced the same issue by loading the data from raw text files
> (using PIG server in local mode and the regular PIG loader) and from
> HBaseStorage.
> The issue is exactly the same in both cases: each time a NULL string is
> encountered, the cast to a data bag cannot be done.
>
> Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
>>
>> How are you loading this data?
>>
>> D
>>
>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
>> Barat<vi...@gmail.com> wrote:
>>
>>> I'm currently testing PIG 0.9.x branch.
>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail due
>>> to a cast error returning a null pointer in one of my UDF function.
>>>
>>> Apparently, PIG seems to be unable to convert some data to a bag when
>>> some
>>> of the tuple fields are null:
>>>
>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
>>>
>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>>> Unable to interpret value {(1239698069000,)} in field being converted to
>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
>>> null:(timestamp:long,name:chararray)>  field discarded
>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
>>> job_local_0019
>>>
>>> My UDF functions is:
>>>
>>>  /**
>>>   *...
>>>   * @param start start of the session (in milliseconds since epoch)
>>>   * @param end end of the session (in milliseconds since epoch)
>>>   * @param activities a bag containing a set of activities in the form of
>>> a
>>> set of (timestamp:long,
>>>   *          name:chararray) tuples
>>>   * ...
>>>   */
>>>  public DataBag exec(Tuple input) throws IOException
>>>  {
>>>    /* Get session's start/end timestamps */
>>>    long startSession = (Long) input.get(0);
>>>    long endSession = (Long) input.get(1);
>>>
>>>    /* Get the activity bag */
>>>    DataBag activityBag = (DataBag) input.get(2);
>>>
>>>                                     ^  here
>>>
>>>
>>> Is that a regression ? Any idea to fix this ?
>>>
>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
>>>
>

Re: PIG behavior changed between 0.8.1 and 0.9.x ?

Posted by Vincent Barat <vi...@gmail.com>.
Hi,

I have experienced the same issue when loading the data from raw text 
files (using the PIG server in local mode and the regular PIG loader) 
and from HBaseStorage.
The issue is exactly the same in both cases: each time a NULL string 
is encountered, the cast to a data bag cannot be done.

Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
> How are you loading this data?
>
> D
>
> On Mon, Aug 29, 2011 at 8:05 AM, Vincent Barat<vi...@gmail.com> wrote:
>
>> I'm currently testing PIG 0.9.x branch.
>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail due
>> to a cast error returning a null pointer in one of my UDF function.
>>
>> Apparently, PIG seems to be unable to convert some data to a bag when some
>> of the tuple fields are null:
>>
>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>> Unable to interpret value {(1239698069000,)} in field being converted to
>> type bag, caught ParseException<Cannot convert (1239698069000,) to
>> null:(timestamp:long,name:chararray)>  field discarded
>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
>> job_local_0019
>>
>> My UDF function is:
>>
>>   /**
>>    *...
>>    * @param start start of the session (in milliseconds since epoch)
>>    * @param end end of the session (in milliseconds since epoch)
>>    * @param activities a bag containing a set of activities in the form of a
>> set of (timestamp:long,
>>    *          name:chararray) tuples
>>    * ...
>>    */
>>   public DataBag exec(Tuple input) throws IOException
>>   {
>>     /* Get session's start/end timestamps */
>>     long startSession = (Long) input.get(0);
>>     long endSession = (Long) input.get(1);
>>
>>     /* Get the activity bag */
>>     DataBag activityBag = (DataBag) input.get(2);
>>
>>                                      ^  here
>>
>>
>> Is that a regression ? Any idea to fix this ?
>>
>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
>>

Re: PIG behavior changed between 0.8.1 and 0.9.x ?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
How are you loading this data?

D

On Mon, Aug 29, 2011 at 8:05 AM, Vincent Barat <vi...@gmail.com> wrote:

> I'm currently testing PIG 0.9.x branch.
> Several of my jobs that used to work correctly with PIG 0.8.1 now fail due
> to a cast error returning a null pointer in one of my UDF function.
>
> Apparently, PIG seems to be unable to convert some data to a bag when some
> of the tuple fields are null:
>
> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> Unable to interpret value {(1239698069000,)} in field being converted to
> type bag, caught ParseException <Cannot convert (1239698069000,) to
> null:(timestamp:long,name:chararray)> field discarded
> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
> job_local_0019
>
> My UDF function is:
>
>  /**
>   *...
>   * @param start start of the session (in milliseconds since epoch)
>   * @param end end of the session (in milliseconds since epoch)
>   * @param activities a bag containing a set of activities in the form of a
> set of (timestamp:long,
>   *          name:chararray) tuples
>   * ...
>   */
>  public DataBag exec(Tuple input) throws IOException
>  {
>    /* Get session's start/end timestamps */
>    long startSession = (Long) input.get(0);
>    long endSession = (Long) input.get(1);
>
>    /* Get the activity bag */
>    DataBag activityBag = (DataBag) input.get(2);
>
>                                     ^  here
>
>
> Is that a regression ? Any idea to fix this ?
>
> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
>