Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2011/10/06 18:09:48 UTC

PIG regression in 0.9.1's BinStorage()

Hi,

I did some more investigation and updated the issue to provide a very
easy way to reproduce it.
This seems to be an important regression in BinStorage().

https://issues.apache.org/jira/browse/PIG-2271
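
In short, the reproduction boils down to something like the following (a
condensed sketch of the scripts quoted below, with illustrative relation and
path names, not copied verbatim from the jira):

activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
grouped = GROUP activities BY sid;
grouped = FOREACH grouped GENERATE group, activities.(timestamp, name);
STORE grouped INTO 'activities_bin' USING BinStorage();
reloaded = LOAD 'activities_bin' USING BinStorage() AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
DUMP reloaded;
-- the group for sid '00' (whose 'name' field is empty in the input) comes
-- back with a null bag instead of {(1239698069000,)}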

On 09/09/11 11:36, Vincent Barat wrote:
> Issue reported:
>
> https://issues.apache.org/jira/browse/PIG-2271
>
> On 07/09/11 20:52, Kevin Burton wrote:
>> I believe that everything is a bytearray at first, but I may be wrong… at
>> least this has been the situation in my experiments.
>>
>> It is best to always specify a schema though, unless you're using Zebra,
>> which stores the schema directly (which is very handy btw).
>>
>> You could also try InterStorage (which you can use directly via the full
>> classname) as it is more efficient if I recall correctly.
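>>
>> For example, something along these lines might work (just a sketch; I
>> believe the full classname is org.apache.pig.impl.io.InterStorage, but
>> double-check it against your Pig version, and the relation/path names here
>> are placeholders):
>>
>> STORE data INTO 'intermediate' USING org.apache.pig.impl.io.InterStorage();
>> data2 = LOAD 'intermediate' USING org.apache.pig.impl.io.InterStorage();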
>>
>> While it probably would be nice for you to submit a bug, and of course you
>> can wait until it is fixed, it's probably faster for you to just work around
>> it…
>>
>> Kevin
>>
>> On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes<co...@tynt.com>  
>> wrote:
>>
>>> Hi there,
>>>
>>> I think we might be seeing something related to this problem and can
>>> confirm it's in BinStorage for us.
>>>
>>> We stored referrer_stats_by_site using BinStorage.  Here is a describe of
>>> the alias:
>>>> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,lcnt: long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}
>>>
>>> Now we try to load that data:
>>> referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
>>> referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long,
>>> tcnt:long,
>>> referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long,
>>> tcnt:long)})});
>>>
>>> but when we do, we cannot find a certain 'site'.
>>>
>>> When we don't provide the schema:
>>> referrers = LOAD 'mydata' USING BinStorage();
>>>
>>> It will load, but referrerdomain is a bytearray instead of a chararray.  Is
>>> Pig supposed to automatically cast this to a chararray for me?  Is there any
>>> reason why this data won't load unless we change the type to bytearray?
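>>>
>>> One thing we may try as a workaround (a sketch, untested, and I'm not sure
>>> it avoids the same cast code path) is to load without a schema and cast the
>>> fields explicitly in a FOREACH, using the positions from the describe above:
>>>
>>> referrers = LOAD 'mydata' USING BinStorage();
>>> typed = FOREACH referrers GENERATE (chararray)$0 AS site, $1 AS referrerdomainlist;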
>>>
>>>
>>> On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan<hashutosh@apache.org> wrote:
>>>> Vincent,
>>>>
>>>> Thanks for your hard work in isolating the bug. It's a perfect bug report.
>>>> Seems like it's a regression. Can you please open a jira with test data and
>>>> a script (which works in 0.8.1 and fails in 0.9)?
>>>>
>>>> Ashutosh
>>>>
>>>> On Wed, Sep 7, 2011 at 07:17, Vincent 
>>>> Barat<vi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I really need your help on this one! I've worked hard to isolate the
>>>>> regression.
>>>>> I'm using the 0.9.x branch (tested on 2011-09-07).
>>>>>
>>>>> I have a UDF function that takes a bag as input:
>>>>>
>>>>> public DataBag exec(Tuple input) throws IOException
>>>>> {
>>>>> /* Get the activity bag */
>>>>> DataBag activityBag = (DataBag) input.get(2);
>>>>> …
>>>>>
>>>>> My input data are read from a text file 'activity' (same issue when
>>>>> they are read from HBase):
>>>>> 00,1239698069000,        <- this is the line that is not correctly handled
>>>>> handled
>>>>> 01,1239698505000,b
>>>>> 01,1239698369000,a
>>>>> 02,1239698413000,b
>>>>> 02,1239698553000,c
>>>>> 02,1239698313000,a
>>>>> 03,1239698316000,a
>>>>> 03,1239698516000,c
>>>>> 03,1239698416000,b
>>>>> 03,1239698621000,d
>>>>> 04,1239698417000,c
>>>>>
>>>>> My first script is working correctly:
>>>>>
>>>>> activities = LOAD 'activity' USING PigStorage(',') AS 
>>>>> (sid:chararray,
>>>>> timestamp:long, name:chararray);
>>>>> activities = GROUP activities BY sid;
>>>>> activities = FOREACH activities GENERATE group,
>>>>> MyUDF(activities.(timestamp, name));
>>>>> store activities;
>>>>>
>>>>> N.B. the name of the first activity is correctly set to null 
>>>>> in my UDF
>>>>> function.
>>>>>
>>>>> The issue occurs when I store my data into a binary file and reload them
>>>>> before processing (I do this to improve the computation time, since HDFS
>>>>> is much faster than HBase).
>>>>>
>>>>> Second script, which triggers an error (this script works correctly with
>>>>> PIG 0.8.1):
>>>>>
>>>>> activities = LOAD 'activity' USING PigStorage(',') AS 
>>>>> (sid:chararray,
>>>>> timestamp:long, name:chararray);
>>>>> activities = GROUP activities BY sid;
>>>>> activities = FOREACH activities GENERATE group, 
>>>>> activities.(timestamp,
>>>>> name);
>>>>> STORE activities INTO 'activities' USING BinStorage;
>>>>> activities = LOAD 'activities' USING BinStorage AS 
>>>>> (sid:chararray,
>>>>> activities:bag { activity: (timestamp:long, name:chararray) });
>>>>> activities = FOREACH activities GENERATE sid, MyUDF(activities);
>>>>> store activities;
>>>>>
>>>>> In this script, when MyUDF is called, activityBag is null, and a warning
>>>>> is issued:
>>>>>
>>>>> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
>>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>>>>> Unable to interpret value {(1239698069000,)} in field being converted to
>>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
>>>>> null:(timestamp:long,name:chararray)> field discarded
>>>>>
>>>>> I guess that the regression is located in BinStorage.
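>>>>>
>>>>> Until this is fixed, a possible workaround I might try (a sketch, not
>>>>> verified) is to replace null names with an empty string right after the
>>>>> first LOAD, before the GROUP and STORE, so that the reloaded bag never
>>>>> contains a null field:
>>>>>
>>>>> activities = FOREACH activities GENERATE sid, timestamp, (name IS NOT NULL ? name : '') AS name;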
>>>>>
>>>>> On 30/08/11 19:13, Daniel Dai wrote:
>>>>>
>>>>>> Interesting, the log message seems clear, "Cannot convert
>>>>>> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
>>>>>> cannot find an explanation for it. I verified that such a conversion
>>>>>> should be valid on 0.9. Can you show me the script?
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat<vincent.barat@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have experienced the same issue when loading the data from raw text
>>>>>>> files (using PIG server in local mode and the regular PIG loader) and
>>>>>>> from HBaseStorage.
>>>>>>> The issue is exactly the same in both cases: each time a NULL string is
>>>>>>> encountered, the cast to a data bag cannot be done.
>>>>>>>
>>>>>>> On 29/08/11 19:12, Dmitriy Ryaboy wrote:
>>>>>>>
>>>>>>>> How are you loading this data?
>>>>>>>>
>>>>>>>> D
>>>>>>>>
>>>>>>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent
>>>>>>>> Barat<vi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm currently testing the PIG 0.9.x branch.
>>>>>>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now
>>>>>>>>> fail due to a cast error returning a null pointer in one of my UDF
>>>>>>>>> functions.
>>>>>>>>>
>>>>>>>>> Apparently, PIG seems to be unable to convert some data to a bag when
>>>>>>>>> some of the tuple fields are null:
>>>>>>>>>
>>>>>>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
>>>>>>>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
>>>>>>>>> Unable to interpret value {(1239698069000,)} in field being converted to
>>>>>>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
>>>>>>>>> null:(timestamp:long,name:chararray)> field discarded
>>>>>>>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
>>>>>>>>> job_local_0019
>>>>>>>>>
>>>>>>>>> My UDF function is:
>>>>>>>>>
>>>>>>>>>   /**
>>>>>>>>>    * ...
>>>>>>>>>    * @param start start of the session (in milliseconds since epoch)
>>>>>>>>>    * @param end end of the session (in milliseconds since epoch)
>>>>>>>>>    * @param activities a bag containing a set of activities in the
>>>>>>>>>    *          form of a set of (timestamp:long, name:chararray) tuples
>>>>>>>>>    * ...
>>>>>>>>>    */
>>>>>>>>>   public DataBag exec(Tuple input) throws IOException
>>>>>>>>>   {
>>>>>>>>>     /* Get session's start/end timestamps */
>>>>>>>>>     long startSession = (Long) input.get(0);
>>>>>>>>>     long endSession = (Long) input.get(1);
>>>>>>>>>
>>>>>>>>>     /* Get the activity bag */
>>>>>>>>>     DataBag activityBag = (DataBag) input.get(2);
>>>>>>>>>
>>>>>>>>>                                      ^  here
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is that a regression? Any idea how to fix this?
>>>>>>>>>
>>>>>>>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 
>>>>>>>>> 0.10.0 :-)
>>>>>>>>>
>>>>>>>>>
>>
>>
>

Re: PIG regression in 0.9.1's BinStorage()

Posted by Thejas Nair <th...@hortonworks.com>.
This was a bug in the cast operation that applies the schema specified in
the 2nd load statement.
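
In other words, it is the cast implied by the AS clause on the reload that
fails, i.e. this statement from the script above:

activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
activities:bag { activity: (timestamp:long, name:chararray) });

The stored data itself is fine; applying that schema to a tuple containing a
null field is what the patch fixes.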

Patch available in the jira (PIG-2271).

-Thejas



