Posted to dev@pig.apache.org by "Bill Graham (JIRA)" <ji...@apache.org> on 2012/09/26 23:33:07 UTC

[jira] [Commented] (PIG-1797) Problems when applying FOREACH ... GENERATE on data loaded from HBase

    [ https://issues.apache.org/jira/browse/PIG-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464185#comment-13464185 ] 

Bill Graham commented on PIG-1797:
----------------------------------

Ping. Eduardo, is this still a problem with Pig 0.10?
                
> Problems when applying FOREACH ... GENERATE on data loaded from HBase
> ---------------------------------------------------------------------
>
>                 Key: PIG-1797
>                 URL: https://issues.apache.org/jira/browse/PIG-1797
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>         Environment: Our environment consists of Hadoop 0.20.2, HBase 0.20.6, ZooKeeper 3.3.2 and Pig 0.8.0. They are configured to run as a pseudo-distributed system.
>            Reporter: Eduardo Galán Herrero
>            Assignee: Dmitriy V. Ryaboy
>              Labels: hbase
>         Attachments: pig-error2017.log.txt
>
>
> We defined a table in HBase and populated it with some data:
> create 'tests', {NAME => 'age'}, {NAME => 'colour'}
> put 'tests', 'one', 'age', '22'
> put 'tests', 'one', 'colour', 'green'
> put 'tests', 'another', 'age', '439'
> put 'tests', 'another', 'colour', 'red'
> put 'tests', 'more', 'colour', 'grey'
> scan 'tests'                         
> ROW                          COLUMN+CELL                                                                      
>  another                     column=age:, timestamp=1294745175613, value=439                                  
>  another                     column=colour:, timestamp=1294745155873, value=red                               
>  more                        column=colour:, timestamp=1294745185331, value=grey                              
>  one                         column=age:, timestamp=1294745127129, value=22                                   
>  one                         column=colour:, timestamp=1294745144160, value=green
> We are using Pig in mapreduce mode to load the data from HBase (also retrieving the row key):
> > DATA = LOAD 'hbase://tests' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('age: colour:', '-loadKey') AS (row:chararray,age:int,colour:chararray);
> We make sure that the data has been loaded correctly:
> > dump DATA;
> (another,439,red)
> (more,,grey)
> (one,22,green)
> > describe DATA;
> DATA: {row: chararray,age: int,colour: chararray}
> We get correct results when the "FOREACH ... GENERATE" statement uses all the columns ($0, $1 and $2) that were loaded:
> > b= FOREACH DATA GENERATE $0, $1, $2;
> > dump b;
> (another,439,red)
> (more,,grey)
> (one,22,green)
> regardless of the column order:
> > c= FOREACH DATA GENERATE $2, $0, $1;
> > dump c;
> (red,another,439)
> (grey,more,)
> (green,one,22)
> But if we omit a column from the "FOREACH ... GENERATE" statement (in this example, $2), we get wrong results:
> > d= FOREACH DATA GENERATE $0, $1;
> > dump d;
> (another,)
> (more,)
> (one,)
> > describe d;                     
> d: {row: chararray,age: int}
> Here is another example of the bug:
> > e= FOREACH DATA GENERATE $1, $2;
> > dump e;
> (,439)
> (,)
> (,22)
> > describe e;
> e: {age: int,colour: chararray}
> Here is one more example of the bug:
> > f= FOREACH DATA GENERATE $0, $2;
> > dump f;
> (another,another)
> (more,more)
> (one,one)
> > describe f;
> f: {row: chararray,colour: chararray}
> Regards,
> Eduardo Galan Herrero
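
For anyone re-testing this on a newer Pig release, it may help to have the expected output of the three failing projections written down next to the observed one. The sketch below is plain Python, not Pig: it takes the tuples exactly as shown by `dump DATA` in the report and applies simple positional projection. The `project` helper is hypothetical and only used for illustration, and `None` stands in for the missing age of row 'more'.

```python
# Tuples as shown by `dump DATA` in the report; None stands for the missing age.
DATA = [
    ("another", 439, "red"),
    ("more", None, "grey"),
    ("one", 22, "green"),
]


def project(rows, *positions):
    """Hypothetical helper: keep only the fields at the given positions, in order."""
    return [tuple(row[i] for i in positions) for row in rows]


# What each FOREACH ... GENERATE from the report should return.
expected_d = project(DATA, 0, 1)  # FOREACH DATA GENERATE $0, $1
expected_e = project(DATA, 1, 2)  # FOREACH DATA GENERATE $1, $2
expected_f = project(DATA, 0, 2)  # FOREACH DATA GENERATE $0, $2

for name, rows in (("d", expected_d), ("e", expected_e), ("f", expected_f)):
    print(name, rows)
```

Compared with the dumps in the report, d is missing its age values, e shows the ages in the second position with an empty first field, and f repeats the row key where the colour should be.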
