Posted to dev@pig.apache.org by "Nezih Yigitbasi (JIRA)" <ji...@apache.org> on 2014/01/24 03:17:40 UTC

[jira] [Commented] (PIG-3628) When using UNION with 2 HbaseStorages, casting to chararray results in empty string

    [ https://issues.apache.org/jira/browse/PIG-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880638#comment-13880638 ] 

Nezih Yigitbasi commented on PIG-3628:
--------------------------------------

The reason is that HBaseStorage loads all values as DataByteArrays, even if you declare the load schema as map[chararray] (where you would expect the map elements to be chararrays). When I run the script above and debug it further, I see that SUBSTRING throws an exception (java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String), catches it, and logs a UDF_WARNING_4. Ideally the map contents would be chararrays rather than bytearrays; I think that deserves a fix.
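
To make the failure mode concrete, here is a minimal Java sketch of what I believe is happening. It is not Pig's actual code: ByteArrayValue is a hypothetical stand-in for org.apache.pig.data.DataByteArray (so the sketch runs without Pig on the classpath), and substringOrNull is a made-up helper that imitates a string UDF which casts its input to String and swallows the ClassCastException.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class MapCastSketch {
    // Hypothetical stand-in for org.apache.pig.data.DataByteArray.
    static final class ByteArrayValue {
        private final byte[] bytes;

        ByteArrayValue(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Imitates a string UDF: cast the input to String, and on a failed cast
    // return null instead of propagating the error. A null field is what
    // shows up as an empty column in the DUMP output.
    static String substringOrNull(Object value, int start, int stop) {
        try {
            String s = (String) value; // fails for ByteArrayValue
            return s.substring(start, Math.min(stop, s.length()));
        } catch (ClassCastException e) {
            return null; // the real UDF would log a warning here
        }
    }

    public static void main(String[] args) {
        // The map is declared to hold strings, but the loader put raw bytes in it.
        Map<String, Object> content = new HashMap<>();
        content.put("date_created", new ByteArrayValue("2012-01-04T11:33:59:05321"));

        // Map value is still a byte-array wrapper: the cast fails, result is null.
        System.out.println(substringOrNull(content.get("date_created"), 1, 10));

        // A real String behaves as expected.
        System.out.println(substringOrNull("2012-01-04T11:33:59:05321", 1, 10));
    }
}
```

The point of the sketch is only that the declared schema and the runtime type can disagree: the value prints fine (via toString), but anything that needs an actual String fails quietly.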

Anyway, a better workaround than dumping to a temp file is to avoid the map when loading from HBase and instead load the date_created column directly as a chararray, so that the loaded bytearrays are cast to chararrays properly. The following script works:

hbs1 = LOAD 'hbase://test_table1'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
               'f:date_created','-loadKey true')
               AS ( id:bytearray, date_created:chararray);
               
hbs2 = LOAD 'hbase://test_table2'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
               'f:date_created','-loadKey true')
               AS ( id:bytearray, date_created:chararray);

hbs3 = UNION hbs1, hbs2;

hbs5 = FOREACH hbs3 GENERATE id, date_created, SUBSTRING(date_created, 1, 10) AS date_created_trunc;

DUMP hbs5;

You get the output:
(2-1386066912074,2012-01-04T11:33:59:05321,012-01-04)
(1-1386066912072,2012-01-04T11:33:59:05321,012-01-04)
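
Note that the truncated column above reads 012-01-04, not the 2012-01-04 from the expected result in the issue. As far as I can tell this is just the start index: Pig documents SUBSTRING as taking a zero-based startIndex and a stopIndex that points one past the last character, which matches Java's String.substring. A small Java sketch of the index arithmetic (slice is a made-up helper, not a Pig function):

```java
class SubstringIndexSketch {
    // Zero-based start, exclusive stop, with stop clamped to the string length.
    static String slice(String s, int start, int stop) {
        return s.substring(start, Math.min(stop, s.length()));
    }

    public static void main(String[] args) {
        String dateCreated = "2012-01-04T11:33:59:05321";
        System.out.println(slice(dateCreated, 1, 10)); // 012-01-04 (skips the first character)
        System.out.println(slice(dateCreated, 0, 10)); // 2012-01-04
    }
}
```

So, assuming the same semantics, SUBSTRING(date_created, 0, 10) in the script should give the full date.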



> When using UNION with 2 HbaseStorages, casting to chararray results in empty string
> -----------------------------------------------------------------------------------
>
>                 Key: PIG-3628
>                 URL: https://issues.apache.org/jira/browse/PIG-3628
>             Project: Pig
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.11
>         Environment: CDH5, Centos 6
>            Reporter: Jochem van Grondelle
>            Priority: Minor
>
> Hi,
> We stumbled upon the following issue. I am wondering if anyone can help us with it. I am available for any follow up questions. Unfortunately, I am not a Java programmer, so I cannot supply a fix if this actually is a bug.
> It seems that the following issue is specific to the HbaseLoader, but I am not sure. When using any other loaders (two times PigStorage), the problem doesn't exist. 
> It seems that even when we specify 'content:map[chararray]' when loading data from HBase, and Pig says the schema contains chararrays, those fields may still be bytearrays in the background that cannot be converted.
> First create 2 Hbase tables:
> {code}
> --hbase shell
> --
> --hbase(main):001:0> create 'test_table1','f'
> --0 row(s) in 20.0530 seconds
> --
> --hbase(main):002:0> create 'test_table2', 'f'
> --0 row(s) in 1.4420 seconds
> --
> --hbase(main):008:0> put 'test_table1','1-1386066912072','f:date_created','2012-01-04T11:33:59:05321'
> --0 row(s) in 5.3380 seconds
> --
> --hbase(main):002:0> put 'test_table2','2-1386066912074','f:date_created','2012-01-04T11:33:59:05321'
> --0 row(s) in 0.0540 seconds
> --
> --
> --hbase(main):003:0> quit
> {code}
> -- Then run the following Pig script:
> {code}
> hbs1 = LOAD 'hbase://test_table1'
>         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>                'f:*','-loadKey true')
>                AS ( id:bytearray, content:map[chararray]);
>                
> hbs2 = LOAD 'hbase://test_table2'
>         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>                'f:*','-loadKey true')
>                AS ( id:bytearray, content:map[chararray]);
> hbs3 = UNION hbs1, hbs2;
> hbs4 = FOREACH hbs3
> GENERATE        id as hbase_id               
>                , flatten(content#'date_created') as date_created                   
>                ;   
> hbs5 = FOREACH hbs4
> GENERATE        hbase_id   
>               , date_created  --without (chararray)           
>               ,  SUBSTRING( date_created,1,10) as date_created_trunc              
>             ;
>               
> DUMP hbs5;
> {code}
> *Result*
> {code}
> (2-1386066912074,2012-01-04T11:33:59:05321,)
> (1-1386066912072,2012-01-04T11:33:59:05321,)
> {code}
> *Expected result*
> {code}
> (2-1386066912074,2012-01-04T11:33:59:05321,2012-01-04)
> (1-1386066912072,2012-01-04T11:33:59:05321,2012-01-04)
> {code}
> The Substring function in combination with the date_created is just for example purposes. There are several String functions that we want to be able to use.


