Posted to user@pig.apache.org by "HAJIHASHEMI, ZAHRA (AG/1000)" <za...@monsanto.com> on 2012/09/24 16:30:53 UTC

RE: Pig help

Hi,



I have a text file that holds the data for my HBase table. It is comma separated: the first field is the attribute name (which I want to use as the column qualifier) and the second is the attribute value. The file looks like this:

COMMON_NAME,corn

SCIENTIFIC_NAME,Zea mays

GENETIC_BACKGROUND,LH244

TISSUE,tassel

DEV_STAGE,V7-V8

TREATMENT,"Microspore mothercell stage (V7-V8), <0.5in"

ECTOPIC_TYPE,



I want to load this file and store it in the HBase table. The table schema is discovery_rnaseq_library (A: attribute_name, value: attribute_value).

Here is my pig script:


library_tag = LOAD '/my_path/345_lib_description.txt' USING PigStorage(',') AS (tag:chararray, value:chararray);
library_id = LOAD 'discovery_rnaseq_library' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:organism' ,'-loadKey true') as (id:int, name:chararray);
grpd = group library_id all;
data_id = foreach grpd generate ((MAX(library_id.id))+1) as id;
finalData = CROSS data_id, library_tag;
STORE finalData INTO 'hbase://library' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:tag A:value');



I also scan the table to get the max id and use it to insert the new record.



The problem is that every record is inserted with "tag" as the column qualifier. Here is what I get after running this Pig script:

COLUMN                               CELL

 A:tag                               timestamp=1348451755196, value=REPLICATE_NUMBER

 A:value                             timestamp=1348451755196, value=1

Whereas I want it to be something like this:
COLUMN                               CELL
A:COMMON_NAME       timestamp=1348451755196, value=corn

A:SCIENTIFIC_NAME   timestamp=1348451755196, value=Zea mays

...


I would highly appreciate any comments.

Thanks!

-Zara


Re: Pig help

Posted by Bill Graham <bi...@gmail.com>.
HBaseStorage writes to the column descriptors specified in the constructor
and in your case you're telling it to use 'A:tag A:value'. If you want to
write to other columns you need to statically define them there.

If you want to use dynamic column names, you could subclass HBaseStorage
and re-implement the part where the HBase Put happens, so that it uses your
tag value as the column descriptor instead of a static column list.
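
To make the difference concrete, here is a minimal sketch in plain Python
(not Pig or HBase client code) of the two layouts discussed above: the static
column list 'A:tag A:value' versus using each tag as the column name. The
helper names static_cells, dynamic_cells and parse_records are hypothetical
and only model the data layout; getting this behaviour out of Pig would still
need the HBaseStorage subclass described above. The sketch also assumes that
the quoted comma in the TREATMENT line should stay inside the value, which is
why it parses the input with Python's csv module.

```python
import csv
import io

FAMILY = "A"  # column family used in the thread


def static_cells(tag, value, family=FAMILY):
    """Model a fixed column list 'A:tag A:value': every record lands in
    the same two columns, so the tag is stored as a cell value."""
    return {f"{family}:tag": tag, f"{family}:value": value}


def dynamic_cells(records, family=FAMILY):
    """Model the desired layout: each tag becomes the column qualifier,
    so all attributes of one library fit in a single row."""
    return {f"{family}:{tag}": value for tag, value in records}


def parse_records(text):
    """Parse 'name,value' lines. csv.reader keeps a quoted comma inside
    the value; blank lines are skipped and a missing value becomes ''."""
    rows = csv.reader(io.StringIO(text))
    return [(row[0], row[1] if len(row) > 1 else "") for row in rows if row]


# A few lines from the sample file in the original message.
sample = (
    "COMMON_NAME,corn\n"
    "\n"
    "SCIENTIFIC_NAME,Zea mays\n"
    'TREATMENT,"Microspore mothercell stage (V7-V8), <0.5in"\n'
    "ECTOPIC_TYPE,\n"
)

records = parse_records(sample)
cells = dynamic_cells(records)

for column, value in cells.items():
    print(column, "value=" + value)
```

With the static layout, each record overwrites the same two cells when the
row key is shared, which matches the single A:tag / A:value pair seen in the
scan output. With the dynamic layout, one row carries one cell per attribute.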



