Posted to user@hbase.apache.org by Krishna Kalyan <kr...@gmail.com> on 2014/12/29 08:11:48 UTC

Incorrect Dump using HBase Storage Class

Hi,
Happy holidays :).
I have two different Pig scripts containing the statements below.
(1)
GeoRef_IP = LOAD '$TBL_GEOGRAPHY' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:cq_geog_id
cf_data:cq_pc_sector cf_data:cq_district_code cf_data:cq_postal_town
cf_data:cq_postal_county cf_data:cq_mosaic_code cf_data:cq_mosaic_code_desc
cf_data:cq_mosaic_group cf_data:cq_sales_territory cf_data:cq_sales_area
cf_data:cq_sales_region cf_data:cq_dqtimestamp cf_data:cq_checkarray',
'-loadKey true');

and
(2)
GeoRef_IP = LOAD '$TBL_GEOGRAPHY' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:cq_geog_id
cf_data:cq_pc_sector cf_data:cq_district_code cf_data:cq_postal_town
cf_data:cq_postal_county cf_data:cq_mosaic_code cf_data:cq_mosaic_code_desc
cf_data:cq_mosaic_group cf_data:cq_sales_territory cf_data:cq_sales_area
cf_data:cq_sales_region cf_data:cq_dqtimestamp cf_data:cq_checkarray',
'-loadKey true') as
(postcode,geog_id,pc_sector,district_code,postal_town,postal_county,mosaic_code,mosaic_code_desc,mosaic_group,sales_territory,sales_area,sales_region,dqtimestamp,checkarray);

The only difference is the AS clause.

For example, a FOREACH that projects $0, $4 and $5, followed by a DUMP,
gives different results for statements 1 and 2. The output of statement 1
is the correct one.

Has anyone seen this behavior before?

Regards,
Krishna

Streaming.XMLLoader not working on PIG

Posted by harry Shah <hr...@hotmail.com>.
 
 Hi, I am new to Pig scripting.

 I am trying to parse XML values with a Pig script, but I get this error:

 ERROR 1070: Could not resolve
org.apache.pig.piggybank.storage.StreamingXMLLoader using imports: [,
java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

My XML file is:

<?xml version="1.0" encoding="ISO-8859-1"?> 
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>

<current_observation version="1.0"
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
	 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	 
	 xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
	<credit>NOAA's National Weather Service</credit>
	<location>Unknown Station</location>
	<station_id>SH007</station_id>
        <temperature_string>32.0 F (0.0 C)</temperature_string>
	<temp_f>32.0</temp_f>
	<temp_c>0.0</temp_c>
	<water_temp_f>32.0</water_temp_f>
	<water_temp_c>0.0</water_temp_c>
	<wind_string>Calm</wind_string>
	<wind_dir>North</wind_dir>
	<wind_degrees>0</wind_degrees>
	<wind_mph>0.0</wind_mph>
	<wind_gust_mph>0.0</wind_gust_mph>
	<pressure_string>1019.0 mb</pressure_string>
	<privacy_policy_url>http://weather.gov/notice.html</privacy_policy_url>
</current_observation>

**************************************************************************
I want to extract location, station_id, temp_c and wind_dir. I tried writing
three Pig scripts. The first two run but produce no output; the third gives
the error above.
I am using Hadoop version 2.5 and Pig version 0.14.

I think the problem is with the root element, which also carries attributes.
Please suggest what I should do about this.
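
As a sanity check outside Pig, the sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the sample file to pull out the four values in question. The name extract_fields and the trimmed SAMPLE are illustrative assumptions, not part of the original scripts. It only shows that attributes on the root element do not by themselves stop a standard XML parser from reading the child elements; it says nothing about how the piggybank loaders behave.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (assumption: only the root element
# and the four fields of interest are kept).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<current_observation version="1.0"
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
	 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	 xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
	<location>Unknown Station</location>
	<station_id>SH007</station_id>
	<temp_c>0.0</temp_c>
	<wind_dir>North</wind_dir>
</current_observation>"""

FIELDS = ("location", "station_id", "temp_c", "wind_dir")


def extract_fields(xml_bytes, fields=FIELDS):
    """Return {field: text} for direct children of the root element.

    A field that is absent from the document maps to None.
    """
    root = ET.fromstring(xml_bytes)
    return {name: root.findtext(name) for name in fields}


record = extract_fields(SAMPLE)
print(record)
```

If the values printed here are the ones expected from Pig, the element names are right and the remaining question is how the file is being split and loaded.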
  
My Pig scripts are below (I tried three of them):

1. 



REGISTER /home/hduser/Desktop/apache_pig/lib/piggybank.jar;

A = LOAD '/demo1/SH007.xml' USING 
org.apache.pig.piggybank.storage.XMLLoader('location') AS (x:chararray);

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<location>(.*)</location>\\s*<temp_c>(.*)</temp_c>\\s*<pressure_string>(.*)</pressure_string>'))
AS (location:chararray,temp_c:int,pressure_string:chararray);

dump B;
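
One thing worth checking in script 1 is whether the pattern can match the file at all. The sketch below is in Python rather than Pig and works on a trimmed copy of the sample (an assumption for illustration). It applies the same pattern, which allows only whitespace between </location> and <temp_c>, while the file has other elements in between, so nothing matches. RELAXED is a hypothetical alternative and not a tested Pig fix; Pig uses Java's regex engine, where behavior may differ in detail.

```python
import re

# Trimmed copy of the sample (assumption): element order matches the file,
# with other elements sitting between the three that the pattern names.
SAMPLE = """<location>Unknown Station</location>
<station_id>SH007</station_id>
<temp_c>0.0</temp_c>
<wind_dir>North</wind_dir>
<pressure_string>1019.0 mb</pressure_string>"""

# Same pattern as script 1: only whitespace is allowed between the elements.
STRICT = (r"<location>(.*)</location>\s*<temp_c>(.*)</temp_c>"
          r"\s*<pressure_string>(.*)</pressure_string>")

# Hypothetical relaxed pattern: (?s) lets '.' cross newlines, and the lazy
# '.*?' gaps skip whatever elements lie between the three of interest.
RELAXED = (r"(?s)<location>(.*?)</location>.*?<temp_c>(.*?)</temp_c>"
           r".*?<pressure_string>(.*?)</pressure_string>")

strict_match = re.search(STRICT, SAMPLE)
relaxed_match = re.search(RELAXED, SAMPLE)

print(strict_match)            # None
print(relaxed_match.groups())  # ('Unknown Station', '0.0', '1019.0 mb')
```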



2.


REGISTER /home/hduser/Desktop/apache_pig/lib/piggybank.jar;

DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
 

A = LOAD '/demo1/SH007.xml' using 
org.apache.pig.piggybank.storage.XMLLoader('current_observation') as 
(x:chararray);
 
B = FOREACH A GENERATE XPath(x, 'current_observation/location'), XPath(x, 
'current_observation/temp_c');

dump B;



3.



REGISTER /home/hduser/Desktop/pig-0.14.0/lib/piggybank.jar;

data = LOAD '/demo1/SH007.xml'
       USING org.apache.pig.piggybank.storage.StreamingXMLLoader(
          'current_observation',
          'location'
       ) AS (
           location:    {(attr:map[], content:chararray)}
       );

dump data;


Please help.

Thank you 
Harry


Re: Incorrect Dump using HBase Storage Class

Posted by Ted Yu <yu...@gmail.com>.
Can you pastebin the output of both queries?

What version of HBase are you using?

Cheers
