You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Edmund Day <ed...@yahoo.com> on 2014/04/21 23:14:36 UTC

XMLLoader not working

When I run the script below I get lots of '()' output. Can anyone guide me why I get no data in B (PIg version=0.12.1 and A dumps OK)
TIA!!!!



A = load 'hdfs:///user/hduser/smsCorpus_en_2012.04.30_all.xml' using org.apache.pig.piggybank.storage.XMLLoader('message') 
    as (x:chararray);
describe A;


B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, '<message>\\n\\s*<text>(.*)</text>\\n\\s*    <source>\\n\\s*<srcNumber>(.*)</srcNumber>\\n\\s*<phoneModel (.*)/>\\n\\s*<userProfile>\\n\\s*<userID>(.*)</userID>\\n\\s*<age>(.*)</age>\\n\\s*<gender>(.*)</gender>\\n\\s*<nativeSpeaker>(.*)</nativeSpeaker>\\n\\s*<country>(.*)</country>\\n\\s*<city>(.*)</city>\\n\\s*<experience>(.*)</experience>\\n\\s*<frequency>(.*)</frequency>\\n\\s*<inputMethod>(.*)</inputMethod>\\n\\s*</userProfile>\\n\\s*</source>\\n\\s*<destination (.*)>\\n\\s*<destNumber>(.*)</destNumber>\\n\\s*</destination>\\n\\s*<messageProfile (.*)/>\\n\\s*<collectionMethod (.*)/>\\n\\s*</message>')) 
as (SMStext:chararray, srcNumber:chararray, phoneModel:chararray,
    userID:chararray, age:chararray, gender:chararray, nativeSpeaker:chararray, 
    country:chararray, city:chararray, experience:chararray, frequency:chararray, 
    inputMethod:chararray, destination:chararray, destNumber:chararray, messageProfile:chararray, collectionMethod:chararray);
     
describe B;
dump B;
    
    /* EXAMPLE DATA FROM NUS SMS CORPUS 

<message id="1">
      <text>K</text>
      <source>
    <srcNumber>79780a9dbe83fd1e5dd2bd2543e7da2a</srcNumber>
    <phoneModel manufactuer="Nokia" smartphone="unknown"/>
    <userProfile>
      <userID>79780a9dbe83fd1e5dd2bd2543e7da2a</userID>
      <age>21-25</age>
      <gender>unknown</gender>
      <nativeSpeaker>yes</nativeSpeaker>
      <country>India</country>
      <city>Tiruppur</city>
      <experience>3 to 5 years</experience>
      <frequency>More than 50 SMS daily</frequency>
      <inputMethod>Multi-tap</inputMethod>
    </userProfile>
      </source>
      <destination country="unknown">
    <destNumber>0ffc7585148560b7520931d354c00a9b</destNumber>
      </destination>
      <messageProfile language="en" time="2010.10.24 11:59" type="send"/>
      <collectionMethod collector="Tao Chen" method="SMS Export" time="2010/11"/>
    </message>

    */

Re: XMLLoader not working

Posted by Cheolsoo Park <pi...@gmail.com>.
I haven't used XMLLoader myself, so I can't give you help. But someone
recently completely rewrote it in trunk-
https://issues.apache.org/jira/browse/PIG-3865

Can you try to build piggybank.jar from trunk? ant clean piggybank.


On Mon, Apr 21, 2014 at 2:14 PM, Edmund Day <ed...@yahoo.com> wrote:

> When I run the script below I get lots of '()' output. Can anyone guide me
> why I get no data in B (PIg version=0.12.1 and A dumps OK)
> TIA!!!!
>
>
>
> A = load 'hdfs:///user/hduser/smsCorpus_en_2012.04.30_all.xml' using
> org.apache.pig.piggybank.storage.XMLLoader('message')
>     as (x:chararray);
> describe A;
>
>
> B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
> '<message>\\n\\s*<text>(.*)</text>\\n\\s*
> <source>\\n\\s*<srcNumber>(.*)</srcNumber>\\n\\s*<phoneModel
> (.*)/>\\n\\s*<userProfile>\\n\\s*<userID>(.*)</userID>\\n\\s*<age>(.*)</age>\\n\\s*<gender>(.*)</gender>\\n\\s*<nativeSpeaker>(.*)</nativeSpeaker>\\n\\s*<country>(.*)</country>\\n\\s*<city>(.*)</city>\\n\\s*<experience>(.*)</experience>\\n\\s*<frequency>(.*)</frequency>\\n\\s*<inputMethod>(.*)</inputMethod>\\n\\s*</userProfile>\\n\\s*</source>\\n\\s*<destination
> (.*)>\\n\\s*<destNumber>(.*)</destNumber>\\n\\s*</destination>\\n\\s*<messageProfile
> (.*)/>\\n\\s*<collectionMethod (.*)/>\\n\\s*</message>'))
> as (SMStext:chararray, srcNumber:chararray, phoneModel:chararray,
>     userID:chararray, age:chararray, gender:chararray,
> nativeSpeaker:chararray,
>     country:chararray, city:chararray, experience:chararray,
> frequency:chararray,
>     inputMethod:chararray, destination:chararray, destNumber:chararray,
> messageProfile:chararray, collectionMethod:chararray);
>
> describe B;
> dump B;
>
>     /* EXAMPLE DATA FROM NUS SMS CORPUS
>
> <message id="1">
>       <text>K</text>
>       <source>
>     <srcNumber>79780a9dbe83fd1e5dd2bd2543e7da2a</srcNumber>
>     <phoneModel manufactuer="Nokia" smartphone="unknown"/>
>     <userProfile>
>       <userID>79780a9dbe83fd1e5dd2bd2543e7da2a</userID>
>       <age>21-25</age>
>       <gender>unknown</gender>
>       <nativeSpeaker>yes</nativeSpeaker>
>       <country>India</country>
>       <city>Tiruppur</city>
>       <experience>3 to 5 years</experience>
>       <frequency>More than 50 SMS daily</frequency>
>       <inputMethod>Multi-tap</inputMethod>
>     </userProfile>
>       </source>
>       <destination country="unknown">
>     <destNumber>0ffc7585148560b7520931d354c00a9b</destNumber>
>       </destination>
>       <messageProfile language="en" time="2010.10.24 11:59" type="send"/>
>       <collectionMethod collector="Tao Chen" method="SMS Export"
> time="2010/11"/>
>     </message>
>
>     */