Posted to user@pig.apache.org by Sameer Tilak <ss...@live.com> on 2013/11/05 01:26:48 UTC

Java UDF and incompatible schema

Hi everyone,

I have written my custom parser, and since my files are small I am using a sequence file for efficiency. Each file in the sequence file has info about one user; I am parsing that file and I would like to get a bag of tuples for every user/file. In my Parser class I have implemented the exec function that will be called for each file/user. I then gather the info and package it as tuples. Each user will generate multiple tuples since the file is quite rich and complex. Is it correct to assume that the relation AU will contain one bag per user?

When I execute the following script, I get the following error. Any help with this would be great!
ERROR 1031: Incompatable field schema: declared is 
"bag_0:bag{:tuple(id:int,class:chararray,name:chararray,begin:int,end:int,probone:chararray,probtwo:chararray)}",
 infered is ":Unknown"


Java UDF code snippet

    private void PopulateBag()
    {
        for (MyItems item : items)
        {
            Tuple output = TupleFactory.getInstance().newTuple(7);

            output.set(0, item.getId());
            // Note: getClass() resolves to Object.getClass(), which returns a
            // java.lang.Class object, not the chararray the schema declares.
            output.set(1, item.getClass());
            output.set(2, item.getName());
            output.set(3, item.Begin());
            output.set(4, item.End());
            output.set(5, item.Probabilityone());
            output.set(6, item.Probtwo());

            m_defaultDataBag.add(output);
        }
    }

    public DefaultDataBag exec(Tuple input) throws IOException {
        try {
            this.ParseFile((String) input.get(0));
            this.PopulateBag();
            return m_defaultDataBag;
        } catch (Exception e) {
            System.err.println("Failed to process the input");
            return null;
        }
    }


Pig Script

REGISTER /users/p529444/software/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
REGISTER /users/p529444/software/pig-0.11.1/parser.jar;

DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

A = LOAD '/scratch/file.seq' USING SequenceFileLoader AS (key: chararray, value: chararray);
DESCRIBE A;
STORE A into '/scratch/A';

AU = FOREACH A GENERATE parser.Parser(key) AS {(id: int, class: chararray, name: chararray, begin: int, end: int, probone: chararray, probtwo: chararray)};





RE: Java UDF and incompatible schema

Posted by Sameer Tilak <ss...@live.com>.
Hi Pradeep,
Yes, I implemented the outputSchema method and it fixed that issue. 

We are also planning to evaluate storing intermediate and final results in Cassandra.



Re: Java UDF and incompatible schema

Posted by Pradeep Gollakota <pr...@gmail.com>.
This is most likely because you haven't defined the outputSchema method of
the UDF. The AS keyword merges the schema generated by the UDF with the
user specified schema. If the UDF does not override the method and specify
the output schema, it is considered null and you will not be able to use AS
to override the schema.
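A minimal sketch of such an override, assuming the UDF extends EvalFunc<DataBag> and mirroring the field names declared in the script (the outer alias "parsed" is an arbitrary choice here):

```java
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch only: declares a bag of 7-field tuples matching the fields
// the exec() method populates, so Pig can merge it with the AS clause.
@Override
public Schema outputSchema(Schema input) {
    try {
        Schema tupleSchema = new Schema();
        tupleSchema.add(new Schema.FieldSchema("id", DataType.INTEGER));
        tupleSchema.add(new Schema.FieldSchema("class", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("begin", DataType.INTEGER));
        tupleSchema.add(new Schema.FieldSchema("end", DataType.INTEGER));
        tupleSchema.add(new Schema.FieldSchema("probone", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("probtwo", DataType.CHARARRAY));
        // Wrap the tuple schema in a BAG-typed field.
        return new Schema(
            new Schema.FieldSchema("parsed", tupleSchema, DataType.BAG));
    } catch (FrontendException e) {
        throw new RuntimeException(e);
    }
}
```

With the schema declared by the UDF itself, the AS clause in the FOREACH becomes optional.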

Out of curiosity, if each one of your small files describes a user, is
there any reason why you can't use a database (e.g. HBase) to store this
information? It seems like any file based storage may not be the best
solution given my extremely limited knowledge of your problem domain.

